International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(1): 59-68
Published online March 25, 2020
https://doi.org/10.5391/IJFIS.2020.20.1.59
© The Korean Institute of Intelligent Systems
Seoung-Ho Choi1 and Sung Hoon Jung2
1Department of Electronics and Information Engineering, Hansung University, Seoul, Korea
2Division of Mechanical and Electronics Engineering, Hansung University, Seoul, Korea
Correspondence to :
Sung Hoon Jung (shjung@hansung.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Acquisition of fine-grained segments is important in most semantic segmentation applications, especially for clothing images composed of fine-grained textures. However, most existing semantic segmentation methods based on the fully convolutional network (FCN) are insufficient for acquiring fine-grained segments because they operate at a single resolution and cannot distinguish well between objects in the images. To stabilize the acquisition of fine-grained segments, we propose a method that adds two components to the U-Net structure for processing multi-scale fine-grained segments. The first component is normalization at all layers. We found from experiments that normalization is a key process in stabilizing the acquisition of fine-grained segments, especially in U-Net based methods, because they operate on multi-scale fine-grained segments. The second component is model prediction correction using focal loss with L1 regularization. Focal loss can be used to control the model prediction term as regularization in the training process. From experiments, we found that our method performed better than the existing methods.
Keywords: Multi-scale segments, Batch normalization, Model prediction correction
Most images contain content of various sizes, shapes, and textures. Image processing methods must therefore operate appropriately on images with varied characteristics. Since the textures of clothing images are very fine-grained and diverse, the acquisition of fine-grained segments is particularly important. Ultimately, obtaining accurate semantic segments of images is an important goal of semantic segmentation research [1].
However, few studies explore the acquisition of fine-grained segments in semantic segmentation of images. Most existing research on image semantic segmentation has adopted the fully convolutional network (FCN) [2, 3]. Their segmentation results were not good at finding fine-grained semantic segments because the convolution of FCN [2] did not maintain spatial information. Even though there were related works that aimed to improve the performance of FCN, such as fully convolutional network-conditional random field (FCN-CRF) [3], FCN [2] using atrous convolution, and FCN using skip connections in the upsampling process, most FCN variants showed poor results. In these methods, obtaining segments of various sizes is not reflected in model training. Therefore, obtaining fine-grained segments of various sizes is a crucial factor in segmentation for the acquisition of fine-grained segments in images.
To overcome this problem, we add two components to a U-Net [4] based structure for the acquisition of fine-grained segments in semantic segmentation. The first component is the use of normalization at all layers in the training process. Normalization alleviates the variation of multi-scale information, especially in the multi-scale processing of U-Net. We applied the well-known batch normalization (BN) [5] in all U-Net layers. As the second component, we apply model prediction correction using focal loss [6] with L1 [7] regularization in training. The existing cross-entropy loss does not reflect fine-grained segments because it does not balance the model prediction space. Based on this observation, we consider model prediction correction for fine-grained segments and propose a novel loss composed of focal loss [6] with L1 regularization [7]. As a result, we introduce a novel structure based on U-Net [4] that is trained with BN and the devised loss.
To measure the performance of our method and previous methods, we experimented on the ATR dataset [9] with three models, U-Net [4], Attention U-Net [8], and U-Net BN, in addition to the existing FCN [2] model. U-Net BN refers to a model that applies BN to all layers of the existing U-Net. We also tested our method with various combinations of loss and report the results using intersection over union (IOU), precision, and recall. From extensive experiments, we find that our method with the two additional components generates correct fine-grained segments, especially for small and complex textures such as hands, feet, and glasses. The overall performance of our method was also considerably better than that of the existing FCN-based methods in terms of IOU, precision, and recall.
The rest of the paper is organized as follows. In Section 2, we introduce related work. The U-Net structure with the two additional components, BN and the novel combined focal loss with L1 regularization, is described in Section 3. Section 4 presents the experimental results and discussion. We conclude in Section 5 with future work.
Segmentation models are divided into semantic segmentation [2, 4] and instance segmentation [10]. Semantic segmentation [2–4, 8, 11] assigns the same label to semantically identical objects, while instance segmentation assigns a new label to each instance. Since the aim of this paper is the acquisition of fine-grained segments in semantic segmentation, we focus on semantic segmentation. In semantic segmentation, FCN [2] is a conventional method. Although FCN has been used in semantic segmentation, it has the limitation that it can only extract features without maintaining the shapes in images because it uses a single resolution. To address this limitation, Hao and Kang [12] proposed a method for extracting multi-feature maps using pyramid pooling based on FCN-VGG for semantic segmentation. They confirmed that the improved features enhanced performance through fusion with the extracted features.
However, their method has another limitation: since the pyramid pooling is performed after the last layer of FCN-VGG, only abstract features are used for pyramid pooling. As a result, deconvolution of these abstract features in the deconvolution layers generated poor segments with much distortion. There is a lot of research on segmentation without distortion, especially for medical images [4, 13, 14]. Although these methods showed good segmentation results with less distortion, they were also limited in finding fine-grained segments. In other words, these methods did not find fine-grained segments, although their distortion was small.
It is well known that multi-scale processing helps segmentation algorithms and that a hierarchical structure can implement multi-scale processing. Salembier [15] showed that morphology could be efficiently differentiated using a hierarchical structure. Wang et al. [16] proposed spawning and using low-rank subspace features from the hierarchical structure. There is no previous research on how multi-scale processing affects the acquisition of fine-grained segments in semantic segmentation. We propose a new method that uses two additional components to improve the performance of U-Net.
We propose a U-Net [4] based semantic segmentation method with two additional components for the acquisition of multi-scale fine-grained segments. U-Net is a well-known model that can find fine-grained information by efficiently reflecting multi-scale information. However, semantic segmentation using only U-Net does not find fine-grained segments, especially for images composed of various shapes and textures. To overcome this limitation, we adopt two additional components: BN, to efficiently learn stabilized multi-scale fine-grained segments, and model prediction correction using focal loss with L1 regularization. We found from our extensive experiments that BN was a crucial process in finding multi-scale fine-grained segments because normalizing the data in all layers is important for precisely calculating fine-grained segments from multi-scale data. We also combine focal loss and L1 regularization to balance the model prediction correction, which enables the segmentation to stabilize the acquired fine-grained segments in multi-scale data. We confirmed these results through extensive experimentation in Section 4.
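The stabilizing role of normalization described above can be illustrated with a minimal sketch of the batch normalization forward pass for a single channel. The function name `batch_norm`, the values of `gamma`, `beta`, and `eps`, and the toy activations are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations across a mini-batch.

    gamma, beta, and eps are illustrative defaults, not values from the paper.
    """
    n = len(values)
    mean = sum(values) / n
    # Biased (population) variance, as used by training-mode batch normalization.
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical activations of the same channel at two very different scales,
# with a mini-batch of 4 samples each.
coarse = [10.0, 20.0, 30.0, 40.0]
fine = [0.1, 0.2, 0.3, 0.4]

print(batch_norm(coarse))
print(batch_norm(fine))
```

After normalization the two batches have nearly identical values even though their raw magnitudes differ by a factor of 100, which is one way to read the claim that normalization reduces the variation of multi-scale information.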
Figure 1 shows the structure of our U-Net based semantic segmentation. The overall structure is nearly the same as the original U-Net except for the BN of all layers, depicted as yellow arrows, and the model prediction correction using focal loss [6] with L1 [7] regularization.
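A minimal sketch of a combined loss of this kind is given below. The paper does not state the exact formula, so several things here are assumptions: the focal loss is written in its standard form, the focusing parameter `gamma` is set to 2.0, and the L1 and L2 penalties are applied to a flat list of model weights. The function names are hypothetical.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Standard focal loss for one pixel: -(1 - p_t)^gamma * log(p_t).

    probs is the predicted class distribution, target the true class index.
    gamma=2.0 is an assumed value; the paper does not report one.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def regularized_focal_loss(pixel_probs, pixel_targets, weights,
                           l1=0.5, l2=0.0, gamma=2.0):
    """Mean focal loss over pixels plus L1 and L2 penalties.

    Assumption: the penalties act on the model weights. l1=0.5 and l2=0.0
    correspond to the setting the paper reports as best.
    """
    data_term = sum(
        focal_loss(p, t, gamma) for p, t in zip(pixel_probs, pixel_targets)
    ) / len(pixel_targets)
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return data_term + l1_term + l2_term

# Toy example: two pixels, two classes, two weights.
probs = [[0.9, 0.1], [0.5, 0.5]]
targets = [0, 0]
print(regularized_focal_loss(probs, targets, weights=[1.0, -1.0]))
```

Setting `l1` and `l2` to 0.0 or 0.5 gives the four coefficient combinations compared in the experiments.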
The experimental settings are as follows. We conducted our experiments with a learning rate of 0.001 and a batch size of 4. We chose the ATR dataset [9] of clothing images because the images are composed of various sizes, shapes, and textures and are therefore suitable for obtaining fine-grained segments. Background images were excluded during the evaluation. To evaluate the performance with various measures, we used recall [17], precision [17], F1 score [17], and IOU.
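The four measures can be computed from pixel-wise predictions as in the following sketch. It uses the usual definitions based on true positives, false positives, and false negatives; the function names, the averaging over classes, and the treatment of label 0 as the excluded background are assumptions for illustration rather than details given in the paper.

```python
def class_scores(pred, truth, label):
    """Return precision, recall, F1, and IOU for one class label."""
    tp = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    fp = sum(1 for p, t in zip(pred, truth) if p == label and t != label)
    fn = sum(1 for p, t in zip(pred, truth) if p != label and t == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

def mean_scores(pred, truth, background=0):
    """Average the per-class scores over all non-background labels.

    Assumption: background is label 0 and is skipped during evaluation.
    """
    labels = sorted(set(truth) - {background})
    per_class = [class_scores(pred, truth, c) for c in labels]
    keys = ("precision", "recall", "f1", "iou")
    if not per_class:
        return {k: 0.0 for k in keys}
    return {k: sum(s[k] for s in per_class) / len(per_class) for k in keys}

# Flattened toy masks with labels 0 (background), 1, and 2.
pred = [0, 1, 1, 2, 2, 2]
truth = [0, 1, 2, 2, 2, 1]
print(mean_scores(pred, truth))
```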
In all experiments, we used zero-padding to pad every image to the size of its long side. We compared our method with the existing method of FCN and experimented on three U-Net models: U-Net without BN and focal loss with L1 regularization, U-Net BN and focal loss with L1 regularization, and attention U-Net using focal loss with L1 regularization. Attention U-Net has a U-Net structure with an attention gate. To verify the effect of focal loss with L1 regularization, we compared the performance of the existing cross-entropy loss with that of the proposed method, that is, focal loss with L1 regularization. To compare the regularization terms, we tested the proposed focal loss with L1 regularization, focal loss with L2 regularization, and focal loss with both L1 and L2 regularization. The L1 and L2 regularization terms are added with a coefficient of 0.5.
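The zero-padding step can be sketched as follows for an image stored as a nested list. The paper does not say where the padding is placed, so padding on the right and bottom is an assumption, as is the function name `pad_to_square`.

```python
def pad_to_square(image, fill=0):
    """Zero-pad a 2D image so that both sides equal its long side.

    Assumption: padding is added on the right and at the bottom.
    """
    height = len(image)
    width = len(image[0])
    side = max(height, width)
    # Extend each existing row to the target width.
    padded = [list(row) + [fill] * (side - width) for row in image]
    # Append all-zero rows until the target height is reached.
    padded += [[fill] * side for _ in range(side - height)]
    return padded

# A 2x3 toy image becomes 3x3.
for row in pad_to_square([[1, 2, 3], [4, 5, 6]]):
    print(row)
```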
We tested the acquisition of fine-grained segments in semantic segmentation using the existing cross-entropy loss and the proposed focal loss with L1 regularization on three semantic segmentation models, FCN, Attention U-Net, and U-Net BN, on the clothing images. Figure 2 shows the experimental results of the three models for four combinations of the regularization coefficient parameters. As mentioned before, we chose the L1 and L2 regularization coefficients from the combinations 0 and 0, 0 and 0.5, 0.5 and 0, and 0.5 and 0.5. In Figure 2, the blue and red circles mark incorrect predictions, and the red circles mark the best among the incorrect predictions. Most results using focal loss with L1 regularization are better than those using the cross-entropy loss, especially on fine-grained segments such as hands, feet, shoes, and neck.
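The experimental grid described here, two loss functions crossed with the four coefficient combinations, can be enumerated as in the sketch below. The panel numbering (I) to (VIII) follows the figure captions; the dictionary keys and variable names are hypothetical.

```python
from itertools import product

PANELS = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
LOSSES = ["cross-entropy", "focal"]
COEFFS = [0.0, 0.5]  # each of L1 and L2 is either off (0.0) or 0.5

# product() varies the last argument fastest, which reproduces the caption
# order: (L1 0.0, L2 0.0), (L1 0.0, L2 0.5), (L1 0.5, L2 0.0), (L1 0.5, L2 0.5).
configs = [
    {"panel": panel, "loss": loss, "l1": l1, "l2": l2}
    for panel, (loss, l1, l2) in zip(PANELS, product(LOSSES, COEFFS, COEFFS))
]

for cfg in configs:
    print(cfg)
```

In this enumeration, panel (VII) is focal loss with an L1 coefficient of 0.5 and an L2 coefficient of 0.0.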
The best result of all experiments is shown in panel (VII) of Figure 2, which uses focal loss with L1 regularization. That is, L2 regularization is not helpful for the acquisition of fine-grained segments in semantic segmentation. L1 regularization is more robust to outliers than L2 regularization. Many fine-grained segments in images with various sizes, shapes, and textures are outliers. Therefore, L1 regularization is better than L2 regularization for the acquisition of fine-grained segments in semantic segmentation.
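One common way to read the robustness claim is through the gradients of the two penalties: the L1 penalty pulls every value with the same strength, whereas the L2 penalty pulls large values proportionally harder, so a single outlier dominates the update. The sketch below illustrates this general property; it is not an analysis taken from the paper, and the function names and sample values are hypothetical.

```python
def l1_grad(w):
    """Gradient of |w|: the sign of w (0 at w == 0 by convention)."""
    return (w > 0) - (w < 0)

def l2_grad(w):
    """Gradient of w^2: grows linearly with the magnitude of w."""
    return 2 * w

# Three typical values and one outlier (hypothetical numbers).
values = [0.1, -0.2, 0.1, 5.0]
for w in values:
    print(f"w={w:+.1f}  L1 gradient={l1_grad(w):+d}  L2 gradient={l2_grad(w):+.1f}")
```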
The attention U-Net and U-Net BN are better than the FCN because the U-Net stably acquires multi-scale fine-grained segments. Of the three methods, the U-Net BN shows the best results in all experiments because the normalization of all layers helps find fine-grained segments. From these results, we can ascertain that focal loss with L1 regularization and normalization are useful for the acquisition of fine-grained segments. In our experiments, we also tested the intrinsic U-Net structure with cross-entropy loss, but the training loss did not decrease, as shown in Figures 3 and 4. As a result, the U-Net BN using focal loss with L1 regularization shows excellent results.
To verify whether the U-Net BN shows the best result for various clothing images, we tested four methods on four clothing images, as shown in Figure 5. In these experiments, we added the results of U-Net using focal loss with L1 0.5 regularization. As shown in Figure 5, U-Net BN using focal loss with L1 0.5 and L2 0.0 regularization shows the best results for the leg, skirt, and neck. The results in Figure 5(a) are the worst because the colors in the background are similar to the colors worn by the woman. Even so, the U-Net BN using focal loss with L1 0.5 and L2 0.0 regularization generates the best quality because the localization effect of normalization makes it possible for the method to better distinguish the colors in the background from those of the woman. In Figure 5(c), the U-Net BN using focal loss with L1 0.5 and L2 0.0 regularization also shows the best results, especially on the small parts of the shoes, when compared to the other methods. In Figure 5(d), the U-Net BN using focal loss with L1 0.5 and L2 0.0 regularization generates more precise fine-grained segments in the area of the sunglasses than the other methods, and even better than the ground-truth mask.
To analyze more specifically the effects of BN with the two losses, and of the two losses with several regularization coefficients, on the ATR dataset, we show the loss for the four models in Figure 3. The orange line indicates the loss of the U-Net structure. The loss of the U-Net structure is inferior and barely decreases. The intrinsic U-Net with cross-entropy and without normalization does not work well for fine-grained segments in semantic segmentation. The variation in the loss of the intrinsic U-Net is not very large with cross-entropy loss but is quite large with focal loss, because the model prediction correction of focal loss is sometimes successful in training the intrinsic U-Net.
FCN showed poor performance in the case of L1 0.5 regularization. When the FCN obtains segments, the calculated difference between the predicted value and the correct value is not adequately reflected in training. The variation in the loss of U-Net BN becomes too large with L1 0.5 and L2 regularization. The L1 regularization reflects differences whose values vary across experiments. Therefore, it is crucial to reflect the difference value efficiently. Overall, the methods with focal loss showed better results than those with cross-entropy loss. As shown in panel (VII) of Figure 3, the methods using both L1 and L2 regularization showed the worst performance because too much regularization has an adverse effect.
We analyzed the influence of the attention gate in terms of the F1 score. Figure 4 shows the results of the experiments: Figure 4(a) is with the attention gate and Figure 4(b) is without the attention gate. As shown in Figure 4, the F1 score of U-Net without the attention gate was higher than that of U-Net with the attention gate. This is because the attention gate tries to find the strongest characteristics in the images during the initial steps of training, but it finds erroneous characteristics.
Table 1 shows the impact of the four regularization methods in terms of IOU, precision, and recall for three models: U-Net, attention U-Net, and U-Net BN. The values shown in Table 1 are the mean and standard deviation of the results obtained from three runs of the same model. As shown in Table 1, the overall performance of U-Net is worse than that of the other models. However, the U-Net with L1 0.5 and L2 0.0 regularization in Table 1 showed about 23% improvement over U-Net with the other regularization methods. We think that regularization with L1 0.5 and L2 0.0 coefficients is effective in the attention of U-Net. From the results, the attention is not helpful, and the performance of U-Net BN is improved by about 22% over that of U-Net. Therefore, we can deduce that the attention gate induces learning in the wrong direction during training. These results confirm that focal loss with L1 0.5 and L2 0.0 regularization is useful for the acquisition of fine-grained segments in semantic segmentation.
Figure 6 shows the effects of regularization on cross-entropy and focal loss, comparing no regularization with regularization using L1 and L2. This confirms that regularization does not significantly affect the cross-entropy results of U-Net. However, the acquisition of fine-grained segments in semantic segmentation improves when regularization is combined with focal loss. The focal loss has a term that reflects the complement of the probability predicted by the model. This means that the model prediction correction is affected more by focal loss than by the existing cross-entropy.
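For reference, the two losses can be written side by side in their standard forms. Here $p_t$ denotes the predicted probability of the true class and $\gamma \ge 0$ is the focusing parameter of the focal loss; the paper does not report the value of $\gamma$ it used, so the numbers below assume $\gamma = 2$ purely for illustration.

```latex
\begin{align}
\mathrm{CE}(p_t) &= -\log p_t, \\
\mathrm{FL}(p_t) &= -(1 - p_t)^{\gamma} \log p_t .
\end{align}
```

With $\gamma = 0$ the focal loss reduces to cross-entropy. With $\gamma = 2$, a well-classified pixel with $p_t = 0.9$ is scaled by $(1 - 0.9)^2 = 0.01$, while a poorly classified pixel with $p_t = 0.1$ is scaled by $(1 - 0.1)^2 = 0.81$, so hard pixels contribute far more to the loss than easy ones.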
We proposed a method based on the U-Net structure with two additional components for the acquisition of fine-grained segments. We added normalization to all layers of the U-Net structure and proposed a combined component composed of focal loss with L1 regularization. We experimented with the proposed methods on the ATR dataset and analyzed the results. The experimental results showed that the proposed methods were better than the previous FCN and the intrinsic U-Net. These results indicate that the U-Net is a suitable structure for semantic segmentation, that applying normalization to all layers of the U-Net is beneficial for semantic segmentation, and that model prediction correction using focal loss with L1 regularization is good at acquiring fine-grained segments in semantic segmentation. In the future, we will develop a semantic segmentation structure using a generative model to obtain a more robust acquisition of fine-grained segments in semantic segmentation.
No potential conflict of interest relevant to this article was reported.
This research was financially supported by Hansung University.
Figure 2. Results of the focal loss regularization models: (a) FCN, (b) attention U-Net, and (c) U-Net BN. (I) Cross-entropy with L1 0.0 and L2 0.0 regularization coefficient, (II) cross-entropy with L1 0.0 and L2 0.5 regularization coefficient, (III) cross-entropy with L1 0.5 and L2 0.0 regularization coefficient, (IV) cross-entropy with L1 0.5 and L2 0.5 regularization coefficient, (V) focal loss with L1 0.0 and L2 0.0 regularization coefficient, (VI) focal loss with L1 0.0 and L2 0.5 regularization coefficient, (VII) focal loss with L1 0.5 and L2 0.0 regularization coefficient, and (VIII) focal loss with L1 0.5 and L2 0.5 regularization coefficient.
Figure 3. Comparison with and without BN for two loss function types and various regularization coefficients in the loss function over training time. (I) Cross-entropy with L1 0.0 and L2 0.0 regularization coefficient, (II) cross-entropy with L1 0.0 and L2 0.5 regularization coefficient, (III) cross-entropy with L1 0.5 and L2 0.0 regularization coefficient, (IV) cross-entropy with L1 0.5 and L2 0.5 regularization coefficient, (V) focal loss with L1 0.0 and L2 0.0 regularization coefficient, (VI) focal loss with L1 0.0 and L2 0.5 regularization coefficient, (VII) focal loss with L1 0.5 and L2 0.0 regularization coefficient, and (VIII) focal loss with L1 0.5 and L2 0.5 regularization coefficient.
Figure 4. Comparison of F1 score according to the addition of the attention gate to the segmentation model over training time: (a) with the attention gate and (b) without the attention gate.
Figure 5. Four methods on four clothing images: (a) case 1, (b) case 2, (c) case 3, and (d) case 4. The proposed method is focal loss with an L1 0.5 regularization coefficient. (I) FCN, (II) U-Net, (III) attention U-Net, and (IV) U-Net BN.
Figure 6. Comparison of the regularization effect without regularization and with L1 and L2 regularization: (a) cross-entropy loss and (b) focal loss. (i) L1 0.0 and L2 0.0 regularization coefficient, (ii) L1 0.5 and L2 0.5 regularization coefficient.
Table 1. Comparison of focal loss across U-Net models.
Model | Regularization | IOU | Precision | Recall |
---|---|---|---|---|
U-Net | (i) | 0.496 ± 0.000 | 0.001 ± 0.000 | 0.000 ± 0.000 |
U-Net | (ii) | 0.496 ± 0.000 | 0.001 ± 0.001 | 0.000 ± 0.000 |
U-Net | (iii) | 0.632 ± 0.011 | 0.344 ± 0.029 | 0.333 ± 0.004 |
U-Net | (iv) | 0.496 ± 0.000 | 0.003 ± 0.002 | 0.000 ± 0.000 |
Attention U-Net | (i) | 0.715 ± 0.008 | 0.536 ± 0.002 | 0.420 ± 0.002 |
Attention U-Net | (ii) | 0.724 ± 0.005 | 0.555 ± 0.005 | 0.444 ± 0.013 |
Attention U-Net | (iii) | 0.723 ± 0.002 | 0.554 ± 0.006 | 0.441 ± 0.004 |
Attention U-Net | (iv) | 0.723 ± 0.005 | 0.548 ± 0.012 | 0.437 ± 0.013 |
U-Net BN | (i) | 0.731 ± 0.002 | 0.565 ± 0.007 | 0.464 ± 0.005 |
U-Net BN | (ii) | 0.728 ± 0.004 | 0.564 ± 0.004 | 0.458 ± 0.004 |
U-Net BN | (iii) | 0.729 ± 0.003 | 0.567 ± 0.003 | 0.459 ± 0.007 |
U-Net BN | (iv) | 0.730 ± 0.001 | 0.564 ± 0.006 | 0.461 ± 0.001 |
(i) L1 0.0 and L2 0.0 regularization coefficient, (ii) L1 0.0 and L2 0.5 regularization coefficient, (iii) L1 0.5 and L2 0.0 regularization coefficient, and (iv) L1 0.5 and L2 0.5 regularization coefficient.
International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(1): 59-68
Published online March 25, 2020 https://doi.org/10.5391/IJFIS.2020.20.1.59
Copyright © The Korean Institute of Intelligent Systems.
Seoung-Ho Choi1 and Sung Hoon Jung2
1Department of Electronics and Information Engineering, Hansung University, Seoul, Korea
2Division of Mechanical and Electronics Engineering, Hansung University, Seoul, Korea
Correspondence to:Sung Hoon Jung (shjung@hansung.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Acquisition of fine-grained segments in semantic segmentation is important in most sementic segmentation applications, especially for clothing images composed of fine-grained textures. However, most existing semantic segmentation methods based on fully convolutional network (FCN) were not enough to acquire fine-grained segments because they are based on a single resolution and can not well distinguish between objects in the images. To stabilize the acquisition of fine-grained segments, we propose a method that is composed of two additional components in the U-Net structure for processing multi-scale fine-grained segments. The first component is to use normalization at all layers. We found from experiments that normalization is a key process in stabilizing the acquisition of fine-grained segments, especially in the U-Net based methods because they operate on a multi-scale fine-grained segment. An additional component is to use model prediction correction using focal loss with L1 regularization. Focal loss can be used to control the model prediction term as regularization in the training process. From experiments, we found that our method was better than the existing methods.
Keywords: Multi-scale segments, Batch normalization, Model prediction correction
Most images consist of various sizes, shapes, and textures. Therefore, it is necessary for specific image processing to operate appropriately under images with various characteristics. Since the textures of clothing images are very fine-grained and diverse, acquisition of fine-grained segments is particularly important. Finally, obtaining accurate semantic segments of images is an important goal to achieve in semantic segmentation researches [1].
However, there are not many studies that explore the acquisition of fine-grained segments of semantic segmentation in images. Most existing researches on image semantic segmentation has adopted fully convolutional network (FCN) [2, 3]. However, their segmentation results were not good at finding fine-grained segments, semantic segments because the convolution of FCN [2] didn’t maintain spatial information. Even though there were related works on fully convolutional network-conditional random field (FCN-CRF) [3], FCN [2] using Atrous convolution, and FCN using the skip diagram in the FCN upsampling process to improve the performances of FCN, most of FCN showed poor results. In the methods, obtaining various sizes of segments are not reflected in model training. Therefore, obtaining various sizes of fine-grained segments is a crucial factor in processing segmentation for acquisition of fine-grained segments in images.
To overcome the problem, we adopt two additional components on a U-Net [4] based structure for acquisition of fine-grained segments of semantic segmentation. The first additional component is the use of normalization at all layers in the training process. Normalization alleviates the variation of multi-scale information, especially in the multi-scale processing of U-Net. We took the well-known batch normalization (BN) [5] in all U-Net layers. As second additional component, we take the model prediction correction using focal loss [6] with L1 [7] regularization in training. Existing loss of cross-entropy does not reflect the fine-grained segments because it does not balance the model prediction space. From observation, we consider the model prediction correction for fine-grained segments and propose a novel loss composed of focal loss [6] with L1 regularization [7]. As a result, we introduce a novel structure based on U-Net [4] that trained with BN and a devised novel loss.
To measure performances of our method and previous methods, we experimented with three models such as U-Net [4], Attention U-Net [8], and U-Net BN including the existing FCN [2] models on the ATR dataset [9]. U-Net BN refers to a model that applies BN to all layers of the existing U-Net. We also tested our method with various combinations of loss and provided their results with some measures such as intersection over union (IOU), precision, and recall. From extensive experiments, we find that our method with two additional components have generated correct fine-grained segments, especially in small and complex textures such as hands, feet, and glasses. Also, the overall performances of our method were considerably better than those of the existing methods based on FCN in terms of accuracy using IOU, precision, and recall measures.
The rest of the paper is organized as follows. In Section 2, we introduce related works. The U-Net structure with two additional components of BN and novel combined focal loss with L1 regularization is addressed in Section 3. Section 4 shows the experiment results and discussion. We conclude in Section 5 with further works.
The segmentation model is divided into semantic segmentation [2, 4] and instance segmentation [10]. Semantic segmentation [2–4, 8, 11] is to provide a same label to semantically same objects, while instance segmentation is to assign a new label for each instance. Since the aim of the paper is for the acquisition of fine-grained segments in semantic segmentation, we focus on semantic segmentation. In semantic segmentation, FCN [2] is a conventional method. Although FCN has used in semantic segmentation, there is a limitation that it can only extract features without maintaining the shapes of images because it uses a single resolution. To solve limitations, Hao and Kang [12] have proposed a method for extracting multi-feature maps using pyramid pooling based on FCN-VGG for semantic segmentation. It is confirmed that the improved features enhanced the performance through fusion with the extracted features.
However, their method has another limitation since the pyramid pooling is performed after the last layer of FCN-VGG, only abstract features are used for pyramid pooling. As a result, deconvolution using these abstract features in deconvolution layers generated poor segments with much distortion. There are a lot of researches on segmentation without distortion, especially for medical images [4, 13, 14]. Even if they showed good segmentation results with less distortion, they also had other limitations in the viewpoints of finding fine-grained segments. In other words, these methods did not find fine-grained segments, although their distortion was small.
It was well known that multi-scale processing helped segmentation algorithms and a hierarchical structure could implement multi-scale processing. Salembier [15] showed that morphology could be efficiently differentiated using a hierarchical structure. Wang et al. [16] propose spawning and using low-rank subspace features from the hierarchical structure. There are no previous researches on how to affects multiscale processing as acquisition of fine-grained segments of semantic segmentation. We propose a new method to use two additional components to improve performance in U-Net.
We propose a U-Net [4] based semantic segmentation method for acquisition of multi-scale fine-grained segments in semantic segmentation with two additional components. U-Net is a well-known model that can find fine-grained information through efficient reflection of multi-scale information. However, semantic segmentation using only U-Net does not find fine-grained segments, especially for the images composed of various shapes and textures. To overcome the limitation, we adopt two additional components, BN to efficiently learn stabilized multi-scale fine-grained segments and model prediction correction using focal loss with L1 regularization. We found from our extensive experiments that BN was a crucial process in finding multi-scale fine-grained segments because the normalization of data was more important to precisely calculate fine-grained segments for multi-scale data in all layers. Also, we use a combining method of focal loss and L1 regularization to balance the model prediction correction that enables the segmentation to stabilize acquired fine-grained segments in multi-scale data. We confirmed the results through extensive experimentation in Section 4
Figure 1 shows the structure of our U-Net based semantic segmentation. The overall structure is nearly the same as the original U-Net except for the BN of all layers, depicted as yellow arrows and model prediction correction using focal loss [6] with L1 [7] regularization.
The experimental environments are as follows. We conducted our experiments with a learning rate of 0.001 and a batch size of 4. We chose an ATR dataset [9] of clothing images because they are composed of various sizes, shapes, and textures and are therefore adequate to get fine-grained segments. Background images excluded during the evaluation. In order to evaluate the performances with various measures, we used recall [17], precision [17], F1 score [17], and IOU.
In all experiments, we used zero-padding to make all the images the same size as the long side of the image. We compared our method with the existing method of FCN and experimented on three U-Net models: U-Net without BN and focal loss with L1 regularization, U-Net BN and focal loss with L1 regularization, and attention U-Net using focal loss with L1 regularization. Attention U-Net has an U-Net structure with an attention gate. To verify focal loss with L1 regularization for optimization using regularization, we compared the performance of existing cross-entropy and proposed method, that is, focal loss with L1 regularization. In experiments, we tested proposed focal loss with L1 regularization, focal loss with L2 regularization, and focal loss with L1 and L2 to compare the regularization. L1 and L2 regularization coefficient are added at a ratio of 0.5.
We tested the acquisition of fine-grained segments on the clothing images using the previous cross-entropy loss and the proposed focal loss with L1 regularization on three semantic segmentation models, namely FCN, attention U-Net, and U-Net BN. Figure 2 shows the experimental results of the three models for four combinations of regularization coefficients. As mentioned before, we chose the L1 and L2 regularization coefficients as 0 and 0, 0 and 0.5, 0.5 and 0, and 0.5 and 0.5. In Figure 2, the blue and red circles mark incorrect predictions, and the red circles mark the best among the incorrect predictions. As can be seen, most results using focal loss with L1 regularization are better than those using the cross-entropy loss, especially on fine-grained segment shapes such as hands, feet, shoes, and neck.
The best result of all experiments is shown in panel (VII) of Figure 2, which uses focal loss with L1 regularization. That is, L2 regularization is not helpful for the acquisition of fine-grained segments in semantic segmentation. L1 regularization is more robust to outliers than L2 regularization, and many fine-grained segments in images with various sizes, shapes, and textures are outliers. Therefore, L1 regularization is better than L2 regularization for the acquisition of fine-grained segments in semantic segmentation.
The attention U-Net and U-Net BN are better than the FCN because the U-Net stably acquires multi-scale fine-grained segments. Of the three methods, the U-Net BN shows the best results of all experiments because the normalization of all layers helps find fine-grained segments. From these results, we can ascertain that focal loss with L1 regularization and normalization are useful for the acquisition of fine-grained segments. In our experiments, we also tested the intrinsic U-Net structure with cross-entropy loss, but the training loss did not decrease, as shown in Figures 3 and 4. As a result, the U-Net BN using focal loss with L1 regularization shows excellent results.
To check whether the U-Net BN shows the best result for various clothing images, we tested four methods on four clothing images, as shown in Figure 5. In these experiments, we added the results of U-Net using focal loss with L1 0.5 regularization. As shown in Figure 5, U-Net BN using focal loss with L1 0.5 and L2 0.0 regularization shows the best results for the leg, skirt, and neck. The results of Figure 5(a) are the worst because the colors in the background are similar to the colors worn by the women. Even so, the U-Net BN using focal loss with L1 0.5 and L2 0.0 generates the best quality because the localization effect of normalization makes it possible for the method to better distinguish the colors in the background from those of the women. In Figure 5(c), the U-Net BN using focal loss with L1 0.5 and L2 0.0 regularization also shows the best results, especially on the small parts of the shoes, when compared to the other methods. In Figure 5(d), the U-Net BN using focal loss with L1 0.5 and L2 0.0 regularization generates more precise fine-grained segments in the area of the sunglasses than the other methods, and even more precise segments than the ground-truth mask.
To analyze more specifically the effects of BN with the two losses and several regularization coefficients on the ATR dataset, we show the loss for the four models in Figure 3. The orange line indicates the loss of the U-Net structure. As can be seen, the loss of the U-Net structure was inferior and barely decreased. The intrinsic U-Net with cross-entropy and without normalization does not work well for fine-grained segments in semantic segmentation. The variation in the loss of the intrinsic U-Net is not very large with cross-entropy loss but is quite large with focal loss because the model prediction correction of focal loss is sometimes successful in training the intrinsic U-Net.
FCN showed poor performance in the case of L1 0.5 regularization. In the process of obtaining the segments in the FCN, the calculated difference between the predicted value and the correct value is not reflected effectively. The variation in the loss of U-Net BN becomes too large when L1 0.5 and L2 regularization are used. The L1 regularization reflects differences whose values vary across experiments. Therefore, it is crucial to reflect the difference value efficiently. Overall, the methods with focal loss showed better results than those with cross-entropy loss. As shown in panel (VIII) of Figure 3, the methods using both L1 and L2 regularization showed the worst performance because too much regularization has an adverse effect.
We analyzed the influence of the attention gate in terms of the F1 score. Figure 4 shows the results of the experiments: Figure 4(a) is with the attention gate and Figure 4(b) is without it. As shown in Figure 4, the result of U-Net without the attention gate was higher than that of U-Net with the attention gate, because the attention gate tries to find the strongest characteristics in the images in the initial steps of training but finds erroneous characteristics.
Table 1 shows the impact of the four regularization methods on IOU, precision, and recall for three models: U-Net, attention U-Net, and U-Net BN. The values shown in Table 1 are the mean and standard deviation of the results obtained from three runs of the same model. As shown in Table 1, the overall performance of U-Net is worse than that of the other models. However, the U-Net with L1 0.5 and L2 0.0 regularization in Table 1 showed about 23% improvement over U-Net with the other regularization methods. We think that regularization with L1 0.5 and L2 0.0 coefficients is effective for the U-Net. From the results, the attention is not helpful, and the performance of U-Net BN is improved by about 22% over that of U-Net. Therefore, we can deduce that the attention gate induces learning in the wrong direction during training. These results confirm that focal loss with L1 0.5 and L2 0.0 regularization is useful for the acquisition of fine-grained segments in semantic segmentation.
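The "mean ± standard deviation over three runs" entries can be produced as in the sketch below. The scores passed in are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form was used.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, standard deviation) of one metric over repeated runs."""
    mean = statistics.mean(scores)
    # Assumption: population standard deviation; use statistics.stdev for the sample form.
    std = statistics.pstdev(scores)
    return mean, std


# Hypothetical IOU values from three runs of the same model.
mean_iou, std_iou = summarize_runs([0.730, 0.731, 0.732])
print(f"{mean_iou:.3f} ± {std_iou:.3f}")
```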
Figure 6 shows the regularization effects on cross-entropy and focal loss, without regularization and with L1 and L2 regularization. This confirms that regularization does not significantly affect the cross-entropy results of U-Net. However, the performance of acquiring fine-grained segments in semantic segmentation is improved when regularization is combined with focal loss. The focal loss has a term that reflects the reverse of the probability from the model prediction. This means that model prediction correction is affected more by focal loss than by the existing cross-entropy.
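The role of this term can be made concrete with a short worked example. It uses the standard form of focal loss from [6], where p_t is the probability the model assigns to the true class; the focusing parameter γ = 2 is assumed here only for illustration.

```latex
\begin{align*}
\mathrm{CE}(p_t) &= -\log p_t \\
\mathrm{FL}(p_t) &= -(1 - p_t)^{\gamma} \log p_t \\
\gamma = 2,\; p_t = 0.9 &: \quad (1 - p_t)^{\gamma} = 0.01 \\
\gamma = 2,\; p_t = 0.1 &: \quad (1 - p_t)^{\gamma} = 0.81
\end{align*}
```

Under this assumption, a confidently correct pixel contributes only 1% of its cross-entropy loss, whereas a badly misclassified pixel keeps 81% of it, so training is driven mainly by the predictions that still need correction.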
We proposed a method based on the U-Net structure with two additional components for the acquisition of fine-grained segments. We added normalization to all layers of the U-Net structure and proposed a combined component composed of focal loss with L1 regularization. We tested the proposed method on the ATR dataset and analyzed the results. The experimental results showed that the proposed method was better than the previous FCN and the intrinsic U-Net. These results indicate that the U-Net is a suitable structure for semantic segmentation, that applying normalization to all layers of the U-Net is beneficial for semantic segmentation, and that model prediction correction using focal loss with L1 regularization is good at acquiring fine-grained segments in semantic segmentation. In the future, we will study a semantic segmentation structure that uses a generative model to obtain more robust acquisition of fine-grained segments.
No potential conflict of interest relevant to this article was reported.
This research was financially supported by Hansung University.
Figure 1. Proposed U-Net structure.
Figure 2. Result of focal loss regularization model: (a) FCN, (b) attention U-Net, and (c) U-Net BN. (I) Cross-entropy with L1 0.0 and L2 0.0 regularization coefficient, (II) cross-entropy with L1 0.0 and L2 0.5 regularization coefficient, (III) cross-entropy with L1 0.5 and L2 0.0 regularization coefficient, (IV) cross-entropy with L1 0.5 and L2 0.5 regularization coefficient, (V) focal loss with L1 0.0 and L2 0.0 regularization coefficient, (VI) focal loss with L1 0.0 and L2 0.5 regularization coefficient, (VII) focal loss with L1 0.5 and L2 0.0 regularization coefficient, and (VIII) focal loss with L1 0.5 and L2 0.5 regularization coefficient.
Figure 3. Comparison of training loss with and without BN for two loss function types and various regularization coefficients. (I) Cross-entropy with L1 0.0 and L2 0.0 regularization coefficient, (II) cross-entropy with L1 0.0 and L2 0.5 regularization coefficient, (III) cross-entropy with L1 0.5 and L2 0.0 regularization coefficient, (IV) cross-entropy with L1 0.5 and L2 0.5 regularization coefficient, (V) focal loss with L1 0.0 and L2 0.0 regularization coefficient, (VI) focal loss with L1 0.0 and L2 0.5 regularization coefficient, (VII) focal loss with L1 0.5 and L2 0.0 regularization coefficient, and (VIII) focal loss with L1 0.5 and L2 0.5 regularization coefficient.
Figure 4. Comparison of F1 score during training according to the addition of the attention gate to the segmentation model: (a) with the attention gate and (b) without the attention gate.
Figure 5. Four methods on four clothing images: (a) case 1, (b) case 2, (c) case 3, and (d) case 4. The proposed method is focal loss with L1 0.5 regularization coefficient. (I) FCN, (II) U-Net, (III) attention U-Net, and (IV) U-Net BN.
Figure 6. Comparison of regularization effect without regularization and with L1 and L2 regularization: (a) cross-entropy loss and (b) focal loss. (i) L1 0.0 and L2 0.0 regularization coefficient, (ii) L1 0.5 and L2 0.5 regularization coefficient.
Table 1. Comparison of focal loss regularization settings on U-Net models.
| Model | Regularization | IOU | Precision | Recall |
|---|---|---|---|---|
| U-Net | (i) | 0.496 ± 0.000 | 0.001 ± 0.000 | 0.000 ± 0.000 |
| U-Net | (ii) | 0.496 ± 0.000 | 0.001 ± 0.001 | 0.000 ± 0.000 |
| U-Net | (iii) | 0.632 ± 0.011 | 0.344 ± 0.029 | 0.333 ± 0.004 |
| U-Net | (iv) | 0.496 ± 0.000 | 0.003 ± 0.002 | 0.000 ± 0.000 |
| Attention U-Net | (i) | 0.715 ± 0.008 | 0.536 ± 0.002 | 0.420 ± 0.002 |
| Attention U-Net | (ii) | 0.724 ± 0.005 | 0.555 ± 0.005 | 0.444 ± 0.013 |
| Attention U-Net | (iii) | 0.723 ± 0.002 | 0.554 ± 0.006 | 0.441 ± 0.004 |
| Attention U-Net | (iv) | 0.723 ± 0.005 | 0.548 ± 0.012 | 0.437 ± 0.013 |
| U-Net BN | (i) | 0.731 ± 0.002 | 0.565 ± 0.007 | 0.464 ± 0.005 |
| U-Net BN | (ii) | 0.728 ± 0.004 | 0.564 ± 0.004 | 0.458 ± 0.004 |
| U-Net BN | (iii) | 0.729 ± 0.003 | 0.567 ± 0.003 | 0.459 ± 0.007 |
| U-Net BN | (iv) | 0.730 ± 0.001 | 0.564 ± 0.006 | 0.461 ± 0.001 |
(i) L1 0.0 and L2 0.0 regularization coefficient, (ii) L1 0.0 and L2 0.5 regularization coefficient, (iii) L1 0.5 and L2 0.0 regularization coefficient, and (iv) L1 0.5 and L2 0.5 regularization coefficient.