International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(2): 114-124
Published online June 25, 2024
https://doi.org/10.5391/IJFIS.2024.24.2.114
© The Korean Institute of Intelligent Systems
Seoung-Ho Choi
Faculty of Liberal Arts, Hansung University, Seoul, Korea
Correspondence to: Seoung-Ho Choi (jcn99250@naver.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The loss function is instrumental in learning by adjusting the disparity between the predicted and actual distributions, thereby necessitating a more precise measurement of these distributions to enhance model training. We propose a novel loss function comprising three components, termed the scale correction cascade smooth loss (SCCSL). First, a smooth loss component enhances performance by optimizing the margin. Second, a cascade component mitigates training misdirection. Third, a scale correction component addresses sparse values in the actual and predicted distributions. We validated our approach through an ablation study employing two models (Vanilla GAN and LSGAN) across four datasets (MNIST, Fashion MNIST, CIFAR-10, and CIFAR-100), demonstrating superior model optimization and image-generation performance compared with existing methods.
Keywords: Smooth method, Cascade method, Scale correction, Loss function
The field of generative models has seen significant advances in recent years, with various approaches and loss functions being employed to optimize model training. Existing methods such as the generative adversarial network (GAN) [1] and deep convolutional GAN (DCGAN) [2] commonly use cross-entropy loss, whereas others such as the least squares GAN (LSGAN) rely on the mean square error (MSE) loss [3]. Furthermore, techniques such as the super-resolution GAN (SRGAN) [4], Wasserstein GAN (WGAN) [5], and Wasserstein GAN with gradient penalty (WGAN-GP) [6] use the Wasserstein loss to further analyze the distribution. DRAGAN [7] attempted to develop a loss function independent of previously observed losses by introducing convexity during the learning process, using both min-max and convex loss terms to address the minimization problem. Research on generative models initially focused on learning a single loss per model and subsequently evolved toward models using two losses. In this context, DualGAN [8] introduced a method for simultaneously incorporating two losses into each model, enabling two generation models to be learned within the same framework. In addition, researchers have explored linear combinations of loss functions to efficiently determine adversarial features [9]. Although many studies have focused on improving generation performance through two-loss learning methods, single-loss methods have also been proposed that enhance GAN performance by constraining the Lipschitz constant [10], utilizing neighbor embedding and task matching, and applying spectral normalization [11]. However, relying solely on a single loss for model learning may not fully capture the capacity of the model or the complexity of the loss-based methods explored in the literature.
In image translation tasks, the prevailing approach combines an adversarial loss with individual losses serving various roles. Isola et al. [12] constructed a loss by adding conditions to the existing generator loss for image-to-image translation. Zhu et al. [13] combined the adversarial loss with a cycle-consistency loss in cycle-consistent adversarial networks to form one full loss for unpaired image-to-image translation, learning cross-domain relations by processing data from different domains. Kim et al. [14] used a single generation-model loss; in their framework, the discriminator combines the discrimination loss of one domain with that of another domain to constitute the overall discrimination loss. Choi et al. [15] augmented the existing adversarial loss with domain classification and reconstruction losses for multi-domain image-to-image translation, enriching the discriminator's capability through a hybrid loss mechanism. However, evaluating model performance based solely on a single loss may lack precision in delineating the actual and predicted boundaries because such models may not effectively capture the underlying distributions. The essence of the learning process in many deep learning models lies in the optimization guided by the loss function [16–22], which serves as a compass directing the model toward convergence. Although numerous deep learning models exist, the loss formula plays a pivotal role in error determination, using the actual and predicted distributions when processing new data. Inadequately trained models may suffer from issues such as overfitting or underfitting. Therefore, the accurate measurement of the actual and predicted distributions within the loss function is of paramount importance for effective model training. In response to these challenges, we propose the scale correction cascade smooth loss (SCCSL), driven by the need for a more robust and precise loss function in deep learning tasks.
• SCCSL comprises three key components designed to address the limitations of existing loss functions and to enhance the efficacy of model training. First, we propose a smooth loss component that maximizes the margin between the actual and predicted values in the loss function. The second component is a cascade component that calibrates the first loss and gradually reflects new characteristics. The third component is a scale correction that learns by scaling corrected sparse values.
• We further verified the proposed method through ablation studies covering changes to the loss, changes to the initial distribution, adoption of other loss functions, changes to the optimization method, and adoption of weight decay.
• We verified the method using two models and two datasets; the verification showed better optimization and image-generation performance than the existing losses.
The experimental setup is shown in Figure 1. The red text in Figure 1 marks our contribution, the orange circle indicates the intersection point, and the orange plus sign indicates the combination of the three optimizers with weight decay. In Step 1, the Z-latent space provides the initial distribution of the generation model. Step 2 comprises the generation model used in the experiments. Step 3 describes the new cascade scale-calibration loss. Steps 4 and 5 are the three-optimizer training and regularization steps used to verify the proposed loss. Finally, Step 6 measures the image-generation performance.
The existing landscape of loss functions encompasses various methodologies, each with its own strengths and weaknesses. These loss functions serve as pivotal tools for comparing predicted values with actual data, facilitating the training and optimization of machine learning models; examples include ranking loss [16], absolute loss [17], edit distance loss [18], negative log-likelihood loss [19], focal loss [20], and hinge loss [21]. The edit distance loss [18] measures the number of edit operations required to transform one sequence into another.
Ranking loss [16] does not measure the loss of a particular result but only determines the correct order of the model's predicted results. Absolute loss [17] uses the absolute value of the difference so that the error between the model output and the target is never negative; it is often considered when the data contain significant noise. The negative log-likelihood loss [19] is based on the conditional probability of how well the given data fit the model; the higher the likelihood, the better the model represents the observed data.
Hinge loss [21] was designed to find the decision boundary with the largest distance from the data when classifying each training category. If the score of the correct class exceeds that of the incorrect classes by the margin, the loss is zero.
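To make this margin behaviour concrete, the following is a minimal NumPy sketch of the multi-class hinge loss for a single sample; the function name and the margin value of 1.0 are illustrative choices, not taken from the paper.

```python
import numpy as np

def hinge_loss(scores, correct_class, margin=1.0):
    """Multi-class hinge loss for one sample.

    scores: 1-D array of class scores produced by the model.
    correct_class: index of the ground-truth class.
    The loss is zero once the correct-class score exceeds every
    incorrect-class score by at least `margin`.
    """
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0          # do not penalize the true class
    return margins.sum()

# Example: the correct class (index 1) already wins by a large margin -> loss 0
print(hinge_loss(np.array([0.2, 3.0, -0.5]), correct_class=1))
```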
Focal loss [20] gives a small weight to well-classified examples and a large weight to examples that are difficult to classify, focusing training on the difficult examples. This alleviates the problem of learning being overwhelmed by negative samples; in other words, it alleviates the class-imbalance problem by down-weighting easy examples.
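The down-weighting of easy examples can be sketched in a few lines; the alpha and gamma defaults below follow the common convention for focal loss [20] and are not values specific to this paper.

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: easy examples (p_t near 1) are down-weighted
    by the modulating factor (1 - p_t)**gamma.

    p: predicted probability of the positive class, y: label in {0, 1}.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))

# A well-classified example (p = 0.95) contributes far less than a hard one (p = 0.3)
print(binary_focal_loss(np.array([0.95, 0.3]), np.array([1, 1])))
```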
The MSE loss [3] is the square of the difference between the predicted and actual values of the model. It has the advantage of being differentiable, but its disadvantage is that an incorrectly predicted outlier strongly influences the value because the error is squared. The Huber loss [23] overcomes this drawback: it squares the error only within an interval, down-weighting outliers with large errors and thus reducing their influence on the loss.
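The quadratic-inside, linear-outside behaviour of the Huber loss can be written compactly; the threshold delta = 1.0 below is an illustrative value, not one taken from the paper.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Huber loss: squared error for small residuals, linear for large
    ones, so outliers are down-weighted compared with plain MSE."""
    residual = np.abs(pred - target)
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return np.mean(np.where(residual <= delta, quadratic, linear))

# An outlier inflates MSE far more than the Huber loss
pred, target = np.array([1.0, 10.0]), np.array([1.2, 2.0])
print(np.mean((pred - target) ** 2), huber_loss(pred, target))
```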
The Kullback-Leibler divergence (KLD) loss [24] measures the difference between two probability distributions. The KLD loss is computed in a direction that reduces the difference in information entropy by using a distribution that approximates the ideal distribution. When a precise probability distribution exists, the change in entropy is reduced when a probability distribution that approximates it is used in the sampling process instead of an unrelated distribution.
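For two discrete distributions p (reference) and q (approximation), the KLD can be computed directly; the sketch below assumes both arrays are valid probability vectors and uses an illustrative clipping constant for numerical safety.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    It is zero only when q matches p exactly and grows as q drifts away."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
print(kl_divergence(p, p))                           # 0.0: identical distributions
print(kl_divergence(p, np.array([0.4, 0.4, 0.2])))   # > 0: mismatch
```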
The cross-entropy loss [1] operates on probability values between 0 and 1. As the predicted probability deviates from the correct label, the cross-entropy loss increases; its value is proportional to the difference between the predicted value and the label. Cross-entropy loss is primarily used for classification: binary cross-entropy is used when classifying two classes, and categorical cross-entropy when separating multiple classes.
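A minimal sketch of both variants, assuming probabilities have already been computed by the model (the function names and clipping constant are illustrative):

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy: grows as the predicted probability p
    moves away from the 0/1 label y."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def categorical_cross_entropy(probs, onehot, eps=1e-7):
    """Categorical cross-entropy over a batch of one-hot labels."""
    probs = np.clip(probs, eps, 1.0)
    return np.mean(-np.sum(onehot * np.log(probs), axis=1))

print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))
```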
Entropy is the number of bits required to encode data under the true distribution, whereas cross-entropy is the number of bits required when an incorrect (model) distribution is used. The KLD is the difference between the cross-entropy and the entropy; because the entropy of the data distribution is a constant term, minimizing the cross-entropy also minimizes the KLD.
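This relationship can be stated compactly: for a fixed data distribution p and a model distribution q,

\[
H(p, q) \;=\; H(p) \;+\; D_{\mathrm{KL}}(p \,\|\, q),
\qquad
D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_i p_i \log \frac{p_i}{q_i},
\]

where H(p) is constant with respect to the model, so minimizing the cross-entropy H(p, q) is equivalent to minimizing the KLD; this is the constant term referred to above.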
Diverse methodologies characterize the landscape of loss functions, each presenting its own set of strengths and weaknesses. For example, the MSE loss [3] allows differentiation but is sensitive to outliers owing to its squared error calculation, whereas the Huber loss [23] addresses this issue by squaring errors only within a certain interval, making it less sensitive to outliers. Understanding the strengths and weaknesses of these existing loss functions underscores the need for advances such as SCCSL, which presents a refined approach that addresses the limitations inherent in previous loss functions. Its ability to maximize the margin between the predicted and actual values, correct erroneous training directions, and mitigate sparse-value issues makes it an alternative to conventional loss functions.
SCCSL operates as a novel loss function in GANs, enhancing model optimization and image generation performance. SCCSL consists of three components. The first component of SCCSL is the Smooth Loss Component. It maximizes the margin between the real and predicted values to enhance the performance of the model, aiding in the removal of redundant information and facilitating precise decision-making. The second component is the Cascade Component, which operates in a cascading manner. It utilizes the output of existing losses as the input for new losses and modifies the direction of learning. This preserves some of the existing information while reflecting new features. The final component of SCCSL is the Scale Correction Component, which proposes a novel scale correction method for model optimization. This corrects sparse information by adjusting the scale influenced by existing losses, stabilizing model learning, and enhancing performance. SCCSL improves model optimization and enhances the image generation capabilities within GANs by integrating the Smooth Loss Component, Cascade Component, and Scale Correction Component.
We propose a smooth loss component whose formulation is differentiable; because its derivative exists, the loss can be differentiated throughout training.
The cascade component uses the result of the existing loss as the input of a second loss. This has the advantage that, when the existing loss is learned in the wrong direction, the correction can steer learning back toward the right direction while partially preserving the present information; however, an incorrect correction may cause side effects.
We also propose a novel scale correction for model optimization; here, the cascade loss refers to the existing (preceding) loss.
The scale correction component is a calibration method for sparse values; it enables effective learning by calibrating these sparse values.
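The exact formulas of the three components are defined by the paper's equations and are not reproduced here; the following is only a conceptual sketch of how a smooth margin term, a cascade step that feeds the base loss into a second loss, and a scale correction for sparse values could be composed. Every function body below is an assumption for illustration, not the authors' definition.

```python
import numpy as np

def smooth_margin_term(real_score, fake_score):
    # Assumed stand-in for the smooth component: a differentiable
    # (softplus-style) margin between real and predicted scores.
    return np.mean(np.log1p(np.exp(-(real_score - fake_score))))

def cascade(base_loss, second_loss_fn):
    # Cascade idea from the paper: the output of the existing loss is
    # fed as the input of a new loss, which can correct its direction.
    return second_loss_fn(base_loss)

def scale_correction(loss_value, scale=10.0, threshold=1e-3):
    # Assumed stand-in for the scale correction component: amplify
    # sparse (near-zero) loss values so they still drive learning.
    return loss_value * scale if abs(loss_value) < threshold else loss_value

# Illustrative composition only; the real SCCSL combines the three
# components according to the equations in the paper.
base = smooth_margin_term(np.array([0.8, 0.9]), np.array([0.3, 0.4]))
total = scale_correction(cascade(base, second_loss_fn=lambda x: x ** 2))
print(total)
```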
We analyzed SCCSL by dividing the numerical range of the total loss into intervals. When the total loss diverges to infinity, NaN values are generated and no learning occurs. A positive value indicates that the sum of the weights is less than one. If the loss is zero, no learning occurs, and negative values indicate a mismatch with the prediction; negative infinity does not occur. Thus, SCCSL does not learn when the loss is zero but learns in both the positive and negative cases.
The proposed method has three advantages. First, it helps the model judge information efficiently by maximizing the margin between distribution characteristics; eliminating duplicate information determined by the model enables a more precise determination. Second, a result obtained through one characteristic is calibrated using another characteristic, so the model acquires generalized characteristics; when learning proceeds in the wrong direction, the mixture is corrected using the other characteristics, steering training in a generalized direction. Third, it helps the model learn more stably by correcting sparse information during learning.
The visualization of the proposed loss is shown in Figure 2. To analyze the influence of the proposed formulation, we illustrate the loss with a simple model over a distribution from −1 to 1 and analyze the results for values of equal and unequal size. In Figure 2, the proposed smooth calibration loss has the following effect when learning values of unequal size: it consists of values thicker than the boundary, owing to the other losses, and a smaller value representing the boundary.
In Figure 3, larger batch sizes converged later. Although the Adam optimizer is widely used, we analyzed the effects of three optimization methods to validate the proposed method. In Figure 3(i), the AdaDelta result shows that the generator loss of the original method and the focal loss are learned with similar loss values. The loss value may appear somewhat higher for the smooth loss than for the other two methods because it reliably obtains model-optimization information.
It can be observed that when the weight decay is incorporated, as shown in Figure 5(ii), the disparity between generator loss and discriminator loss widens over repeated epochs. Weight decay facilitates learning in the correct convergence direction while accentuating the gap between the generator and discriminator losses. The weight decay is expected to cause a shift from unstable to stable learning patterns. However, empirical assessments of actual performance demonstrate that image generation is enhanced regardless of the presence of weight decay. The learning performance was consistent with the application of focal loss. Nevertheless, it appears that improvements in the actual performance lead to a reduction in the image generation performance when the model consistently generates images through stable learning. In the evaluations under single-loss conditions, both smooth and focal losses consistently outperformed existing loss functions.
In Figure 5, we evaluated the performance and loss results with and without weight decay for the Adagrad, AdaDelta, and Adam optimization methods.
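The weight-decay variants in this comparison can be reproduced with standard optimizer settings; the following is a minimal PyTorch-style sketch, where the decay coefficient and the toy parameters are illustrative choices rather than the paper's exact hyperparameters (the learning rate 7e-4 follows the value given in the figure captions).

```python
import torch

# Toy parameters standing in for the generator/discriminator weights.
params = [torch.nn.Parameter(torch.randn(3, 3))]

# The three optimizers compared in the experiments, each with and
# without L2 weight decay (decay coefficient chosen for illustration).
optimizers = {
    "adadelta":       torch.optim.Adadelta(params),
    "adadelta_decay": torch.optim.Adadelta(params, weight_decay=1e-4),
    "adagrad":        torch.optim.Adagrad(params),
    "adagrad_decay":  torch.optim.Adagrad(params, weight_decay=1e-4),
    "adam":           torch.optim.Adam(params, lr=7e-4),
    "adam_decay":     torch.optim.Adam(params, lr=7e-4, weight_decay=1e-4),
}
```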
The effect of batch size on the smooth loss was analyzed with weight decay. As shown in Figure 6(a), applying weight decay reduces loss fluctuations and maximizes the discriminator and generator losses.
The image-generation performance did not improve when focal loss was used. However, it appears that when the actual performance improves and the model identifies the generated images more accurately through stable learning, the image-generation performance decreases. When measuring performance under a single-loss condition, the smooth and focal losses tend to outperform the existing losses.
Figure 6 shows the scale correction method for various batch sizes according to the initial distribution. For batch sizes of 4 and 16, training with the Gumbel distribution showed a more stable average peak signal-to-noise ratio (PSNR) than the other distributions. For batch sizes of 32 and 64, the reflected information followed the logistic distribution.
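The initial distributions compared here can be drawn directly with NumPy; a minimal sketch follows, in which the location and scale parameters are illustrative defaults and the "uniform" option is only an assumption for what the figures call the random distribution.

```python
import numpy as np

def sample_latent(batch_size, z_dim=100, distribution="gumbel", rng=None):
    """Sample a batch of latent vectors z from one of the initial
    distributions compared in the experiments."""
    rng = rng or np.random.default_rng()
    samplers = {
        "normal":   lambda: rng.normal(0.0, 1.0, size=(batch_size, z_dim)),
        "uniform":  lambda: rng.uniform(-1.0, 1.0, size=(batch_size, z_dim)),  # assumed "random"
        "gumbel":   lambda: rng.gumbel(0.0, 1.0, size=(batch_size, z_dim)),
        "logistic": lambda: rng.logistic(0.0, 1.0, size=(batch_size, z_dim)),
        "laplace":  lambda: rng.laplace(0.0, 1.0, size=(batch_size, z_dim)),
    }
    return samplers[distribution]()

z = sample_latent(batch_size=16, distribution="logistic")  # (16, 100) latent batch
```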
Figure 7 compares the two methods over four distributions. The experimental results showed that the average PSNR decreased with the random and Laplace distributions for the smooth correction.
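PSNR, the metric used for these comparisons, can be computed as follows; the sketch assumes 8-bit images with a peak value of 255.

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between a reference image and a
    generated image; higher values indicate closer pixel-wise agreement."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```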
Various other loss settings were analyzed according to the smooth correction cost function shown in Figure 8. Figure 8(i) shows the result as the L1 coefficient increases: as L1 grows, oscillation occurs more frequently with increasing epochs. Figure 8(ii) shows the result as the L2 coefficient increases: for L2, the oscillation tends to grow with the number of epochs, and the cost functions of the cascaded parts in the initial and middle stages maximize each other during learning. Figure 8(iii) shows the tendency of the cost function when L1 and L2 increase simultaneously; here, the oscillation amplitude increases with the coefficient values, and the number of oscillations first increases and then decreases, as in Figure 8(ii)(a).
First, we aimed to verify the experimental results using factors that can affect the acquisition of model-optimization information with a single loss. The effect of batch size on the single smooth loss was analyzed using focal loss. Applying focal loss reduced the change in loss and maximized the discriminator and generator losses; this indicates that weight decay is less sensitive to the batch size. Without focal loss, however, larger batch sizes converge later. Although the Adam optimizer performed well, we analyzed the impact of the three optimization methods to confirm the proposed method. The AdaDelta results show that the generator loss learns the cross-entropy and focal losses with similar loss values. For the smooth loss, the loss value may appear somewhat higher than for the other two methods because it reliably obtains model-optimization information; because only the maximum value is retained in the smooth loss calculation, the collected model-optimization information may not point in the correct direction. The smooth correction loss was analyzed with three optimization methods and latent-variable spaces to determine how the size of the Z-latent space, the input to the generation model, affects generation performance. As the latent-variable space increased, convergence appeared later, reflecting the influence of the Z-latent space size. These results indicate that the optimized values must be mapped and stored in memory with appropriately sized parameter storage, which is necessary for model convergence through optimization.
In summary, the smooth component provides smooth information, and with focal loss the model converges by stably acquiring knowledge about a single loss owing to the correction. However, there are limitations in obtaining model-optimization information with a single loss. The cascade component offers a way to address these constraints; to verify it, we trained the proposed method on the Fashion-MNIST dataset. The experimental results obtained with the cascade component are as follows.
We validated the proposed cascade component with a Vanilla GAN using four batch sizes, L1 0.0 loss, L2 0.5 loss, a Z-latent space size of 100, and the grayscale Fashion-MNIST dataset, as shown in Figure 8. When only a single component is applied, the error ranges from 0.5 to 1.4, whereas the cascade component yields a smaller error of 0.2 to 0.4. When the existing loss was learned in the wrong direction, the loss applied afterward was learned reliably through the correction effect. Cascade components have the advantage of being able to reuse losses with the same properties and to apply losses with different properties, and they form pipelines that can be parallelized. Rather than applying a single loss, this approach applies multiple losses in sequence, which can produce a more accurate loss and shorten the learning time. Cascaded components thus offer improved performance, parallelism, and the reflection of different characteristics; however, they have the disadvantage of an increased number of losses.
We experimented with three optimization methods and obtained the best results with the Adam optimizer. In particular, the proposed method demonstrates that the corrected initial loss performs better than the existing loss, and the scale correction loss converged rapidly.
Coefficient changes were examined using the scale correction method: alpha was increased from 0.25 to 0.75 in steps of 0.25, and beta from 2 to 4 in steps of 1 in the smooth correction method, as shown in Figure 9. The experimental results show that varying alpha and gamma does not affect the generated images; the parameters of the scale term had only a small effect on image creation in the Vanilla GAN.
We proposed a new loss function composed of smooth, cascade, and scale correction components to optimize image generation. First, the smooth component reduces errors by maximizing the margin between feature information. Second, the cascade component exhibits parallelism in the loss and shows that losses with different characteristics can be applied; model optimization using cascade components plays an essential role in forming new information. Third, we confirmed that learning proceeded well by stably reflecting the information through scale correction, which scales sparse values.
In future work, it will be important to design loss functions that allow the model to be optimized along various paths and that guide it toward the optimization path.
No potential conflict of interest relevant to this article was reported.
Figure 1. Visualization of the experimental system configuration.
Figure 2. Visualization of the experimental loss.
Figure 3. Comparison of loss optimization with four batch sizes, L1 0.0 loss, L2 0.5 loss, a Z-latent size of 100, and Vanilla GAN on the Fashion-MNIST dataset. i) AdaDelta, ii) Adagrad, iii) Adam; a) original loss, b) smooth loss, c) correction loss.
Figure 4. Comparison of loss optimization with four batch sizes, L1 0.0 loss, L2 0.5 loss, a Z-latent size of 100, and Vanilla GAN on the Fashion-MNIST dataset. i) AdaDelta, ii) Adagrad, iii) Adam.
Figure 5. Comparison of batch sizes for the smooth loss during optimization on the MNIST dataset with a learning rate of 0.0007, L1 0.0 loss, L2 0.75 loss, and a Z-latent space size of 1000 under Adam with decay. a) batch size 16, b) batch size 32, c) batch size 64; i) without weight decay, ii) with weight decay.
Figure 6. Smooth correction with a learning rate of 0.0007, four batch sizes, L1 0.0 loss, L2 0.0 loss, a Z-latent space size of 100, the Adagrad optimizer, and LSGAN: analysis of the effect of the initial distribution across batch sizes on model optimization. a) Gumbel distribution, b) logistic distribution; i) batch size 4, ii) batch size 16, iii) batch size 32, iv) batch size 64.
Figure 7. Dependency analysis of color image generation: PSNR of various distributions over 8 batch sizes. a) smooth method, b) smooth correction; i) random distribution, ii) Laplace distribution, iii) logistic distribution, iv) Gumbel distribution.
Figure 8. Analysis of the smooth correction under various other loss settings with batch size 4, a Z-latent space of 100, scale correction coefficients alpha 2.0 and gamma 0.25, Vanilla GAN, and the CIFAR-100 dataset. i) L1 and L2 losses as the L1 coefficient increases, ii) L1 and L2 losses as the L2 coefficient increases, iii) L1 and L2 losses as both coefficients increase.
Figure 9. Comparison of the effect of the smooth correction loss coefficients under the Gumbel distribution in Vanilla GAN with a learning rate of 0.0007, batch size 4, L1 0.25 loss, L2 0.25 loss, a Z-latent space size of 100, and the Adam optimizer. i.a) alpha 0.25, i.b) alpha 0.5, i.c) alpha 0.75; ii.a) beta 2.0, ii.b) beta 3.0, ii.c) beta 4.0.