International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(4): 333-342
Published online December 25, 2024
https://doi.org/10.5391/IJFIS.2024.24.4.333
© The Korean Institute of Intelligent Systems
Prisha Patel, Sakshi Chauhan, Shaurya Gupta, Tawishi Gupta, and Renuka Agrawal
Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
Correspondence to :
Renuka Agrawal (renuka.agrawal@sitpune.edu.in)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The automobile insurance industry faces significant challenges in detecting fraudulent activities because of the imbalanced nature of fraud data, which traditional machine learning algorithms struggle to address effectively. In this study, to improve the efficiency of fraud detection, we investigated three approaches: the Synthetic Minority Oversampling TEchnique (SMOTE), generative adversarial networks (GANs), and a hybrid approach combining SMOTE with GANs (SMOTEfied-GAN). SMOTE addresses class imbalance by oversampling the minority class, whereas GANs generate synthetic data that resemble the training data distribution. The SMOTEfied-GAN combines the strengths of both methods by oversampling the minority class using SMOTE before training the GAN to enhance the quality of the synthetic samples. A comparative analysis of these approaches was conducted using a dataset from the automobile insurance industry. Our evaluation included metrics such as precision, recall, and F1-score. These findings suggest that each approach offers unique advantages in improving fraud-detection efficiency.
Keywords: Fraud detection, Generative adversarial network (GAN), Data balancing, Synthetic data generation, Machine learning algorithms, Synthetic Minority Over-sampling TEchnique (SMOTE)
Automobile insurance is a demanding industry, as financial risks are involved in owning and using a car. However, the most significant problem encountered by insurance businesses is the identification and avoidance of fraud. Moreover, the expenses generated by fabricated claims, ranging from fake accidents and exaggerated losses to deceitful information, significantly affect an insurer's finances and operations [1].
Over the past few years, technology and big data have introduced new approaches for enhancing fraud control measures. Such tasks have long been addressed using traditional machine learning (ML) techniques. However, these algorithms do not work efficiently because of the inherently skewed nature of fraud data, as the ratio of fraudulent cases to the total number of cases is very small [2, 3]. This class imbalance results in an inaccurate model, in which true positives of the minority class, that is, fraudulent cases, are missed [4].
In this regard, this paper presents a new model that combines the Synthetic Minority Oversampling TEchnique (SMOTE) and generative adversarial networks (GANs) to obtain SMOTEfied-GAN. This method integrates the two major advantages of oversampling: SMOTE helps increase the number of samples of minority classes, and GAN can generate highly realistic samples [5]. To assess our approach, the Kaggle dataset with 10,000 insurance claims and 19 features was used, proving that the proposed techniques help enhance fraud detection based on precision, recall, and the F1-score.
Consequently, this study not only emphasizes the significance of employing more than one method of data augmentation but also proposes a powerful method that can successfully solve the problem of class imbalance in the automobile insurance sector, laying down foundations for better fraud detection [6]. The proposed study presents a unique automobile insurance fraud detection model that is scalable and capable of detecting fraudulent claims. The efficacy of the proposed model is strengthened through a comparative analysis with traditional ML models. The major contributions of this study are summarized as follows.
· A novel ensemble model is proposed for automobile insurance fraud detection, capable of identifying both fraudulent and non-fraudulent claims. The model was trained on a balanced dataset, using SMOTE and GAN to perform the balancing.
· The performance of the proposed model was validated against those of traditional models for insurance fraud detection.
The remainder of the article is organized as follows: Section 2 describes related works in similar domains conducted by different researchers. Section 3 presents a step-by-step methodology for insurance fraud detection. Section 4 presents results and discussions. Finally, Section 5 concludes the paper.
A comprehensive review of the existing literature is essential for understanding the current state-of-the-art in fraud detection methodologies within the automobile insurance industry. Each study is summarized with respect to the dataset used, methodology employed, outcomes observed, and limitations encountered. Through a critical examination of prior research, we identify the gaps in the literature and provide novel insights into the field of fraud detection in automobile insurance.
In related works, Strelcenia and Prakoonwit [7] investigated fraud detection using GANs to generate synthetic data from the Kaggle dataset. Their framework included algorithms such as random forest, k-nearest neighbor (KNN), and iterative dichotomiser (ID3), comparing them with the multiscale kernel neural network (MNET) and support vector machine (SVM). They introduced K-CGAN, an oversampling method that combines a conditional generative adversarial network with Kullback–Leibler (KL) divergence and employs the cost-sensitive C-Boost model. Alshawi [2] applied GANs to generate synthetic data for credit card fraud detection using the Kaggle dataset. They evaluated six ML algorithms, namely logistic regression, decision trees, naïve Bayes, random forest, extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost), and achieved high accuracy, precision, recall, and area under the ROC curve (AUC) scores. Rodriguez-Almeida et al. [8] investigated synthetic patient data generation for illness prediction using small, unbalanced medical datasets. They balanced the data using SMOTE and ADASYN, further augmenting the data using Gaussian copulas and conditional tabular GANs (CTGANs). Gaussian copulas excelled in maintaining linear correlations with minimal optimization, whereas CTGANs required refinement. The study highlighted privacy concerns and potential overfitting with Gaussian copulas, emphasizing the need for further research on privacy-preservation methods. Alharbi and Kimura [9] addressed missing-data imputation using GANs on four numerical datasets: Pima Indian diabetes, Iris, Glass, and Wine. They proposed random GAN and mesh GAN, with the latter using a 5-dimensional mesh for diverse records. Their GAN-based methods outperformed traditional techniques and exhibited stable performance across varying missing-data rates. Ding et al. [10] proposed an improved variational autoencoder GAN (VAEGAN) for credit card fraud detection by combining synthetic data generation with XGBoost classification. This method utilized SMOTE, GAN, VAE, and VAEGAN for oversampling, and XGBoost outperformed the other algorithms. The optimized VAEGAN improved precision, F1-score, and recall, showing potential beyond fraud detection. Endres et al. [11] compared synthetic data generation methods on the Adult Census, Airbnb, and Airline datasets, evaluating SMOTE, Synthpop parametric and nonparametric (SP-NP), synthetic data vault with Gaussian copula (SDV-G), and data synthesizer (DS); these performed best in terms of proximity and processing time, whereas GAN, SDV-GAN, and VAE faced scalability and memory issues. Pandey et al. [13] proposed an adversarial fraud-generation framework using a Wasserstein GAN (WGAN) to synthesize diverse fraud samples conditioned on non-fraudulent transactions. This approach improved the recall and F1-score over existing models such as SMOTE and WGAN, achieving an AUC of 0.976 on the European Credit Card and customs information system (CIS) fraud datasets. The study emphasized the importance of increasing the number of synthetic samples for better fraud detection, although issues with mode collapse and unstable GAN training were noted, requiring careful parameter tuning.
Although fraud detection has been investigated in the finance sector, the automobile insurance sector lacks a fraud-detection model. In addition, researchers have mostly worked on original datasets, for which data balancing is lacking. This study considered fraud detection in the automobile insurance domain, using an ensemble model of SMOTE and GAN for augmented data generation. After data balancing, ML techniques were applied for model selection. Automobile insurance fraud is a sector that researchers in the ML domain need to explore, as little work has been conducted in this domain.
This section describes the methodological strategy used to improve the effectiveness of fraud identification in the automobile insurance sector. The methodology began with data collection followed by pre-processing. The next stage involved synthetic data generation and data balancing. This is required for precise results. Model training and testing were further performed. Figure 1 outlines the proposed methodology.
As shown in Figure 1, the pipeline begins with data collection and pre-processing, followed by data augmentation and balancing using SMOTE, GAN, and SMOTEfied-GAN. After selecting the best features, classifiers such as decision tree, logistic regression, and random forest were chosen for model training and testing, culminating in knowledge discovery.
The data used in this study were obtained from Kaggle and consist of a large set of car insurance claim data. The dataset comprises 10,000 distinct sample insurance claims, each described by 19 fields covering various characteristics of policyholders and their insurance claim history. These attributes play a significant role in insurance-risk prediction and in the analysis of various insurance-related patterns.
The fields include ID, a unique number representing a record, and AGE, the policyholder's age. GENDER and RACE are demographic features, and DRIVING_EXPERIENCE is the number of years the policyholder has been driving. EDUCATION (level of education) and INCOME (financial status) describe socioeconomic standing, and CREDIT_SCORE reflects a client's creditworthiness. Several features relate to the insured asset: VEHICLE_OWNERSHIP, VEHICLE_YEAR, and VEHICLE_TYPE.
Social and lifestyle variables include MARRIED, CHILDREN, and POSTAL_CODE. Usage metrics include ANNUAL_MILEAGE, the average number of miles travelled in a year. Driving history factors include SPEEDING_VIOLATIONS, PAST_ACCIDENTS, and DUIS. Finally, the dataset includes the dependent variable of the study, the OUTCOME of the claim.
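For concreteness, the 19 fields above can be gathered into a single record. The values below are invented for illustration and are not drawn from the actual Kaggle dataset.

```python
# Hypothetical claim record illustrating the 19-field schema described above.
# All values are invented; only the field names come from the dataset description.
claim = {
    "ID": 569520,
    "AGE": "26-39",
    "GENDER": "female",
    "RACE": "majority",
    "DRIVING_EXPERIENCE": "10-19y",
    "EDUCATION": "university",
    "INCOME": "upper class",
    "CREDIT_SCORE": 0.63,
    "VEHICLE_OWNERSHIP": 1,
    "VEHICLE_YEAR": "after 2015",
    "VEHICLE_TYPE": "sedan",
    "MARRIED": 0,
    "CHILDREN": 1,
    "POSTAL_CODE": 10238,
    "ANNUAL_MILEAGE": 12000,
    "SPEEDING_VIOLATIONS": 0,
    "PAST_ACCIDENTS": 0,
    "DUIS": 0,
    "OUTCOME": 0,  # dependent variable: 0 = genuine claim, 1 = fraudulent claim
}
```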
Pre-processing ensures the integrity and suitability of the dataset for subsequent analyses. The pipeline first converted categorical columns, such as “AGE” and “DRIVING_EXPERIENCE”, into numerical data to facilitate quantitative analysis, and irrelevant columns such as “RACE” were eliminated. Missing values in numeric columns were replaced with the mean value of each column to maintain dataset integrity. Categorical variables were one-hot encoded into a binary format compatible with ML algorithms, and numerical features were standardized using a standard scaler to ensure uniform feature scales, preventing features with larger magnitudes from dominating. To allow independent model validation, the dataset was divided into training and testing datasets.
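The imputation, scaling, and encoding steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline code actually used; the helper names and toy values are assumptions.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Scale a numeric column to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0  # guard against a constant column
    return [(v - mean) / std for v in values]

def one_hot(values):
    """One-hot encode a categorical column into binary indicator columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Toy columns standing in for CREDIT_SCORE and VEHICLE_TYPE.
credit_scores = [0.52, None, 0.61, 0.43]
vehicle_types = ["sedan", "sports_car", "sedan", "sedan"]

scores = standardize(impute_mean(credit_scores))
encoded = one_hot(vehicle_types)
```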
Finally, the pre-processed dataset was visually inspected to verify the effectiveness of the pre-processing techniques and ensure data integrity before being saved in a new CSV file for subsequent analyses. These pre-processing steps collectively prepared the dataset for robust analysis and model development, laying the groundwork for fraud detection in the automobile insurance domain.
A major problem in ML model training is that the dataset is imbalanced: although overall performance appears high, the model is poorly trained on the minority class. The proposed method uses SMOTE and GAN to balance the dataset and increase the number of samples.
Recently, various types of GANs have been employed to generate synthetic image and textual data. Examples include conditional GANs used on tabular data [14–16], VAEs applied to electronic health record (EHR) data [17, 18], and WGANs used for generating time-series data [19]. Our initiative focused on significantly enhancing fraud-detection accuracy using a GAN, with the main objective of generating synthetic data that closely mimic authentic instances of fraudulent claims. The Vanilla GAN was selected because of its simplicity and effectiveness in generating diverse and realistic synthetic data. Its generator component creates data that closely resemble authentic instances of fraudulent claims, while the discriminator distinguishes synthetic from actual data without the extra machinery required by conditional or specialized variants. This streamlined approach allowed the model to capture the nuanced patterns and characteristics of fraudulent activities, leading to enhanced fraud-detection accuracy and model robustness [20]. The framework involves training a discriminator model to effectively distinguish between real and synthetic instances, while a generator model learns to produce synthetic samples that exhibit statistical attributes and patterns similar to real fraudulent claims. The generator takes a latent noise vector of dimension 100 and consists of dense layers with rectified linear unit (ReLU) activation functions and batch normalization; the discriminator includes layers with dropout regularization and a sigmoid activation function in the output layer. The discriminator was alternately trained on batches of real and synthetic data to strengthen its capacity to distinguish between the two. To optimize the iterative learning process, we set the batch size to 64 and the number of training epochs to 500. Figure 2 shows the GAN architecture.
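The adversarial objective described above can be illustrated with the standard binary cross-entropy losses; the probability values below are assumed for illustration only. Note that once the discriminator can no longer separate real from synthetic samples, i.e., it outputs 0.5 everywhere, its loss settles at ln 2 ≈ 0.693, the plateau visible in Table 1.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake)) / 2.0

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return -math.log(d_fake)

# Early in training the discriminator is confident: D(real)=0.9, D(fake)=0.1.
early_d = discriminator_loss(0.9, 0.1)
early_g = generator_loss(0.1)

# At equilibrium the discriminator cannot tell the two apart: D(.) = 0.5,
# giving a discriminator loss of ln 2, matching the ~0.69 plateau in Table 1.
eq_d = discriminator_loss(0.5, 0.5)
```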
During the training process of the GAN, the progression of the discriminator and generator losses was monitored across multiple epochs to assess the model’s learning dynamics and convergence. Table 1 shows loss values by epoch for the GAN model.
In Table 1, as the number of epochs increases, both the discriminator and generator losses settle near 0.69, indicating that the discriminator can no longer reliably separate real from synthetic samples. A major contributing factor is data imbalance, as fraudulent samples are far fewer than non-fraudulent ones, leaving the generator little minority-class signal to learn from. Therefore, another technique is required to balance the dataset.
SMOTE balances a dataset by creating synthetic samples of the minority class to alleviate class imbalance [21]. The architecture is shown in Figure 3. Using KNN, SMOTE generates synthetic minority-class instances by interpolating between pre-existing minority-class samples [22]. This ensures that ML models are not biased towards the majority class (genuine claims) and can effectively identify and learn from the minority class (fraudulent claims). After applying SMOTE to address class imbalance, the number of fraudulent instances in the training dataset increased to 5,437 out of a total of 10,000 instances, approximately 50% of the total. Figure 3 illustrates the SMOTE process used to address class imbalance in automobile insurance fraud detection: after pre-processing, synthetic samples were generated for the minority class, creating a balanced dataset for training and evaluating the fraud-detection model.
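The interpolation step at the heart of SMOTE can be sketched as follows. This is a minimal plain-Python illustration assuming Euclidean distance, not the library implementation actually used in the study.

```python
import random

def smote_sample(minority, k=3, rng=None):
    """Generate one synthetic minority sample: pick a minority point, pick one
    of its k nearest neighbours, and interpolate a random fraction between them."""
    rng = rng or random.Random(0)
    x = rng.choice(minority)
    # k nearest neighbours of x among the other minority points (Euclidean).
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
    )[:k]
    n = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, n)]

# Toy minority class: four points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
```

Every synthetic point lies on the segment between an existing minority sample and one of its neighbours, which is why SMOTE densifies the minority region rather than inventing arbitrary feature combinations.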
The hybrid “SMOTEfied-GAN” methodology integrates SMOTE with the GAN framework to address class imbalance and generate synthetic data [23]. SMOTEfied-GAN loads a modified dataset with a binary target variable (OUTCOME), setting the classification threshold to the median value. SMOTE is first applied to the minority class to generate synthetic samples, increasing the total number of samples from 10,000 to 13,734; these are then used in conjunction with the original data to train a GAN. The architecture is shown in Figure 4. During training, the GAN iterates through 200 epochs with a batch size of 64, whereby the generator learns to produce synthetic data from a noise vector of dimension 100, and the discriminator learns to distinguish between real and synthetic data.
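The growth from 10,000 to 13,734 rows is consistent with SMOTE topping the minority class up to the majority count. The 6,867/3,133 class split below is inferred from those totals rather than stated in the text, so treat it as an assumption.

```python
def smote_growth(n_majority, n_minority):
    """Synthetic samples needed to bring the minority class up to the majority,
    and the resulting balanced-dataset size."""
    n_synthetic = n_majority - n_minority
    return n_synthetic, n_majority + n_minority + n_synthetic

# Assumed split consistent with the reported 10,000 -> 13,734 growth.
n_syn, total = smote_growth(6867, 3133)
```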
This iterative process refines the capacity of the GAN to produce realistic synthetic data, contributing to a more robust approach for handling class imbalances in ML tasks.
Table 2 shows that the discriminator loss decreases with the number of epochs, and the generator loss also trends downward. Thus, the discriminator's performance improved when distinguishing between real and fake data: a lower discriminator loss means that the discriminator confidently assigns higher probabilities to the real data and lower probabilities to the generated data.
Following data pre-processing and synthetic data generation, ML algorithms were applied as fraud-detection models to the extended dataset of automobile insurance claims. Three classifiers were implemented and evaluated. The first, logistic regression, is a simple and fast binary classification approach evaluated using accuracy/error, precision/recall, and F1-score as metrics. Decision tree belongs to the family of nonparametric supervised tree-modeling techniques that partition the feature space, and random forest is an ensemble learning technique that aggregates many decision trees to arrive at a solution. Among these, random forest showed the best results and was selected as the final model for fraud detection.
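The evaluation metrics used for these classifiers can be computed directly from predictions. The following is a minimal stdlib sketch with toy labels, not the evaluation code used in the study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 1 = fraudulent claim (minority class), 0 = genuine claim.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
p, r, f1 = classification_metrics(y_true, y_pred)
```

Reporting these per class, as Table 3 does, exposes minority-class performance that a single accuracy figure would hide.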
The complexity of SMOTE is primarily influenced by the number of minority class samples (N_min), the number of nearest neighbors (k), and the feature dimensionality (d); the dominant cost is the nearest-neighbor search, approximately O(N_min^2 · d) in a naïve implementation.
The computational complexity of the random forest arises from the number of decision trees (T), the number of training samples (n), and the number of features considered at each split (m), giving an overall training cost of approximately O(T · m · n log n).
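Rough operation counts implied by these complexity estimates can be compared numerically. The symbols and example values below (3,133 minority samples, 19 features, 100 trees, 4 split features) are assumptions for illustration, not figures reported by the study.

```python
import math

def smote_ops(n_minority, n_features):
    """Rough cost of the pairwise nearest-neighbour search in naive SMOTE:
    O(N_min^2 * d) distance computations."""
    return n_minority ** 2 * n_features

def forest_ops(n_trees, n_samples, n_split_features):
    """Rough cost of growing a random forest: each tree partitions n samples
    over ~log2(n) depth, trying m candidate features per split."""
    return n_trees * n_samples * math.log2(n_samples) * n_split_features

smote_cost = smote_ops(3133, 19)        # assumed minority count and dimensionality
forest_cost = forest_ops(100, 13734, 4) # assumed forest size on the balanced data
```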
In our comprehensive evaluation of fraud-detection methodologies in the automobile insurance industry, we investigated three distinct approaches: SMOTE, GAN, and a hybrid approach combining SMOTE with GANs (SMOTEfied-GAN). SMOTE effectively addresses class imbalance by oversampling the minority class, resulting in improved fraud-detection accuracy and robustness across multiple classifiers. GANs provide an alternative method for enhancing efficiency by generating synthetic data resembling the underlying distribution, thereby contributing to improved model training and performance metrics. The hybrid approach, SMOTEfied-GAN, synergistically combines the strengths of both methods, resulting in a more diverse and realistic synthetic dataset that further improves fraud-detection capabilities. A comparative analysis was conducted to assess ML models trained on datasets augmented using SMOTE, GAN, and the hybrid SMOTEfied-GAN. The results demonstrated significant improvements in fraud-detection accuracy and robustness for models trained on augmented datasets. The SMOTEfied-GAN exhibited superior performance metrics compared with the original dataset, highlighting the effectiveness of synthetic data generation techniques in addressing class imbalance and enhancing fraud-detection capabilities. Table 3 provides a comprehensive performance analysis of the ML models on the original and augmented datasets generated using the GAN, SMOTE, and SMOTEfied-GAN techniques.
All three ML models exhibited comparable overall performance on the original dataset, with an accuracy of 81% for logistic regression and random forest, and 76% for decision tree. However, the imbalance of the dataset, with a higher proportion of Class 0 cases, caused these models to perform better for the majority class (Class 0) while underperforming for Class 1, as indicated by the lower precision, recall, and F1-scores for Class 1.
Data augmentation using the GAN improved performance across models. Logistic regression and random forest achieved a notable accuracy of 91%, with F1-scores for Class 1 reaching 0.93, demonstrating the effectiveness of the GAN in addressing class imbalance. Decision tree also showed significant improvement, with an accuracy of 88% and balanced F1-scores for both classes.
Using SMOTE for minority class data augmentation resulted in varying performance improvements. Although logistic regression maintained an accuracy of 81%, its performance for Class 1 was marginally better than on the original dataset. Random forest outperformed the other models with an accuracy of 87%, whereas decision tree showed moderate improvements, achieving 82% accuracy.
The SMOTEfied-GAN, which combines the strengths of SMOTE and GAN, yielded the best results among the augmentation techniques. Random forest achieved the highest accuracy of 92%, along with F1-scores of 0.95 for Class 1 and 0.86 for Class 0. Decision tree and logistic regression also showed substantial gains in both precision and recall, with F1-scores reaching 0.92 and 0.91, respectively, for Class 1.
Figure 5 shows a comparative study of the accuracy of the ML models, revealing that the accuracy of the random forest model is higher than that of the other models. The dataset used for training and testing was the data augmented using SMOTEfied-GAN.
These results underscore the importance of balancing datasets for the unbiased evaluation of ML models. Although GAN and SMOTE individually address class imbalance effectively, the hybrid SMOTEfied-GAN approach provides the most robust augmentation strategy, significantly enhancing model performance, particularly for the minority class (Class 1). Among the evaluated models, random forest consistently outperformed the other classifiers across the augmented datasets, thus establishing itself as the best choice for this application.
In summary, our research advances the field of fraud detection in automobile insurance by addressing the challenge of class imbalance through innovative techniques, such as SMOTE, GANs, and SMOTEfied-GAN. While GANs demonstrated superior performance overall, the SMOTEfied-GAN proved to be particularly effective for random forest classifiers using balanced datasets. This nuanced understanding underscores the importance of tailoring data augmentation methods to specific ML algorithms. Future research could explore advanced deep learning techniques and the integration of temporal/contextual features for further refinement of fraud detection mechanisms. As technology evolves, continuous efforts to enhance detection strategies will be crucial in safeguarding against fraudulent activities and protecting the interests of insurance providers and policyholders.
No potential conflict of interest relevant to this article was reported.
Table 1. Discriminator and generator losses in GAN.
Epochs | Discriminator loss | Generator loss
---|---|---
1 | 0.29198 | 0.64165
50 | 0.57304 | 0.94987
100 | 0.70780 | 0.59036
150 | 0.69463 | 0.69675
200 | 0.69169 | 0.69192
Table 2. Discriminator and generator losses in SMOTEfied-GAN training.
Epochs | Discriminator loss | Generator loss
---|---|---
10 | 0.57601 | 0.44632
20 | 0.48884 | 0.37604
30 | 0.44740 | 0.33996
40 | 0.43613 | 0.35521
50 | 0.45767 | 0.27510
Table 3. Performance analysis of ML models on original and augmented datasets.
Dataset | Model | Accuracy (%) | Class 0 precision | Class 0 recall | Class 0 F1 | Class 1 precision | Class 1 recall | Class 1 F1
---|---|---|---|---|---|---|---|---
Original | Logistic regression | 81 | 0.84 | 0.89 | 0.86 | 0.72 | 0.62 | 0.67
Original | Random forest | 81 | 0.84 | 0.89 | 0.86 | 0.73 | 0.63 | 0.67
Original | Decision tree | 76 | 0.83 | 0.82 | 0.82 | 0.62 | 0.64 | 0.63
Augmented (GAN) | Logistic regression | 91 | 0.87 | 0.90 | 0.88 | 0.94 | 0.93 | 0.93
Augmented (GAN) | Random forest | 91 | 0.86 | 0.88 | 0.87 | 0.94 | 0.92 | 0.93
Augmented (GAN) | Decision tree | 88 | 0.84 | 0.82 | 0.83 | 0.91 | 0.91 | 0.91
Augmented (SMOTE) | Logistic regression | 81 | 0.82 | 0.78 | 0.80 | 0.81 | 0.85 | 0.83
Augmented (SMOTE) | Random forest | 87 | 0.87 | 0.86 | 0.86 | 0.87 | 0.88 | 0.88
Augmented (SMOTE) | Decision tree | 82 | 0.81 | 0.81 | 0.81 | 0.83 | 0.83 | 0.83
Augmented (SMOTEfied-GAN) | Logistic regression | 88 | 0.81 | 0.76 | 0.78 | 0.90 | 0.92 | 0.91
Augmented (SMOTEfied-GAN) | Random forest | 92 | 0.86 | 0.85 | 0.86 | 0.94 | 0.95 | 0.95
Augmented (SMOTEfied-GAN) | Decision tree | 89 | 0.81 | 0.81 | 0.82 | 0.92 | 0.92 | 0.92
International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(4): 333-342
Published online December 25, 2024 https://doi.org/10.5391/IJFIS.2024.24.4.333
Copyright © The Korean Institute of Intelligent Systems.
Prisha Patel, Sakshi Chauhan, Shaurya Gupta, Tawishi Gupta, and Renuka Agrawal
Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
Correspondence to:Renuka Agrawal (renuka.agrawal@sitpune.edu.in)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
The automobile insurance industry faces significant challenges in detecting fraudulent activities because of the imbalanced nature of fraud data, which traditional machine learning algorithms struggle to address effectively. In this study, to improve the efficiency of fraud detection, we investigated three approaches: the Synthetic Minority Oversampling TEchnique (SMOTE), generative adversarial networks (GANs), and a hybrid approach combining SMOTE with GANs (SMOTEfied-GAN). SMOTE addresses the class imbalance by oversampling the minority class, whereas GANs generate synthetic data that resemble the training data distribution. The SMOTEfied-GAN combines the strengths of both methods by oversampling the minority class using SMOTE before training the GAN to enhance the quality of the synthetic samples. A comparative analysis was conducted on these approaches using a dataset from the automobile insurance industry. Our evaluation included metrics, such as precision, recall and F1-score. These findings suggest that each approach offers unique advantages in improving fraud-detection efficiency.
Keywords: Fraud detection, Generative adversarial network (GAN), Data balancing, Synthetic data generation, Machine learning algorithms, Synthetic Minority Over-sampling TEchnique (SMOTE)
Automobile insurance is a demanding industry, as financial risks are involved in owning and using a car. However, the most significant problem encountered by insurance businesses is the identification and avoidance of fraud. Moreover the expenses generated by fabricated claims, ranging from fake accidents, exaggerated losses, to deceitful information, significantly affect the insurer’s finances and operations [1].
Over the past few years, technology and big data have introduced new approaches for enhancing fraud control measures. Such tasks have long been addressed using traditional machine learning (ML) techniques. However, these algorithms do not work efficiently because of the inherently skewed nature of fraud data, as the ratio of fraudulent cases to the total number of cases is very small [2, 3]. This class imbalance results in an inaccurate model, in which true positives of the minority class, that is, fraudulent modes are ignored [4].
In this regard, this paper presents a new model that combines the Synthetic Minority Oversampling TEchnique (SMOTE) and generative adversarial networks (GANs) to obtain SMOTEfied-GAN. This method integrates the two major advantages of oversampling: SMOTE helps increase the number of samples of minority classes, and GAN can generate highly realistic samples [5]. To assess our approach, the Kaggle dataset with 10,000 insurance claims and 19 features was used, proving that the proposed techniques help enhance fraud detection based on precision, recall, and the F1-score.
Consequently, this study not only emphasizes the significance of employing more than one method of data augmentation but also proposes a powerful method that can successfully solve the problem of class imbalance in the automobile insurance sector, laying down foundations for better fraud detection [6]. The proposed study presents a unique authomobile insurance fraud detection model that is scalable and capable of detecting fraudulent claims. The efficacy of the proposed model is strengthened through a comparative analysis with traditional machine-learning (ML) models. The major contributions of this study are summarized as follows.
· A novel ensemble model is proposed for automobile insurance fraud detection, capable of identifying both fraudulent and non-fraudulent claims. The model was trained on a balanced dataset, using SMOTE and GAN to perform the balancing.
· The performance of the proposed model was validated against those of traditional models for insurance fraud detection.
The remainder of the article is organized as follows: Section 2 describes related works in similar domains conducted by different researchers. Section 3 presents a step-by-step methodology for insurance fraud detection. Section 4 presents results and discussions. Finally, Section 5 concludes the paper.
A comprehensive review of the existing literature is essential for understanding the current state-of-the-art in fraud detection methodologies within the automobile insurance industry. Each study is summarized with respect to the dataset used, methodology employed, outcomes observed, and limitations encountered. Through a critical examination of prior research, we identify the gaps in the literature and provide novel insights into the field of fraud detection in automobile insurance.
In related works, Strelcenia and Prakoonwit [7] investigated fraud detection using GANs to generate synthetic data from the Kaggle dataset. Their framework included algorithms such as random forest, k-nearest neighbor (KNN), and iterative dichotomiser (ID3), comparing them with the multiscale kernel neural network (MNET) and support vector machine (SVM). They introduced K-CGAN, an oversampling method that combines a conditional generative adversarial network with Kullback–Leibler (KL) divergence and employs the cost-sensitive C-Boost model. Alshawi [2] applied GANs to generate synthetic data for credit card fraud detection using the Kaggle dataset. They evaluated six ML algorithms (logistic regression, decision trees, naïve Bayes, random forest, extreme gradient boosting [XGBoost], and adaptive boosting [AdaBoost]) and achieved high accuracy, precision, recall, and area under the ROC curve (AUC) scores. Rodriguez-Almeida et al. [8] investigated synthetic patient data generation for illness prediction using small unbalanced medical datasets. They balanced the data using SMOTE and ADASYN, further augmenting the data using Gaussian copulas and conditional tabular GANs (CTGANs). Gaussian copulas excelled in maintaining linear correlations with minimal optimization, whereas CTGANs required refinement. The study highlighted privacy concerns and potential overfitting with Gaussian copulas, emphasizing the need for further research on privacy preservation methods. Alharbi and Kimura [9] addressed missing data imputation using GANs on four numerical datasets: Pima Indian diabetes, Iris, Glass, and Wine. They proposed random GAN and mesh GAN, with the latter using a 5-dimensional mesh for diverse records. The GAN-based methods outperformed traditional techniques and exhibited stable performance across varying missing data rates.
Ding et al. [10] proposed an improved variational autoencoder GAN (VAEGAN) for credit card fraud detection, combining synthetic data generation with XGBoost classification. This method evaluated SMOTE, GAN, VAE, and VAEGAN for oversampling, and XGBoost outperformed the other algorithms. The optimized VAEGAN improved precision, F1-score, and recall, showing potential beyond fraud detection. Endres et al. [11] compared synthetic data generation methods (SMOTE, Synthpop parametric and nonparametric [SP-NP], synthetic data vault with Gaussian copula [SDV-G], and data synthesizer [DS]) on the Adult Census, Airbnb, and Airline datasets; these methods performed best in terms of proximity and processing time, whereas GAN, SDV-GAN, and VAE faced scalability and memory issues. Pandey et al. [13] proposed an adversarial fraud-generation framework using a Wasserstein GAN (WGAN) to synthesize diverse fraud samples conditioned on non-fraudulent transactions. This approach improved the recall and F1-score over existing models such as SMOTE and WGAN, achieving an AUC of 0.976 on the European Credit Card and customs information system (CIS) fraud datasets. The study emphasized the importance of increasing the number of synthetic samples for better fraud detection, although issues with mode collapse and unstable GAN training were noted, requiring careful parameter tuning.
Although fraud detection has been investigated extensively in the finance sector, the automobile insurance sector lacks dedicated fraud-detection models. In addition, researchers have mostly worked on original datasets without addressing data balancing. This study considers fraud detection in the automobile insurance domain, using an ensemble of SMOTE and GAN for augmented data generation. After data balancing, ML techniques were applied for model selection. Automobile insurance fraud remains an underexplored area for ML researchers, as little work has been conducted in this domain.
This section describes the methodological strategy used to improve the effectiveness of fraud identification in the automobile insurance sector. The methodology began with data collection, followed by pre-processing. The next stage involved synthetic data generation and data balancing, which is required for reliable results. Model training and testing were then performed. Figure 1 outlines the proposed methodology.
As shown in Figure 1, the pipeline begins with data collection and pre-processing, followed by data augmentation and balancing using SMOTE, GAN, and SMOTEfied-GAN. After selecting the best features, classifiers such as decision tree, logistic regression, and random forest were chosen for model training and testing, culminating in knowledge discovery.
The data used in this study were obtained from Kaggle and consist of a large set of car insurance claims. The dataset comprises 10,000 distinct sample insurance claims described by 19 fields covering various characteristics of policyholders and their insurance claim histories. These attributes play a significant role in insurance-risk prediction and in the analysis of insurance-related patterns.
The fields include ID, a unique number representing a record, and AGE, the policyholder's age. GENDER and RACE are demographic features. DRIVING_EXPERIENCE is the number of years the policyholder has been driving. EDUCATION records the level of education, and INCOME reflects a person's financial status. CREDIT_SCORE indicates a client's creditworthiness, and several features relate to the insured asset: VEHICLE_OWNERSHIP, VEHICLE_YEAR, and VEHICLE_TYPE.
Social and lifestyle variables include MARRIED, CHILDREN, and POSTAL_CODE. Usage metrics include ANNUAL_MILEAGE, the average number of miles travelled in a year. Driving history factors include SPEEDING_VIOLATIONS, PAST_ACCIDENTS, and DUIS. Finally, the dataset includes the dependent variable of the study, OUTCOME, the result of the claim.
Pre-processing of the data ensures the integrity and suitability of the dataset for subsequent analyses. The pipeline first converts categorical columns such as “AGE” and “DRIVING_EXPERIENCE” into numerical data, facilitating quantitative analysis. Irrelevant columns such as “RACE” are eliminated. Missing values in the numeric columns were replaced with the mean value of each column to maintain dataset integrity. Categorical variables were one-hot encoded to convert them into a binary format compatible with machine-learning algorithms. Numerical features were standardized using a standard scaler to ensure uniform feature scales, preventing features with larger magnitudes from dominating. Finally, to allow independent model validation, the dataset was divided into training and testing sets.
Finally, the pre-processed dataset was visually inspected to verify the effectiveness of the pre-processing techniques and ensure data integrity before being saved in a new CSV file for subsequent analyses. These pre-processing steps collectively prepared the dataset for robust analysis and model development, laying the groundwork for fraud detection in the automobile insurance domain.
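The pre-processing steps above can be sketched as follows. This is a minimal illustration on a toy frame: the column names follow the dataset fields, but the category labels, encoding orders, and the 80/20 split ratio are assumptions, not the authors' exact code.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few of the dataset's fields (values are illustrative).
df = pd.DataFrame({
    "AGE": ["16-25", "26-39", "40-64", "26-39"],
    "DRIVING_EXPERIENCE": ["0-9y", "10-19y", "20-29y", "0-9y"],
    "RACE": ["majority", "minority", "majority", "majority"],
    "CREDIT_SCORE": [0.52, np.nan, 0.71, 0.33],
    "VEHICLE_TYPE": ["sedan", "sports car", "sedan", "sedan"],
    "OUTCOME": [0, 1, 0, 1],
})

# 1) Ordinal-encode ordered categorical columns such as AGE and DRIVING_EXPERIENCE.
df["AGE"] = df["AGE"].map({"16-25": 0, "26-39": 1, "40-64": 2})
df["DRIVING_EXPERIENCE"] = df["DRIVING_EXPERIENCE"].map({"0-9y": 0, "10-19y": 1, "20-29y": 2})

# 2) Drop irrelevant columns.
df = df.drop(columns=["RACE"])

# 3) Mean-impute missing numeric values.
df["CREDIT_SCORE"] = df["CREDIT_SCORE"].fillna(df["CREDIT_SCORE"].mean())

# 4) One-hot encode remaining nominal categoricals.
df = pd.get_dummies(df, columns=["VEHICLE_TYPE"])

# 5) Standardize numeric features (zero mean, unit variance).
for col in ["AGE", "DRIVING_EXPERIENCE", "CREDIT_SCORE"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std(ddof=0)

# 6) Train/test split (80/20) with a fixed seed for reproducibility.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
```

A library pipeline (e.g., scikit-learn's `StandardScaler` and `train_test_split`) would replace steps 5 and 6 in practice; the manual versions here keep the sketch self-contained.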
A major problem in ML model training arises when the dataset is imbalanced: although overall model performance appears high, the model is poorly trained on the minority class. The proposed method uses SMOTE and GAN to balance the dataset and increase the number of samples.
Recently, various types of GANs have been employed to generate synthetic image and textual data. Examples include conditional GANs used on tabular data [14–16], VAEs applied to electronic health record (EHR) data [17, 18], and WGANs used for generating time-series data [19]. Our initiative focused on significantly enhancing fraud-detection accuracy using a GAN, with the main objective of generating synthetic data that closely mimic authentic instances of fraudulent claims. The Vanilla GAN was chosen for its simplicity and effectiveness in generating diverse and realistic synthetic data. Its generator component creates data that closely resemble authentic instances of fraudulent claims, while the discriminator distinguishes synthetic from actual data without the extra effort required by conditional or specialized variants. This streamlined approach captures the nuanced patterns and characteristics of fraudulent activities, leading to enhanced fraud-detection accuracy and model robustness [20]. The framework trains a discriminator model to effectively distinguish between real and synthetic instances, while a generator model learns to produce synthetic samples that exhibit statistical attributes and patterns similar to real fraudulent claims. The generator takes a latent noise vector of dimension 100 and consists of dense layers with rectified linear unit (ReLU) activation functions and batch normalization. Conversely, the discriminator includes layers with dropout regularization and a sigmoid activation function in the output layer. The discriminator was alternately trained on batches of real and synthetic data to strengthen its capacity to distinguish between the two. To optimize the iterative learning process, we set the batch size to 64 and the number of training epochs to 500. Figure 2 shows the GAN architecture.
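A forward-pass sketch of this architecture is shown below, using NumPy with random weights standing in for trained parameters. The 100-dimensional latent vector, ReLU/batch-norm generator, dropout-regularized discriminator with a sigmoid output, and batch size of 64 follow the text; the hidden-layer widths (128 and 64) and the 18-feature output dimension are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(0.0, x)

def batchnorm(x, eps=1e-5):
    # Per-batch normalization (no learned gamma/beta in this sketch).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

LATENT_DIM, N_FEATURES, BATCH = 100, 18, 64   # 18 features is an assumption

# Random weights stand in for trained parameters.
W1, b1 = rng.normal(size=(LATENT_DIM, 128)) * 0.05, np.zeros(128)
W2, b2 = rng.normal(size=(128, N_FEATURES)) * 0.05, np.zeros(N_FEATURES)

def generator(z):
    h = batchnorm(relu(dense(z, W1, b1)))     # dense -> ReLU -> batch norm
    return dense(h, W2, b2)                   # linear output = synthetic sample

D1, c1 = rng.normal(size=(N_FEATURES, 64)) * 0.05, np.zeros(64)
D2, c2 = rng.normal(size=(64, 1)) * 0.05, np.zeros(1)

def discriminator(x, drop_p=0.3, train=True):
    h = relu(dense(x, D1, c1))
    if train:                                  # inverted dropout regularization
        h = h * (rng.random(h.shape) >= drop_p) / (1.0 - drop_p)
    return sigmoid(dense(h, D2, c2))           # probability that x is real

z = rng.normal(size=(BATCH, LATENT_DIM))       # latent noise batch
fake = generator(z)                            # synthetic claim records
scores = discriminator(fake)                   # discriminator's real/fake scores
```

Training would alternate gradient updates of the discriminator (on real and fake batches) and the generator (through the discriminator's feedback), typically via a deep-learning framework rather than raw NumPy.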
During the training process of the GAN, the progression of the discriminator and generator losses was monitored across multiple epochs to assess the model’s learning dynamics and convergence. Table 1 shows loss values by epoch for the GAN model.
Table 1 shows that as the number of epochs increases, the generator and discriminator losses rise and then settle near 0.69. One major reason is data imbalance, as fraudulent cases are far fewer than non-fraudulent ones. Therefore, an additional technique is required to balance the dataset.
SMOTE balances a dataset by creating synthetic samples of the minority class to alleviate class imbalance [21]; the architecture is shown in Figure 3. Using KNN, SMOTE generates synthetic minority instances by interpolating new samples between pre-existing minority class samples [22]. This ensures that ML models are not biased towards the majority class (genuine claims) and can effectively identify and learn from the minority class (fraudulent claims). After applying SMOTE, the number of fraudulent instances in the training dataset increased to 5,437 out of a total of 10,000 instances, approximately 50% of the total. Figure 3 illustrates the SMOTE process used to address class imbalance in automobile insurance fraud detection: after pre-processing, synthetic samples were generated for the minority class, creating a balanced dataset for training and evaluating the fraud detection model.
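The interpolation step at the heart of SMOTE can be illustrated as follows. This is a simplified sketch, not the imbalanced-learn implementation; the toy 2-D minority cloud and parameter values are illustrative.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, seed=0):
    """Generate synthetic minority samples by linear interpolation
    between a minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # k nearest neighbours of x within the minority class (excluding itself)
        d = np.linalg.norm(X_min - x, axis=1)
        nn_idx = np.argsort(d)[1 : k + 1]
        neighbour = X_min[rng.choice(nn_idx)]
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(x + gap * (neighbour - x))
    return np.array(synthetic)

# Toy minority class: 20 points in 2-D
rng = np.random.default_rng(1)
X_min = rng.normal(loc=5.0, scale=1.0, size=(20, 2))
X_new = smote_sample(X_min, k=5, n_new=30)
```

Because every synthetic point is a convex combination of two existing minority points, SMOTE never extrapolates outside the minority region, which is why it mitigates majority-class bias without inventing implausible feature values.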
The hybrid “SMOTEfied-GAN” methodology integrates SMOTE with the GAN framework to address class imbalance and generate synthetic data [23]. SMOTEfied-GAN loads a modified dataset with a binary target variable (OUTCOME), setting the classification threshold to the median value. SMOTE is first applied to the minority class to generate synthetic samples, increasing the total number of samples from 10,000 to 13,734; these are then used in conjunction with the original data to train a GAN. The architecture is shown in Figure 4. During training, the GAN iterates through 200 epochs with a batch size of 64, whereby the generator learns to produce synthetic data from a noise vector of dimension 100, and the discriminator learns to distinguish between real and synthetic data.
This iterative process refines the capacity of the GAN to produce realistic synthetic data, contributing to a more robust approach for handling class imbalances in ML tasks.
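The two-stage order of the hybrid (SMOTE oversampling first, GAN training second) can be sketched as below. `smote_balance` is a naive stand-in for SMOTE and `train_gan` is a placeholder for the GAN loop; the 200-epoch and batch-size-64 settings follow the text, while the toy data shape and 20% minority rate are assumptions.

```python
import numpy as np

def smote_balance(X, y, seed=0):
    """Naive SMOTE step: interpolate minority samples until classes match."""
    rng = np.random.default_rng(seed)
    X_min, X_maj = X[y == 1], X[y == 0]
    need = len(X_maj) - len(X_min)
    new = []
    for _ in range(need):
        a, b = X_min[rng.integers(len(X_min), size=2)]
        new.append(a + rng.random() * (b - a))   # convex combination
    X_bal = np.vstack([X, new])
    y_bal = np.concatenate([y, np.ones(need)])
    return X_bal, y_bal

def train_gan(X, epochs=200, batch_size=64):
    """Placeholder for GAN training on the SMOTE-balanced data; a real
    implementation would alternate generator/discriminator updates."""
    return {"epochs": epochs, "batch_size": batch_size, "n_train": len(X)}

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)       # ~20% minority, as in fraud data

X_bal, y_bal = smote_balance(X, y)             # stage 1: SMOTE oversampling
gan = train_gan(X_bal)                         # stage 2: GAN on balanced data
```

The key design point is that the GAN sees a class-balanced training set, so its synthetic output is not dominated by the majority (non-fraud) distribution.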
Table 2 shows that both the discriminator and generator losses decrease with the number of epochs. The falling discriminator loss indicates improved performance in distinguishing real from fake data: a lower discriminator loss means that the discriminator confidently assigns higher probabilities to real data and lower probabilities to generated data. The falling generator loss likewise indicates that the generator is producing increasingly realistic samples.
Following data pre-processing and synthetic data generation, ML algorithms were applied to the extended dataset of automobile insurance claims as fraud-detection models. Three classifiers were implemented and evaluated: logistic regression, a simple and fast binary classification approach evaluated using accuracy/error, precision/recall, and F1-score; decision tree, part of the family of nonparametric supervised tree-based techniques that partition the feature space; and random forest, an ensemble learning technique that aggregates many decision trees to arrive at a solution. Among these, random forest showed the best results and was selected as the final model for fraud detection.
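A minimal evaluation loop for the three classifiers is sketched below using scikit-learn on a synthetic stand-in dataset; the class weights, feature count, and model hyperparameters are illustrative, not the paper's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Stand-in for the (augmented) claims data: 18 features, ~30% positive class.
X, y = make_classification(n_samples=2000, n_features=18,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    p, r, f1, _ = precision_recall_fscore_support(y_te, pred, average="binary")
    results[name] = {"accuracy": accuracy_score(y_te, pred),
                     "precision": p, "recall": r, "f1": f1}
```

Reporting per-class precision, recall, and F1 (rather than accuracy alone) is what exposes minority-class underperformance on imbalanced data.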
The complexity of SMOTE is primarily influenced by the number of minority class samples (n), the number of nearest neighbors (k), and the feature dimensionality (d); the nearest-neighbor search dominates, giving a cost of roughly O(n^2 · d) with a naive search, plus O(n · k · d) for generating the synthetic samples.
The computational complexity of the random forest arises from the number of decision trees (T), the number of training samples (n), and the number of features (d); training cost is approximately O(T · n log n · d).
In our comprehensive evaluation of fraud detection methodologies in the automobile insurance industry, we investigated three distinct approaches: SMOTE, GAN, and a hybrid approach combining SMOTE with GANs (SMOTEfied-GAN). SMOTE effectively addresses class imbalance by oversampling the minority class, resulting in improved fraud detection accuracy and robustness across multiple classifiers. GANs provide an alternative method for enhancing efficiency by generating synthetic data resembling the underlying distribution, thereby contributing to improved model training and performance metrics. The hybrid approach, SMOTEfied-GAN, synergistically combines the strengths of both methods, resulting in a more diverse and realistic synthetic dataset that further improves fraud-detection capabilities. A comparative analysis was conducted to assess ML models trained on datasets augmented using SMOTE, GAN, and the hybrid SMOTEfied-GAN. The results demonstrated significant improvements in fraud detection accuracy and robustness for models trained on augmented datasets. The SMOTEfied-GAN exhibited superior performance metrics compared with the original dataset, highlighting the effectiveness of synthetic data generation techniques in addressing class imbalance and enhancing fraud detection capabilities. Table 3 provides a comprehensive performance analysis of the ML models on the original and augmented datasets generated using the GAN, SMOTE, and SMOTEfied-GAN techniques.
Logistic regression and random forest exhibited identical overall performance on the original dataset, each with an accuracy of 81%, whereas decision tree reached 76%. However, the imbalance of the dataset, with a higher proportion of Class 0 cases, caused all three models to perform better on the majority class (Class 0) while underperforming on Class 1, as indicated by the lower precision, recall, and F1-scores for Class 1.
Data augmentation using the GAN improved performance across all models. Logistic regression and random forest achieved a notable accuracy of 91%, with F1-scores for Class 1 reaching 0.93, demonstrating the effectiveness of the GAN in addressing class imbalance. Decision tree also showed significant improvement, with an accuracy of 88% and balanced F1-scores for both classes.
Using SMOTE for minority class data augmentation resulted in varying performance improvements. Although logistic regression maintained an accuracy of 81%, its performance for Class 1 was marginally better than on the original dataset. Random forest outperformed the other models with an accuracy of 87%, whereas decision tree showed moderate improvements, achieving 82% accuracy.
The SMOTEfied-GAN, which combines the strengths of SMOTE and GAN, yielded the best results among the augmentation techniques. Random forest achieved the highest accuracy of 92%, along with F1-scores of 0.95 for Class 1 and 0.86 for Class 0. Decision tree and logistic regression also showed substantial gains in both precision and recall, with F1-scores reaching 0.92 and 0.91, respectively, for Class 1.
Figure 5 shows a comparative study of the accuracy of the ML models, revealing that the accuracy of the random forest model is higher than that of the other models. The dataset used for training and testing was the data augmented using SMOTEfied-GAN.
These results underscore the importance of balancing datasets for the unbiased evaluation of ML models. Although GAN and SMOTE individually address class imbalance effectively, the hybrid SMOTEfied-GAN approach provides the most robust augmentation strategy, significantly enhancing model performance, particularly for the minority class (Class 1). Among the evaluated models, random forest consistently outperformed the other classifiers across the augmented datasets, thus establishing itself as the best choice for this application.
In summary, our research advances the field of fraud detection in automobile insurance by addressing the challenge of class imbalance through techniques such as SMOTE, GANs, and the SMOTEfied-GAN. While GAN augmentation alone delivered strong overall performance, the SMOTEfied-GAN proved particularly effective with the random forest classifier on balanced datasets. This nuanced result underscores the importance of tailoring data augmentation methods to specific ML algorithms. Future research could explore advanced deep learning techniques and the integration of temporal/contextual features to further refine fraud detection mechanisms. As technology evolves, continuous efforts to enhance detection strategies will be crucial for safeguarding against fraudulent activities and protecting the interests of insurance providers and policyholders.
No potential conflict of interest relevant to this article was reported.
Logical flow of methodology.
Architecture of GAN.
Architecture of SMOTE.
Architecture of SMOTEfied-GAN.
Analysis of accuracy of ML models.
Table 1. Discriminator and generator losses in GAN.
Epochs | Discriminator loss | Generator loss |
---|---|---|
1 | 0.29198 | 0.64165 |
50 | 0.57304 | 0.94987 |
100 | 0.70780 | 0.59036 |
150 | 0.69463 | 0.69675 |
200 | 0.69169 | 0.69192 |
Table 2. Performance analysis of SMOTEfied-GAN.
Epochs | Discriminator loss | Generator loss |
---|---|---|
10 | 0.57601 | 0.44632 |
20 | 0.48884 | 0.37604 |
30 | 0.44740 | 0.33996 |
40 | 0.43613 | 0.35521 |
50 | 0.45767 | 0.27510 |
Table 3. Performance analysis of ML models on original and augmented datasets.
Dataset | Model | Accuracy (%) | Class 0 precision | Class 0 recall | Class 0 F1-score | Class 1 precision | Class 1 recall | Class 1 F1-score
---|---|---|---|---|---|---|---|---
Original | Logistic regression | 81 | 0.84 | 0.89 | 0.86 | 0.72 | 0.62 | 0.67
Original | Random forest | 81 | 0.84 | 0.89 | 0.86 | 0.73 | 0.63 | 0.67
Original | Decision tree | 76 | 0.83 | 0.82 | 0.82 | 0.62 | 0.64 | 0.63
Augmented (GAN) | Logistic regression | 91 | 0.87 | 0.90 | 0.88 | 0.94 | 0.93 | 0.93
Augmented (GAN) | Random forest | 91 | 0.86 | 0.88 | 0.87 | 0.94 | 0.92 | 0.93
Augmented (GAN) | Decision tree | 88 | 0.84 | 0.82 | 0.83 | 0.91 | 0.91 | 0.91
Augmented (SMOTE) | Logistic regression | 81 | 0.82 | 0.78 | 0.80 | 0.81 | 0.85 | 0.83
Augmented (SMOTE) | Random forest | 87 | 0.87 | 0.86 | 0.86 | 0.87 | 0.88 | 0.88
Augmented (SMOTE) | Decision tree | 82 | 0.81 | 0.81 | 0.81 | 0.83 | 0.83 | 0.83
Augmented (SMOTEfied-GAN) | Logistic regression | 88 | 0.81 | 0.76 | 0.78 | 0.90 | 0.92 | 0.91
Augmented (SMOTEfied-GAN) | Random forest | 92 | 0.86 | 0.85 | 0.86 | 0.94 | 0.95 | 0.95
Augmented (SMOTEfied-GAN) | Decision tree | 89 | 0.81 | 0.81 | 0.82 | 0.92 | 0.92 | 0.92