International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 233-244
Published online September 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.3.233
© The Korean Institute of Intelligent Systems
Ahmed Al-Ajeli, Eman S. Al-Shamery, and Raaid Alubady
College of Information Technology, University of Babylon, Babylon, Iraq
Correspondence to :
Ahmed Al-Ajeli (a.alajeli@uobabylon.edu.iq)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Emails have become a common means of exchanging messages and information over the internet. However, some emails come from malicious senders, which can lead to several problems. Thus, there is an urgent need to develop reliable and powerful methods for filtering such emails. In this paper, we present a new approach for filtering emails into spam and ham based on the text of the email being analyzed. After feature extraction, a supervised classifier system is adopted to generate a rule-based model that captures spam and non-spam patterns and is later used as an interactive filter. The aim is to obtain a compact model that provides highly accurate performance using only a few rules. The proposed approach was applied to two datasets that differ in size and features. Experimental results indicated that an accuracy of 99.7% was achieved using a model with only 74 rules and 70 conditions. With this reduced number of rules, the proposed approach presents an efficient solution for filtering spam emails using the trained model. In addition, the proposed approach showed better performance than its counterparts.
Keywords: Spam email, Learning classifier systems, Rule-based systems, Rule compaction
Billions of people around the world use email as an effective, fast, and inexpensive method of exchanging messages, and it is considered an essential communication medium. Saleh et al. [1] showed that nearly five billion email accounts were actively used in 2017, and by the end of 2019 this number was expected to exceed five and a half billion. They also highlighted that approximately 270 billion emails are exchanged daily, of which approximately 57% are spam. The term “spam email,” according to the Text Retrieval Conference (TREC), refers to an unsolicited and unwanted email that is indiscriminately sent to email accounts [2]. Such spam emails may carry different types of malicious content, such as unwanted advertisements, phishing, and fraud schemes.
Spam emails have a range of negative effects, including (i) disturbing individual users, (ii) reducing the reliability of email, (iii) wasting computational resources allocated to file servers, (iv) misusing network bandwidth, (v) reducing work productivity, (vi) increasing the probability of spreading viruses such as Trojan horses and worms, and (vii) increasing financial losses through denial-of-service attacks, phishing attacks, and directory harvesting attacks [3]. Furthermore, with the increasing popularity of email, the problem of spam email filtering has become crucial for social stability and public security [4]. Hence, an effective and intelligent solution to the problem of filtering spam emails is urgently required.
To this end, several approaches have been proposed to address the problem in question (see [5,6] for a systematic review). Nevertheless, thousands of spam emails are still encountered by email users every day. This is a result of the changing nature of emails, which leads researchers to explore new methods from the artificial intelligence field, in particular, machine learning methods [7]. Recently, some studies have also proposed the application of machine-learning techniques to address the detection problems of unwanted scenarios [8,9].
The focus of this study is the use of machine learning techniques to address the problem of filtering spam emails. In particular, learning classifier system (LCS) approaches are investigated. An LCS is a rule-based system in which each rule covers part of the state space [10]. Evolutionary algorithms such as the genetic algorithm (GA) are applied to evolve the rules and guide the search toward an optimal set of rules based on classification accuracy. The problem of spam email filtering was first addressed using an LCS in [11]. That work extends the extended classifier system (XCS) model (a reinforcement-learning-based LCS) [12,13] by presenting a new scheme for representing rules, with the aim of addressing the sparseness of feature vectors encoded by the term frequency-inverse document frequency (TF-IDF). However, the size of the resulting learning model was not considered. In other words, the number of rules and the number of conditions in each rule need to be considered, as a large model can be very problematic, especially when the model is run online.
The aim of the present study is to construct a rule-based machine learning approach for detecting spam emails. This approach produces a filter (model) of minimal size (a few rules) while maintaining the same performance. The supervised classifier system (UCS) [14–16] is adopted to generate a learning model that can then be used to filter spam emails; the UCS was specifically designed for supervised learning, as is the case in the current problem. Two datasets of different sizes and features were considered to evaluate the performance of the proposed approach. Both datasets consist of emails in the form of text files that are cleaned and transformed into binary feature vectors. Each dataset is divided into two subsets: training and testing. The training subset is passed to the UCS algorithm to generate an optimal rule-based learning model. Because the resulting model can contain a considerable number of redundant rules, a rule compaction strategy is then applied to remove them; several such strategies have been developed [17], and in this study the quick rule compaction (QRC) strategy is adopted [18]. The resulting approach differs from previous work in that it builds a rule-based machine learning model that is human-readable (if-then-action form) and capable of making decisions in near real time.
The remainder of this paper is organized as follows. A brief summary of related work is provided in Section 2. Sections 3 and 4 provide formal descriptions of the spam email filtering problem and learning classifier systems, respectively. In Section 5, details of the proposed approach are presented. The experimental results and discussion are presented in Section 6. The conclusions are presented in Section 7.
In this section, the most popular techniques for addressing the problem of spam email filtering are briefly reviewed. According to Kakade et al. [19], spam filtering techniques can be classified into four categories: machine learning-based techniques, content-based techniques, list- or word-based techniques, and hybrid techniques. In the scientific community, machine learning-based techniques, which rest on a solid mathematical foundation, are widely trusted and adopted, which explains their popularity and success.
A simple classifier model is naïve Bayes, which is based on Bayes' theorem with a strong independence assumption [20]. It is a probability-based classifier that computes, for a given instance, the probability of each class from frequency statistics of the dataset being analyzed; the class with the highest posterior probability is then chosen. The algorithm works efficiently in a supervised learning paradigm. However, the accuracy of this classifier depends on the type of dataset being handled and is also sensitive to misspelled words in emails, which can decrease classification accuracy and efficiency. Further work has been conducted to mitigate the weaknesses of the naïve Bayes algorithm [4,21,22].
The support vector machine (SVM) technique is also applied in email classification [23]. It is a supervised learning approach and is ranked as the best “off-the-shelf” method recently used in the spam email filtering problem. In the machine learning community, it has also become one of the most sought-after classifiers, because it provides better generalization, requires fewer samples for training, and can handle high-dimensional datasets with the help of kernel-based decision machines. However, it is not suitable for large datasets, and its performance may decrease when the number of features exceeds the number of training instances.
Further improvements in the performance and efficiency of SVM techniques have been achieved by applying the sequential minimal optimization (SMO) algorithm to spam email filtering [24]. SMO is a modified method for solving the quadratic programming problem that arises when training a standard SVM, and it improves the efficiency of the learning model. Moreover, the proposed approach incorporates a hybrid feature selection algorithm to reduce dimensionality and increase scalability. A comparison with the standard SVM approach and several previous approaches shows that the proposed approach outperforms them.
The use of clustering for analyzing spam emails was reported in [25,26]. The authors propose a framework for spam email filtering based on a categorical clustering tree algorithm that both groups and classifies large datasets and data streams of spam emails into homogeneous campaigns by structural similarity. Two modes of operation, batch and dynamic, were applied to handle the data. While the aim of grouping spam emails is to identify campaigns of similar emails, classification is used to assign labels to particular emails (e.g., malware, phishing, advertisement).
Furthermore, a spam filter that mainly adopts deep learning techniques has been proposed [27]. This filter integrates three main components: (a) feature representation based on N-gram TF-IDF to handle high dimensionality, (b) a modified balancing algorithm, and (c) a multilayer perceptron neural network model. The authors demonstrated the performance of the proposed approach using four datasets. In addition, a comparison with state-of-the-art anti-spam filters is provided. The results of the comparison show a better performance in terms of the filtering accuracy of the proposed approach. Similarly, another study based on a convolutional neural network and long short-term memory architecture was proposed [28]. This architecture was supported by adding semantic information to the representation of words. The experimental results were obtained using two datasets (SMS and Twitter) to measure the performance in terms of the F1-score and accuracy. In addition, several variations have been developed to improve the performance of deep-learning techniques for spam email detection [29,30].
In this section, the problem of spam email filtering is described formally. Consider a set of emails E = {e1, e2, ..., en}, where each email ei is represented by a feature vector extracted from its textual content. Given the two classes spam and ham, the filtering problem is to learn a mapping f : E → {spam, ham} such that:
• f(ei) = spam if ei is an unsolicited and unwanted email;
• f(ei) = ham otherwise.
In this study, the filtering of a spam email is based on the textual content (body) of the email itself; that is, the only information considered is contained in the body of the email being analyzed. In other words, the source from which an email comes is not used to distinguish between emails.
In this section, we briefly discuss LCS. According to Sigaud and Wilson [10], an LCS is defined as a rule-based system in which a set of optimal rules is automatically built. The LCS comprises a population of rules called classifiers evolving by an adaptive algorithm, such as a GA, to discover the optimal rules. In this sense, the LCS can be considered a learning system through which a set of optimal rules is learned. The main source of the data from which these rules are built is called the environment, which in our case is the training set of emails being analyzed.
The core of the LCS is the rule, which represents the building block of the resulting model. Each rule comprises two parts: the condition part (corresponding to the features) and the action part (corresponding to the class). Rules can therefore be interpreted in a human-readable form as: IF condition THEN action.
UCS uses incremental learning, in which one instance of the dataset is selected in each iteration. We assume that each training instance is a pair consisting of a feature vector and its associated class label (spam or ham).
The UCS builds an optimal set of rules using accuracy-based fitness to guide the search. Each rule is associated with a set of parameters, the most important of which are accuracy, fitness, experience, and numerosity. The quality of a rule is measured by its accuracy and fitness. Numerosity is the number of copies of a rule in the population, and experience is the number of times a rule has participated in a match set.
Table 1 provides an example of two rules produced by this approach. Each rule has a condition part with 12 entries taking the values 0, 1, or #, and an action part containing 0 or 1. Assume that the input vector is (0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1); this vector matches rule 1, but not rule 2.
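For illustration, the following minimal Python sketch encodes the two rules of Table 1 and checks them against the input vector above; the # symbol acts as a don't-care entry. The sketch is illustrative only and is not the implementation used in the experiments.

```python
# Minimal illustration of ternary-rule matching (not the paper's implementation).
def matches(condition, instance):
    """A rule matches when every non-'#' entry equals the corresponding input bit."""
    return all(c == '#' or c == str(bit) for c, bit in zip(condition, instance))

# Rules 1 and 2 from Table 1 as (condition, action) pairs.
rule1 = (list("###1###0##0#"), 0)
rule2 = (list("##0#0##1##1#"), 1)

x = (0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1)   # input vector from the example above

print(matches(rule1[0], x))   # True  -> rule 1 matches
print(matches(rule2[0], x))   # False -> rule 2 does not match
```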
An outline of the proposed approach is presented in Figure 1. It comprises three main components: pre-learning, UCS, and post-learning.
Pre-processing is the first step; it prepares the textual data of the emails for the subsequent steps by removing all content that is unnecessary for filtering. To this end, we used the Natural Language Toolkit (NLTK 3.6.2) for Python to implement a sequence of text-cleaning steps through which each email is processed.
Applying these steps results in clean text that is appropriate for the following steps.
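For illustration, a cleaning pipeline of this kind can be sketched with NLTK as follows. The particular steps shown (lower-casing, removal of non-letters, tokenization, stop-word removal, and stemming) are common choices given only for illustration; the exact pipeline used in the experiments may differ.

```python
# Illustrative NLTK cleaning pipeline (assumed steps; the actual pipeline may differ).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_email(text):
    """Lower-case, keep letters only, tokenize, drop stop words, and stem."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = nltk.word_tokenize(text)
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(clean_email("Congratulations! You have WON a free prize. Reply now."))
```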
Features have an important impact on data mining. Extracting appropriate features helps solve several problems, including dimensional reduction, training convergence, overfitting avoidance, and classification accuracy. The purpose of this process was to obtain a set of features for each instance in the dataset. An instance here represents an email in which the features are a reflection of the vocabulary (textual content) of the email being analyzed. The result of this process is a vector of features entered into the LCS model to determine whether a given email is ham or spam.
In this study, binary values were used to represent the feature vectors, which are determined as follows. Let V denote the vocabulary of distinct words extracted from the training emails after pre-processing. Each email is then encoded as a binary vector of length |V|, in which the i-th entry is 1 if the i-th word of V occurs in the email and 0 otherwise.
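A minimal sketch of this encoding is given below, assuming the vocabulary is simply the set of distinct training words; function and variable names are illustrative.

```python
# Illustrative binary bag-of-words encoding over a vocabulary built from training emails.
def build_vocabulary(token_lists):
    """Map each distinct training word to a fixed vector position."""
    words = sorted({w for tokens in token_lists for w in tokens})
    return {w: i for i, w in enumerate(words)}

def to_binary_vector(tokens, vocab):
    """Entry i is 1 if the i-th vocabulary word occurs in the email, else 0."""
    vec = [0] * len(vocab)
    for w in tokens:
        if w in vocab:
            vec[vocab[w]] = 1
    return vec

train_tokens = [["free", "prize", "click"], ["meeting", "agenda", "minutes"]]
vocab = build_vocabulary(train_tokens)
print(to_binary_vector(["free", "meeting", "lottery"], vocab))   # unseen words are ignored
```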
In this study, the UCS algorithm is adopted as the learning algorithm, in which one training instance of the dataset under consideration is used in each iteration. Nine phases make up the learning cycle of the UCS: phases 1–6, 8, and 9 describe the learning part of the cycle, whereas phase 7 performs rule discovery (see Figure 1). These phases are repeated until a user-predefined number of iterations is reached.
One training instance from the dataset is received in each iteration (phase 1). This training instance is then checked against the rules in the current population: the rules whose conditions match the instance form the match set, and those among them whose action agrees with the instance label form the correct set. If no rule matches the instance correctly, covering is applied to create a new rule that does.
For rule discovery (phase 7), the GA is applied to produce two new rules, which are added to the population; if the population then exceeds its maximum size, the worst rules are deleted.
The pseudocode of the UCS is given in Algorithm 1. Given a set of training instances, each of which is a pair consisting of a feature vector and its class label, the algorithm iterates over these instances and evolves the population of rules as described above.
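To convey the flavor of the learning cycle, the following Python sketch shows a deliberately simplified UCS-style loop: matching, covering, accuracy-based fitness updates, and a basic GA step. It omits details of the full algorithm, such as subsumption, the deletion schedule, and the exact parameter settings, and should not be read as the implementation used in the experiments.

```python
# Simplified UCS-style learning cycle (illustrative sketch, not the paper's Algorithm 1).
import random

class Rule:
    def __init__(self, condition, action):
        self.condition = condition      # list of '0' / '1' / '#'
        self.action = action            # class label: 0 (ham) or 1 (spam), assumed encoding
        self.experience = 0             # times the rule appeared in a match set
        self.correct = 0                # times its action agreed with the instance label
        self.fitness = 1.0

    def matches(self, x):
        return all(c == '#' or c == str(b) for c, b in zip(self.condition, x))

    def accuracy(self):
        return self.correct / self.experience if self.experience else 0.0

def cover(x, label, p_hash=0.6):
    """Covering: build a rule matching x, generalizing each position with probability p_hash."""
    return Rule(['#' if random.random() < p_hash else str(b) for b in x], label)

def ga_offspring(parent_a, parent_b, mu=0.04):
    """Uniform crossover of two parent conditions followed by light mutation."""
    child = [random.choice(pair) for pair in zip(parent_a.condition, parent_b.condition)]
    child = [random.choice('01#') if random.random() < mu else c for c in child]
    return Rule(child, parent_a.action)

def train_ucs(data, iterations=20000, nu=10):
    """data: list of (binary feature tuple, label) pairs."""
    population = []
    for _ in range(iterations):
        x, label = random.choice(data)
        match_set = [r for r in population if r.matches(x)]
        correct_set = [r for r in match_set if r.action == label]
        if not correct_set:                          # covering when nothing matches correctly
            new_rule = cover(x, label)
            population.append(new_rule)
            match_set.append(new_rule)
            correct_set = [new_rule]
        for r in match_set:                          # accuracy-based fitness update
            r.experience += 1
            if r.action == label:
                r.correct += 1
            r.fitness = r.accuracy() ** nu
        if len(correct_set) >= 2:                    # rule discovery via a basic GA step
            pa, pb = random.sample(correct_set, 2)
            population.append(ga_offspring(pa, pb))
    return population
```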
After the learning step ends, a rule-compaction strategy is applied. As described previously, the aim of this strategy is to remove redundant rules from the final population without degrading its classification performance; the QRC strategy used here is outlined in Algorithm 2.
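To convey the idea of compaction, the sketch below (reusing the Rule class from the previous sketch) greedily keeps the highest-quality rules that still add coverage of correctly classified training instances and discards the rest. It follows the general spirit of quick rule compaction but is not necessarily identical to Algorithm 2.

```python
# Greedy rule compaction in the spirit of QRC (illustrative; not the paper's Algorithm 2).
def quick_rule_compaction(population, data):
    """Keep rules in decreasing order of quality while they cover new correctly
    classified training instances; drop rules that add no coverage."""
    ranked = sorted(population, key=lambda r: (r.fitness, r.experience), reverse=True)
    compacted, uncovered = [], set(range(len(data)))
    for rule in ranked:
        newly = {i for i in uncovered
                 if rule.matches(data[i][0]) and rule.action == data[i][1]}
        if newly:
            compacted.append(rule)
            uncovered -= newly
        if not uncovered:
            break
    return compacted
```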
In the testing mode, a new input instance is given, and the set of rules whose conditions match this instance is used to determine a vote for each class, weighted by rule quality (e.g., fitness and numerosity). The class that receives the highest vote is returned as the predicted label of the email (spam or ham).
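Under the same assumptions as the earlier sketches, this test-time decision can be expressed as fitness-weighted voting over the matching rules; the exact vote definition used in the experiments may differ.

```python
# Fitness-weighted voting over matching rules (illustrative test-time decision).
def predict(rules, x):
    votes = {0: 0.0, 1: 0.0}                     # 0 = ham, 1 = spam (assumed encoding)
    for r in rules:
        if r.matches(x):
            votes[r.action] += r.fitness * getattr(r, "numerosity", 1)
    return max(votes, key=votes.get)             # class with the largest vote wins
```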
We conclude this section with a comment on the computation and memory requirements of the present approach. The training stage (an offline step) incurs a high computational cost: the search for the optimal set of rules is slow compared with techniques based on a single local solution. However, the resulting model (rule set) can provide classification decisions in real time, which is an important outcome of applying the UCS approach.
This section presents the experimental results obtained using the proposed approach. Two datasets with different features and sizes were used: the first dataset (Ling-Spam) is the smaller of the two, and the second dataset (Enron4) is larger; both contain emails labeled as spam or ham.
In this study, the parameters of the UCS, such as the maximum size of the rule population and the number of learning iterations, are fixed before the learning process begins.
The performance of the proposed approach was then evaluated using different criteria, namely accuracy, precision, recall, and F-measure.
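For reference, these measures are computed from the counts of true/false positives and negatives; in the sketch below, spam is taken as the positive class (an assumption on our part).

```python
# Standard evaluation measures used in Table 3 (spam assumed to be the positive class).
def evaluate(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy
```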
The performance in terms of precision, recall, and F-measure is depicted in Figure 3. The goal is to study the balance between correctly detected spam (true positives) and missed spam (false negatives); in the spam email filtering problem, the most interesting case arises when a given email is classified as ham while it is actually spam. Figure 3(a) shows that increasing the number of iterations increases the values of the three measures under consideration, with slight fluctuations. The highest performance is recorded for recall, with a value of 0.9753, i.e., a higher true positive ratio. The results in Figure 3(b) are more stable and show a clear improvement over those obtained earlier: precision, recall, and F-measure all reach their peak of 0.997 at 150,000 iterations. This is because a larger dataset tends to speed up and stabilize the learning process.
The size of the model, in terms of the number of rules and the number of features in each rule, is illustrated in Figure 4. For both datasets, the UCS requires fewer rules as the number of iterations grows, as shown in Figure 4(a). The results obtained for Enron4 are better than those obtained for Ling-Spam: the smallest number of rules, only 74, is obtained for the Enron4 dataset. Although the Enron4 dataset contains more emails than the Ling-Spam dataset, it yields fewer rules. This result is expected, because a large dataset gives the LCS the opportunity to produce a more compact set of general rules.
Table 2 summarizes the results of applying the QRC strategy to both datasets. In each case, the number of rules before and after applying the QRC is listed. A significant reduction in the number of rules formed by the UCS is achieved; for example, the number of rules was reduced from 4294 to 74 in the case of Enron4. Note that all of the retained rules are 100% accurate, i.e., the set of instances covered by a given rule is correctly classified.
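As a worked example of the compaction effect, taking the Enron4 column of Table 2 at 200,000 iterations, the reduction rate in the number of rules is (4294 − 74) / 4294 ≈ 0.983, i.e., roughly 98.3%, consistent with the 98.27% reduction reported in the conclusion.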
A similar relationship is observed between the number of features and the number of iterations, as depicted in Figure 4(b). Again, the UCS yields better results for Enron4 than for Ling-Spam, and these results gradually improve as the number of iterations increases. The UCS implicitly performs feature selection through the don't-care symbol #: a given feature of the training set can be ignored if the corresponding condition entry is # in all rules of the final population.
Finally, a comparison between the performance of the proposed approach and the most relevant approaches is presented in Table 3; the Enron4 dataset was used for this comparison. Across the four evaluation measures, the results indicate that the UCS-based approach provides higher performance than the other approaches: a value of approximately 0.99 is obtained by the present approach, compared with 0.95 obtained by the best of the other approaches. The better performance of the UCS could stem from its ability to divide the problem into sub-problems that are easy to solve, in contrast to approaches that rely on a single solution. This enables the UCS to produce a global solution that the other approaches cannot guarantee.
In this study, a new approach for filtering spam emails was presented. This approach uses the UCS algorithm to create a learning model for filtering, captured in the form of a set of rules generated by the UCS. The proposed approach comprises three main components. A pre-processing and feature-extraction component is first applied to refine, normalize, and represent the email data. The UCS algorithm is then applied in the learning process, incorporating the GA as an optimization method to discover new rules. The use of rules yields a human-readable form and provides output that is easy to interpret and understand. However, a considerable problem arises from the large number of rules generated by the UCS, in addition to the large number of conditions in each rule. Thus, in the final stage, the QRC strategy is adopted to reduce the number of rules while maintaining the performance of the generated learning model. The experimental results show that selecting appropriate parameters, such as the number of learning iterations and the size of the rule population, has a major impact on the performance, stability, and generalization of the proposed approach. Moreover, the proposed approach yields promising performance compared with the most popular existing approaches, supported by a reduction in the size of the learning model: a reduction rate of 98.27% is achieved in the number of rules due to the QRC strategy, and a reduction rate of 37.5% is obtained in the number of conditions, the latter resulting from the implicit ability of the UCS algorithm to perform a form of feature selection. In future work, the most relevant features of the instances in the datasets should be identified so that the remaining features can be disregarded, further reducing the number of conditions in the rules.
No potential conflict of interest relevant to this article was reported.
Algorithm 1. UCS.
Algorithm 2. QRC.
Table 1. Some examples of rules produced by the UCS.
No. | Condition | Action |
---|---|---|
1 | # # # 1 # # # 0 # # 0 # | 0 |
2 | # # 0 # 0 # # 1 # # 1 # | 1 |
Table 2. Results of rule compaction.
Iterations | Ling-Spam (before) | Ling-Spam (after) | Enron4 (before) | Enron4 (after) |
---|---|---|---|---|
50000 | 19791 | 277 | 7370 | 94 |
100000 | 19738 | 155 | 6766 | 82 |
150000 | 19737 | 104 | 5674 | 80 |
200000 | 19503 | 83 | 4294 | 74 |
Table 3. Evaluation of the Enron4 dataset.
Learning algorithm | Precision | Recall | F-measure | Accuracy |
---|---|---|---|---|
Bayes net | 0.880 | 0.843 | 0.838 | 0.8426 |
Naïve Bayes | 0.893 | 0.868 | 0.866 | 0.868 |
Logistic function | 0.951 | 0.951 | 0.951 | 0.951 |
SVM | 0.939 | 0.933 | 0.933 | 0.933 |
UCS | 0.997 | 0.997 | 0.997 | 0.997 |
Bold font indicates the best results.
Figure 1. Architecture of the proposed approach.
Figure 2. Experimental results based on accuracy: (a) Ling-Spam dataset, (b) Enron4 dataset.
Figure 3. Evaluation results based on different measures: (a) Ling-Spam dataset, (b) Enron4 dataset.
Figure 4. Size of the model: (a) number of rules, (b) number of features.