Article Search
닫기

## Original Article

Split Viewer

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 233-244

Published online September 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.3.233

© The Korean Institute of Intelligent Systems

## An Intelligent Spam Email Filtering Approach Using a Learning Classifier System

Ahmed Al-Ajeli, Eman S. Al-Shamery, and Raaid Alubady

College of Information Technology, University of Babylon, Babylon, Iraq

Correspondence to :
Ahmed Al-Ajeli (a.alajeli@uobabylon.edu.iq)

Received: February 11, 2022; Revised: April 19, 2022; Accepted: May 24, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Emails have become a common modality to exchange messages and information over the internet. Some emails are frequently received from malicious senders, which can lead to several problems. Thus, there is an urgent need to develop reliable and powerful methods for filtering such emails. In this paper, we present a new approach to filter emails to spam and ham based on the text of the email being analyzed. After feature extraction, a supervised classifier system is adopted to generate a rule-based model that captures spam and non-spam patterns used later as an interactive filter. The aim is to obtain a compact model that provides a highly accurate performance using only a few rules. The proposed approach was applied to two datasets in terms of size and features. Experimental results indicated that an accuracy of 99:7% was achieved using a model with only 74 rules and 70 conditions. With this reduced number of rules, the proposed approach presents an efficient solution for filtering spam email using the trained model. In addition, the results of the proposed approach showed better performance than their counterparts.

Keywords: Spam email, Learning classifier systems, Rule-based systems, Rule compaction

Billions of people around the world use emails as an effective, fast, and inexpensive method of exchanging messages, and they are considered an essential communication medium. Saleh et al. [1] showed that nearly five billion email accounts were actively utilized in 2017, and by the end of 2019 this number was expected to increase to more than five and half billion. They also highlighted the fact that there are 270 billion emails exchanged daily, and approximately 57% of these are spam emails. The term “spam email” according to the Text Retrieval Conference (TREC) means an unsolicited and unwanted email that is indiscriminately sent via email accounts [2]. These spam emails may carry different types of malicious content, such as the proliferation of unwanted advertisements, phishing, and fraud schemes.

Spam emails have a set of effects that can be listed as follows: (i) disturbing individual users, (ii) unreliable emails, (iii) wasting computational resources allocated for file servers, (iv) network bandwidth ill-usage, (v) loss of work productivity, (vi) increasing the probability of spreading viruses such as Trojan horses and worms, and (vii) increasing financial losses by denial of service attacks, phishing attacks, and directory harvesting attacks [3]. Furthermore, with the increasing popularity of emails, the problem of spam email filtering has become crucial for social stability and public security [4]. Hence, an effective and smart solution to address the problem of filtering spam emails is urgently required.

To this end, several approaches have been proposed to address the problem in question (see [5,6] for a systematic review). Nevertheless, thousands of spam emails are still encountered by email users every day. This is a result of the changing nature of emails, which leads researchers to explore new methods from the artificial intelligence field, in particular, machine learning methods [7]. Recently, some studies have also proposed the application of machine-learning techniques to address the detection problems of unwanted scenarios [8,9].

The focus of this study is the use of machine learning techniques to address the filtering problem of spam emails. Learning classifier system (LCS) approaches have been investigated to address this problem. The LCS model is a rule-based system in which each rule covers part of the state space [10]. Evolutionary algorithms such as a genetic algorithm (GA) are applied to evolve each rule and guide the search to obtain the optimal set of rules based on the classification accuracy. The problem of spam email filtering was first addressed using the LCS model in [11]. This study proposes an approach that extends the extended classifier system (XCS) model (reinforcement learning model) [12,13] of the LCS by presenting a new scheme for the representation of rules. The aim was to address the problem of sparseness in feature vectors encoded by the term frequency-inverse document frequency (TF-IDF). However, the size of the resulting learning model has not been considered and studied. In other words, the number of rules and the conditions in each rule need to be considered, as this could be very problematic, especially when running the model online.

The aim of the present study is to construct a rule-based machine learning approach for the detection of spam emails. This approach produces a filter (model) with the minimum size (a few rules), while maintaining the same performance. The supervised classifier system (UCS) [1416] was adopted to generate a learning model that can then be used to filter spam emails. The UCS model was specifically designed for supervised learning, as is the case in the current problem. Two datasets of different sizes and features were considered to evaluate the performance of the proposed approach. Both datasets have a set of emails in the form of text files that are cleaned and transformed into feature vectors capturing binary values. Each dataset is divided into two subsets: training and testing. The training subset was passed to the UCS algorithm to generate an optimal rule-based learning model. The resulting model can contain a considerable number of redundant rules. In this step, a rule compaction strategy was applied to reduce the number of redundant rules, and several strategies have been developed [17] to this end. In this study, a strategy called quick rule compaction (QRC) is adopted [18]. This provides an approach that is different from the previous works in that it builds a rule-based machine learning model that is human-readable (if-then-action form) and capable of making a decision in near real-time.

The remainder of this paper is organized as follows. A brief summary of related work is provided in Section 2. Sections 3 and 4 provide formal descriptions of the spam email filtering problem and learning classifier systems, respectively. In Section 5, details of the proposed approach are presented. The experimental results and discussion are presented in Section 6. The conclusions are presented in Section 7.

In this section, the most popular techniques to address the problem of spam email filtering are briefly reviewed. According to Kakade et al. [19], spam filtering techniques can be classified into four categories: machine learning-based techniques, content-based techniques, list or word-based techniques, and hybrid techniques. In the scientific community, machine learning-based techniques are often trusted and adopted with a powerful mathematical background, which motivates the popularity and success of such techniques.

A simple classifier model is the naïve Bayes, which is based on Bayes’ theorem with a strong independence hypothesis [20]. It is a probability-based classifier that calculates the separation probabilities for specific instances. The probability set is computed by determining the harmonic and frequency values of the dataset being analyzed. Then, the class probability closest to the back end is chosen by the classifier. This algorithm is also a multilayered classifier that works efficiently in a supervised learning paradigm. However, the accuracy of this classifier depends on the type of dataset being handled, in addition to the misspelled words present in emails. Consequently, a decrease in the accuracy and efficiency of the classification is observed. Further work has been conducted to mitigate the weaknesses of the naïve Bayes algorithm [4,21,22].

The support vector machine (SVM) technique is also applied in email classification [23]. It is a supervised learning approach and is ranked as the best “off-the-shelf” method recently used in the spam email filtering problem. In the machine learning community, it has also become one of the most sought-after classifiers, because it provides better generalization, requires fewer samples for training, and can handle high-dimensional datasets with the help of kernel-based decision machines. However, it is not suitable for large datasets, and its performance may decrease when the number of features exceeds the number of training instances.

Further improvements in the performance and efficiency of SVM techniques have been achieved by applying the sequential minimal optimization (SMO) algorithm for spam email filtering [24]. SMO is a modified form for solving the quadratic programming problem of the standard SVM learning algorithm. The function of this modified form is to improve efficiency and optimize the learning model. Moreover, the proposed approach incorporates a hybrid feature selection algorithm to reduce dimensionality and increase scalability. In addition, a comparison of the performance of the standard SVM approach and several previous approaches is presented. This comparison shows that the proposed approach outperformed these approaches.

The use of clustering for analyzing spam emails was reported in [25,26]. The authors propose a framework to address the problem of spam email filtering, based on the use of a categorical clustering tree algorithm for both groups, and classifies large datasets and data streams of spam emails into homogeneous campaigns by structural similarity. Two modes of operation, batch and dynamic, were applied to handle the data. While the aim of grouping spam emails is to identify the campaigns of similar emails, classification is used to assign labels to certain emails (e.g., malware, phishing advertisement, etc.).

Furthermore, a spam filter that mainly adopts deep learning techniques has been proposed [27]. This filter integrates three main components: (a) feature representation based on N-gram TF-IDF to handle high dimensionality, (b) a modified balancing algorithm, and (c) a multilayer perceptron neural network model. The authors demonstrated the performance of the proposed approach using four datasets. In addition, a comparison with state-of-the-art anti-spam filters is provided. The results of the comparison show a better performance in terms of the filtering accuracy of the proposed approach. Similarly, another study based on a convolutional neural network and long short-term memory architecture was proposed [28]. This architecture was supported by adding semantic information to the representation of words. The experimental results were obtained using two datasets (SMS and Twitter) to measure the performance in terms of the F1-score and accuracy. In addition, several variations have been developed to improve the performance of deep-learning techniques for spam email detection [29,30].

In this section, the problem of spam email filtering is described formally. Consider a set of emails E = {e1, e2, …, en} partitioned into two sets Es (the set of spam emails) and Eh (the set of ham emails). For each instance eiE there exists an associated label l ∈ {0, 1}, where 0 and 1 denote ham and spam classes, respectively. The goal is to build a filter such tha, for any unseen received email eE, a label (also called a target class) is given to that email. In other words, the problem of email spam filtering is reduced to a binary classification problem, in which only two classes exist.

### Definition 1

Given two sets E = EhEs and E′ of emails such that EE′, filter F is a mapping constructed from E such that F : E′ → {0, 1}. Subsequently, by receiving an email eE′, this filter identifies two target classes as follows:

• F(e) = 0 if e is ham

• F(e) = 1 if e is spam

In this study, the focus to filter a spam email is on the textual content (body) of the email itself, i.e., the only information considered is contained in the body of the email being analyzed. In other words, the source from which emails come is not considered in order to distinguish between emails.

In this section, we briefly discuss LCS. According to Sigaud and Wilson [10], an LCS is defined as a rule-based system in which a set of optimal rules is automatically built. The LCS comprises a population of rules called classifiers evolving by an adaptive algorithm, such as a GA, to discover the optimal rules. In this sense, the LCS can be considered a learning system through which a set of optimal rules is learned. The main source of the data from which these rules are built is called the environment, which in our case is the training set of emails being analyzed.

The core of the LCS is the rule that represents the building blocks of the resulting model. These rules comprise two parts: the condition part (corresponding to the features) and the action part (corresponding to the class). Rules can be interpreted in a human-readable form as ifconditionthenaction. The rules in this form are similar to the structure of the instances in the training set, except for the symbol #, i.e., do not care, and are included in the condition part of the rules. This symbol provides a generalization between the corresponding features in the dataset and the predicted class. It is noteworthy that there might be more than one condition labelled with the symbol #.

UCS uses incremental learning, in which an instance of the dataset is selected in each iteration. We assume that P represents the population of rules. From this set, a new set M with all the rules in P whose conditions match the selected instance is created. Matching is the process of identifying the given instance to which it belongs. This is achieved by checking the specific values in the rule condition (0 or 1) for the corresponding features of the current instance. A match is made if and only if each of these specific values and their corresponding features are equal. Note that symbol # can match any value, and the action part of the rule is excluded from the matching process. Next, a further set C is built to hold the rules of M with an action matched to the label of the instance being considered. Initially, set P is empty. In this case, an operation called covering is activated, and a new rule is then created and added to P. Creation is accomplished by replacing the specific features of the current instance with the symbol # at probability P#. Another case that requires activation of covering arises when no rule in M has an action matched to the label of the instance.

The UCS builds an optimal set of rules using accuracy-based fitness to guide the search for these optimal rules. Each rule is associated with a set of parameters, the most important of which are: accuracy, fitness, experience, and numerosity. The quality of a rule is measured based on its accuracy and fitness. Numerosity is defined as the number of copies a rule has in the population, and experience is the number of times a rule participates in M. These parameters were updated after each learning iteration. To discover a new rule, the GA selects two rules as a parent (based on their fitness) from set C, and produces two new offspring. These new offspring are added to population P. Previously, they were checked for subsumption of its parent. The goal of the subsumption mechanism is to determine whether the parent is more general than its offspring. If this is not the case, the two offsprings are added to the population.

Table 1 provides an example of the two rules presented in this approach. Each rule has as a condition part 12 entries with the values 0, 1 or #. The action part contained 0 or 1. We assume that the input vector is (0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1), which matches rule 1, but not rule 2.

An outline of the proposed approach is presented in Figure 1. Three main components are: pre-learning, UCS, and post-learning.

### 5.1 Pre-processing

Pre-processing is the first step in preparing the textual data of the emails and providing them to the next steps. In this step, all unnecessary content in the process of filtering the emails was removed. To this end, we used the natural language toolkit (NLTK 3.6.2) of Python to implement the following steps through which an email is processed:

• 1. Word tokenization: Sentences comprising emails are segmented into their basic words each of which is called a token.

• 2. Removal of stopwords: Words that have no meaning such as “a”, “the” and “of” are filtered. Moreover, numbers and punctuation are deleted. In addition, white spaces such as tabs, newlines, and repeated spaces are all reduced to a single space.

• 3. Lemmatization: By lemmatization, word affixes are removed to return the word to its base form. For example, the words “crash”, “crashes” and “crashing” are replaced by “crash”.

• 4. Case conversions: All text in emails is converted to the lowercase form.

Applying these steps results in clean text that is appropriate for the following steps.

### 5.2 Feature Extraction

Features have an important impact on data mining. Extracting appropriate features helps solve several problems, including dimensional reduction, training convergence, overfitting avoidance, and classification accuracy. The purpose of this process was to obtain a set of features for each instance in the dataset. An instance here represents an email in which the features are a reflection of the vocabulary (textual content) of the email being analyzed. The result of this process is a vector of features entered into the LCS model to determine whether a given email is ham or spam.

In this study, binary values were used to represent feature vectors, which can be determined as follows. Let W = {w1, w2, …, wk} be a set that contains all unique words appearing in the emails of the dataset. The frequency of each word in set W is identified. We assume that n represents the number of the most frequent words in W. Then, given email bi, an n-dimensional feature vector xi = {x1, x2, …, xk} is built, where the feature xi is equal to 1 if the corresponding word wiW is held in bi, and 0 otherwise.

### 5.3 Training and Testing of the Learning Model

In this study, the UCS algorithm is adopted as a learning algorithm, in which one training instance of the dataset being considered is used in each iteration. Eight phases are used in the learning cycle of the UCS. Phases 1, 2, 3, 4, 5, 6, 8, and 9 describe the learning part of the cycle, whereas phase 7 is for rule discovery (see Figure 1). These phases are repeated until a user-predefined number, say maxIterations, is reached.

One training instance from the dataset was received in each iteration (phase 1). This training instance is checked for a match in set P (phase 2) to form set M (phase 3). A ternary alphabet is adopted to represent the rules, i.e., symbols 0, 1, and # are used. Hence, setM holds every rule with conditions that match the features in the current training instance. Further, the set C which has all rules of M with the same label (class) as the training instance, is created (phase 4). At this moment, if set C is empty, covering will be invoked to produce a matching rule that has the correct class, and then, it is added to both sets M and C. The covering operation scans the features of the instance and randomly inserts the symbol # with probability P# (phase 5). Another case in which the invocation of covering is required is at the initial iteration, as the UCS starts with zero rules in the set P. Parameters such as the accuracy and fitness associated with each rule in C are updated in the next step (phase 6). In this study, for rule i, the accuracy and fitness are calculated as follows:

Accuracyi=|Ci||Mi|,Fitnessi=Accuracyi.

For rule discovery (phase 7), the GA is applied to produce two new rules, which are added to the population P in each learning iteration. Essentially, the GA selects two rules (parents) using a ternary tournament selection for rule fitness. These rules are reproduced by a uniform crossover to provide two new rules (offspring). The output of the GA is then passed to subsumption (phase 8) to test whether both parents are more general than the offspring, and whether their |P| > θsub and Accuracy > acc0 hold. If so, the offspring will not be added to P, and the parent’s numerosity (a parameter representing the number of copies a rule has in P) is increased; otherwise, the offspring is added. The last step of the learning cycle is deletion (phase 9). This step is activated if the number of rules in P exceeds the user-defined number maxPopSize. Basically, a rule using binary tournament selection is selected, and either the numerosity of that rule is decreased, or the rule is entirely removed from P.

The pseudocode for implementing the details of the UCS is given in Algorithm 1. Given a set of training instances in which each element is a pair of feature vectors xi and its label yi, a rule-based model captured in set P is produced. Initially, set P is empty, and it grows by looping within the first loop (step 2). The first rule appears in set P by computing steps 5–7 by activating the covering. In addition, the covering is triggered again (step 17) when every rule in set M has an action a different from the label of the current training instance. Computing the fitness of rule p in C is conducted in steps 26–28. The number of rules in C (pcorrect) is divided by the number of rules in M (pexperience). GA, subsumption, and deletion were executed at the end of the UCS algorithm. Given set C, runGA creates offspring G, which are then passed to runSubsumption to either be added to P, or to increase the numerosity of the parent.

After the learning step ended, a rule-compaction strategy was applied. As described previously, the aim of applying this strategy is to remove redundant rules from P, and obtain a compact set of accurate rules Pc to be used as a prediction model. Algorithm 2 captures details of the QRC strategy adopted in this study. Given the set of rules P and training instances of the dataset, Algorithm 1 yields Pc. The algorithm starts by sorting the rules in decreasing order of fitness. The first rule from the resulting set is then selected, and all instances that match this rule are found (steps 5–10). Next, the same rule is moved to set Pc (steps 12–13), and all of its matched instances are removed from the training set (steps 9 and 15). This cycle is repeated until the training set is empty.

In the testing mode, a new input instance is given, and the Pc model is employed to predict the associated class as follows. The match set M is first created, and a vote is determined for each class given by the rules of M. It is assumed that M0 (resp. M1) is the subset of all rules in M corresponding to ham (resp. spam). In addition, we assume that vote0 and vote1 correspond to the ham and spam classes, respectively. Then, for the rules in M

votei=jMiFitnessj×numerosityji{0,1}

is used to determine the values of votes. If vote0 > vote1, label 0 (ham) is associated with the input instance; if vote0 < vote1, label 1 (spam) is assigned. If vote0 = vote1, the label is randomly selected.

We conclude this section by commenting on the computation/ memory requirements using the present approach. The training stage (offline step) incurs high computational cost. The search process to obtain the optimal set of rules is slow compared to techniques based on a single local solution. However, the resulting model (rule set) can provide classification decisions in real time. This is an important outcome obtained by applying the UCS approach.

This section presents the experimental results obtained using the proposed approach. Two datasets with different features and sizes were used. The first dataset (http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html) contains data from approximately 960 emails. This dataset is balanced, in which spam and ham emails are equally distributed with 700 and 260 emails, respectively, for training and testing. The other dataset (https://www.cs.cmu.edu/~enron/) is called Enron4, and contains information about 6000 emails. The emails in this dataset are unbalanced, and are partitioned into 1500 spam emails and 4500 ham emails. This dataset was balanced by applying the method described in [31], in which the majority class was reduced to have the same number of instances as the minority class. Likewise, this dataset was split into 2250 emails for training and 750 for testing. Note that both datasets have emails labeled with the class to which they belong.

In this study, parameters of the UCS are set as follows: maxIterations = 200000, P# = 0.95, Pc = 0.8, Pm = 0.001, θsub = 20 and acc0 = 0.999. Performance experiments were conducted with maxPopSize set to 20000 for the Ling-Spam dataset and to 10000 for the Enron4 dataset. In addition, n (the number of most frequent words) is given the value 503 (resp. 112) for the Ling-Spam (Enron4). Note that the parameters of the rules created by covering are initially configured as pexperience = pcorrect = 1 accuracy = 1, fitness = 1 and numersity = 1. Similarly, when rules are formed by GA, their parameters are initialized. To avoid random circumstances and to validate the results, these experiments were repeated 10 times, and the average was considered as the final result.

Then, the performance of the proposed approach was evaluated by applying different criteria, such as accuracy, precision, recall, and F-measure. Figure 2 illustrates the classification accuracy over a range of iterations. Figure 2(a) shows that the results obtained from the analysis of Ling-Spam indicate that the accuracy increases with an increase in the number of iterations. The maximum value of 0.9661 was obtained when the number of iterations was set to 200000. Furthermore, Figure 2(b) shows that higher performance can be achieved using Enron4, where the accuracy reaches 0.99 starting from 150000 iterations. This result is expected, because the size of the Enron4 dataset is larger than that of the Ling-Spam dataset.

The measurement of performance in terms of precision, recall, and F-measure is depicted in Figure 3. The goal is to study the ratio of true positives to true negatives. In the spam email filtering problem, the most interesting case arose when a given email was classified as ham, while it was spam. From Figure 3(a), it can be observed that an increase in the number of iterations leads to an increase in the values of the three measures under consideration with a slight fluctuation. The results show that the highest performance is recorded for recall with a value of 0.9753, i.e., a higher true positive ratio. The results in Figure 3(b) provide a more stable picture and an obvious improvement in the performance obtained earlier, i.e., precision, recall, and F-measure measures reach their peak performance at a value of 0.997 with 150000 iterations. This is because working with a larger dataset results in a higher probability of speeding up the learning process.

The size of the model in terms of the number of rules and features in each rule is illustrated in Figure 4(b). For both datasets, it can be seen that with more iterations, the UCS requires fewer rules, as shown in Figure 4(a). However, the results obtained for Enron4 were better than those obtained for Ling-Spam. The best number of rules is obtained in the case of the Enron4 dataset, which is only 74. Although the Enron4 dataset contained more e-mails than the Ling-Spam dataset, there were fewer rules. This result is expected, because a large dataset gives the LCS the opportunity to produce a more compact set of general rules.

Table 2 summarizes the results of applying the QRC strategy to both the datasets. In each case, the number of rules before and after the application of the QRC was listed. A significant reduction in the number of rules formed by the UCS is achieved; for example, the number of rules was reduced from 4294 to 74 in the case of Enron4. Note that all these rules are %100 accurate, i.e., the set of instances covered by a given rule is correctly classified.

However, a similar relationship is observed between the number of features and the number of iterations, as depicted in Figure 4(b). Again, the UCS yields better results for Enron4 than for Ling-Spam, and these results gradually improve when the number of iterations increases. UCS implicitly applies feature selection by presenting the symbol #. In other words, a given feature in the training set can be ignored if the corresponding condition over all rules in P contains symbol #. For instance, assume that Pc = {p1, p2, p3} where p1 = (1,#, 0,#, 1), p2 = (1,#, 1, 0, 1) and p3 = (0,#, 1, 1, #). Since the second condition of each rule pi in Pc contains #, the corresponding feature of each instance in the dataset is ignored, because it does not affect the classification decision. Thus, we can see that in the case of Enron4, the UCS starts with 112 features and ends with 70 features, i.e., 42 features were ignored by the UCS during the learning phase.

Finally, a comparison between the performance of the proposed approach and most relevant approaches is presented in Table 3. The Enron4 dataset was used for comparison. Using four evaluation measures, the results indicate that the UCS-based approach provides higher performance than the other approaches. A value of approximately 0.99 is obtained by the present approach compared to 0.95 obtained by the other approaches in the best cases. The reason why the UCS performs better could stem from its ability to divide the problem into sub-problems that are easy to solve compared with other approaches that rely on a single solution. This enables the UCS to produce a global solution that cannot be guaranteed by other approaches.

In this study, a new approach for filtering spam emails is presented. This approach uses the UCS algorithm to create a learning model for filtering. The learning model was captured in the form of a set of rules generated by the UCS. The proposed approach comprises three main components. A pre-processing plus feature extraction component is first implemented for refining, normalizing, and representing the email data. Then, the UCS algorithm is applied in the learning process that incorporates the GA as an optimization method to discover new rules. The use of rules presents a human-readable, form and provides an easy output to interpret and understand. However, a considerable problem arises because of the large number of rules generated by the UCS in addition to the large number of conditions in each rule. Thus, in the final stage, the QRC strategy is adopted to reduce the number of rules, while maintaining the performance of the generated learning model. From the experimental results, it can be observed that selecting the appropriate parameters, such as the number of learning iterations and the size of the rule population, has a major impact on the performance, stability, and generalization of the proposed approach. Moreover, the proposed approach yields promising performance compared with the most popular existing approaches. This performance was also supported by a reduction in the size of the learning model. In particular, a reduction rate of 98.27% is achieved with respect to the number of rules due to the QRC strategy. Additionally, a reduction rate of 37.5% was obtained for the number of conditions. This was a result of the implicit ability of the UCS algorithm to supply a form of feature selection. In future work, exploring the more relevant features of instances in the datasets is required to disregard remaining instances, thus reducing the number of conditions on the rules.

Fig. 1.

Architecture of the proposed approach.

Fig. 2.

Experimental results based on accuracy: (a) Ling-Spam dataset, (b) Enron4 dataset.

Fig. 3.

Evaluation results based on different measures: (a) Ling-Spam dataset, (b) Enron4 dataset.

Fig. 4.

Size of the model: (a) number of rules, (b) number of features.

Table. 1.

Algorithm 1. UCS..

 Input:T = {(x1, y1), (x2, y2), …, (xm, ym)}, a set of training data. Output:P = {(c1, a1), …, (cr, ar)}, a set of rules. 1: InitialiseP ← ∅︀ 2: for alli = 1, …, maxIterationdo 3: M ← ∅︀, C ← ∅︀ 4: j ← i modm 5: ifP = ∅︀ then 6: Rc ← covering(tj) 7: P ← P ∪ {Rc} 8: else 9: for allp ∈ P s.t. p = (c, a) do 10: ifmatch(x, c) then 11: pexperience ← pexperience + 1 12: M ← M ∪ {p} 13: end if 14: end for 15: end if 16: if ∄p ∈ M s.t. p = (c, a) and a = ythen 17: Rc ← covering(tj) 18: P ← P ∪ {Rc} 19: end if 20: for allp ∈ P s.t. p = (c, a) do 21: ifa = ythen 22: pcorrect ← pcorrect + 1 23: C ← C ∪ {p} 24: end if 25: end for 26: for allp ∈ Cdo 27: pfitness←pcorrectpexperience 28: end for 29: G ← runGA(C) 30: runSubsumption(G) 31: while |P| > maxPopSizedo 32: Find the worst rule, pw 33: ifpnumerosityw>1then 34: pnumerosityw←pnumerosityw-1 35: else 36: Delete pw from P 37: end if 38: end while 39: end for

Table. 2.

Algorithm 2. QRC..

 Input:T = {x1, x2, …, xm}, a set of training data; and P = {p1, …, pr} is a set of rules. Output:Pc = {p1, …, pf }, a set of compact rules where f ≤ r. 1: Sort P in a descending order of fitness. 2: Let Tnew ← ∅︀ 3: whileT ≠ ∅︀ do 4: MatchCount ← 0 5: for allx ∈ Tdo 6: ifmatch(x, p1) then 7: MatchCount ← MatchCount + 1 8: else 9: Tnew ← Tnew ∪ {x} 10: end if 11: end for 12: ifMatchCount > 0 then 13: Pc ← Pc ∪ {p1} 14: end if 15: T ← Tnew 16: P ← P \ {p1} 17: end while

Table. 3.

Table 1. Some examples of rules produced by the UCS.

No.ConditionAction
1# # # 1 # # # 0 # # 0 #0
2# # 0 # 0 # # 1 # # 1 #1

Table. 4.

Table 2. Results of rule compaction.

IterationsLing-SpamEnron4

beforeafterbeforeafter
5000019791277737094

10000019738155676682

15000019737104567480

2000001950383429474

Table. 5.

Table 3. Evaluation of the Enron4 dataset.

Learning algorithmPrecisionRecallF-measureAccuracy
Bayes net0.8800.8430.8380.8426
Naïve Bayes0.8930.8680.8660.868
Logistic function0.9510.9510.9510.951
SVM0.9390.9330.9330.933
UCS0.9970.99760.99730.9973

Bold font indicates best results..

1. Saleh, AJ, Karim, A, Shanmugam, B, Azam, S, Kannoorpatti, K, Jonkman, M, and Boer, FD (2019). An intelligent spam detection model based on artificial immune system. Information. 10. article no 209
2. Cormack, GV (2008). Email spam filtering: a systematic review. Foundations and Trends in Information Retrieval. 1, 335-455. https://doi.org/10.1561/1500000006
3. Bhowmick, A, and Hazarika, SM. (2016) . E-mail spam filtering: a review of techniques and trends. Available: https://arxiv.org/abs/1606.01042
4. Ning, B, Junwei, W, and Feng, H (2019). Spam message classification based on the Naïve Bayes classification algorithm. IAENG International Journal of Computer Science. 46, 46-53.
5. Karim, A, Azam, S, Shanmugam, B, Kannoorpatti, K, and Alazab, M (2019). A comprehensive survey for intelligent spam email detection. IEEE Access. 7, 168261-168295. https://doi.org/10.1109/ACCESS.2019.2954791
6. Shaukat, K, Luo, S, Varadharajan, V, Hameed, IA, and Xu, M (2020). A survey on machine learning techniques for cyber security in the last decade. IEEE Access. 8, 222310-222354. https://doi.org/10.1109/ACCESS.2020.3041951
7. Dada, EG, Bassi, JS, Chiroma, H, Adetunmbi, AO, and Ajibuwa, OE (2019). Machine learning for email spam filtering: review, approaches and open research problems. Heliyon. 5. article no. e01802
8. Ye, X, Hong, SS, and Han, MM (2020). Feature engineering method using double-layer hidden Markov model for insider threat detection. International Journal of Fuzzy Logic and Intelligent Systems. 20, 17-25. https://doi.org/10.5391/IJFIS.2020.20.1.17
9. Wicaksana, K, and Cahyani, DE (2021). Modification of a density-based spatial clustering algorithm for applications with noise for data reduction in intrusion detection systems. International Journal of Fuzzy Logic and Intelligent Systems. 21, 189-203. https://doi.org/10.5391/IJFIS.2021.21.2.189
10. Sigaud, O, and Wilson, SW (2007). Learning classifier systems: a survey. Soft Computing. 11, 1065-1078. https://doi.org/10.1007/s00500-007-0164-0
11. Arif, MH, Li, J, Iqbal, M, and Liu, K (2018). Sentiment analysis and spam detection in short informal text using learning classifier systems. Soft Computing. 22, 7281-7291. https://doi.org/10.1007/s00500-017-2729-x
12. Butz, MV, and Wilson, SW (2002). An algorithmic description of XCS. Soft Computing. 6, 144-153. https://doi.org/10.1007/s005000100111
13. Roozegar, M, Mahjoob, M, Esfandyari, M, and Panahi, MS (2016). XCS-based reinforcement learning algorithm for motion planning of a spherical mobile robot. Applied Intelligence. 45, 736-746. https://doi.org/10.1007/s10489-016-0788-9
14. Bernado-Mansilla, E, and Garrell-Guiu, JM (2003). Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary computation. 11, 209-238. https://doi.org/10.1162/106365603322365289
15. Orriols-Puig, A, and Bernado-Mansilla, E. (2006) . UCS Revisiting description, fitness sharing, and comparison with XCS. Learning Classifier Systems, 96-116. https://doi.org/10.1007/978-3-540-88138-4_6
16. Urbanowicz, RJ, and Browne, WN (2017). Introduction to Learning Classifier Systems. Heidelberg, Germany: Springer
17. Wilson, SW (2001). Compact rulesets from XCSI. Advanced in Learning Classifier Systems. Heidelberg, Germany: Springer, pp. 197-208 https://doi.org/10.1007/3-540-48104-4_12
18. Tan, J, Moore, J, and Urbanowicz, R (2013). Rapid rule compaction strategies for global knowledge discovery in a supervised learning classifier system. ECAL 2013: The Twelfth European Conference on Artificial Life. Cambridge, MA: MIT Press, pp. 110-117 https://doi.org/10.7551/978-0-262-31709-2-ch017
19. Kakade, AG, Kharat, PK, Gupta, AK, and Batra, T . Spam filtering techniques and MapReduce with SVM: a study., Proceedings of 2014 Asia-Pacific Conference on Computer Aided System Engineering (APCASE), 2014, South Kuta, Indonesia, Array, pp.59-64. https://doi.org/10.1109/APCASE.2014.6924472
20. Sharma, P, and Bhardwaj, U (2018). Machine learning based spam e-mail detection. International Journal of Intelligent Engineering and Systems. 11, 1-10.
21. Renuka, DK, Visalakshi, P, and Sankar, T (2015). Improving e-mail spam classification using ant colony optimization algorithm. International Journal of Computer Applications. 2, 22-26.
22. Peng, W, Huang, L, Jia, J, and Ingram, E . Enhancing the naive bayes spam filter through intelligent text modification detection., Proceedings of 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), 2018, New York, NY, Array, pp.849-854. https://doi.org/10.1109/TrustCom/BigDataSE.2018.00122
23. Singh, M, Pamula, R, and Shekhar, SK . Email spam classification by support vector machine., Proceedings of 2018 International Conference on Computing, Power and Communication Technologies (GUCON), 2018, Greater Noida, India, Array, pp.878-882. https://doi.org/10.1109/GUCON.2018.8674973
24. Al-Ajeli, A, Alubady, R, and Al-Shamery, ES (2020). Improving spam email detection using hybrid feature selection and sequential minimal optimization. Indonesian Journal of Electrical Engineering and Computer Science. 19, 535-542. https://doi.org/10.11591/ijeecs.v19.i1.pp535-542
25. Sheikhalishahi, M, Saracino, A, Mejri, M, Tawbi, N, and Martinelli, F (2015). Fast and effective clustering of spam emails based on structural similarity. Foundations and Practice of Security. Cham, Switzerland: Springer, pp. 195-211 https://doi.org/10.1007/978-3-319-30303-1_12
26. Sheikhalishahi, M, Saracino, A, Martinelli, F, La Mejri, Marra A, and Tawbi, N (2020). Digital waste disposal: an automated framework for analysis of spam emails. International Journal of Information Security. 19, 499-522. https://doi.org/10.1007/s10207-019-00470-x
27. Barushka, A, and Hajek, P (2018). Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Applied Intelligence. 48, 3538-3556. https://doi.org/10.1007/s10489-018-1161-y
28. Jain, G, Sharma, M, and Agarwal, B (2019). Spam detection in social media using convolutional and long short term memory neural network. Annals of Mathematics and Artificial Intelligence. 85, 21-44. https://doi.org/10.1007/s10472-018-9612-z
29. Yaseen, Q (2021). Spam email detection using deep learning techniques. Procedia Computer Science. 184, 853-858. https://doi.org/10.1016/j.procs.2021.03.107
30. Srinivasan, S, Ravi, V, Alazab, M, Ketha, S, Al-Zoubi, AM, and Kotti Padannayil, S (2021). Spam emails detection based on distributed word embedding with deep learning. Machine Intelligence and Big Data Analytics for Cyber-security Applications. Cham, Switzerland: Springer, pp. 161-189 https://doi.org/10.1007/978-3-030-57024-8_7
31. Orriols-Puig, A, and Bernado-Mansilla, E (2003). The class imbalance problem in UCS classifier system: a preliminary study. Learning Classifier Systems. Heidelberg, Germany: Springer, pp. 161-180 https://doi.org/10.1007/978-3-540-71231-2_12

Al-Ajeli Ahmed received the B.Sc. and M.Sc. degrees in Computer Science from the University of Babylon, Iraq, in 1999 and 2002, respectively. Since then, he worked as an assistant lecturer in the Department of Computer Science, University of Babylon. In 2017, he received his Ph.D. in computer science from the University of Birmingham, UK. Currently, he is an assistant professor at the College of Information Technology, University of Babylon. His current research interests include fault diagnosis and prognosis in discrete-event systems, machine learning, and anomaly detection.

Eman S. Al-Shamery received the B.Sc. and M.Sc. degrees in Computer Science from the University of Babylon, Iraq, in 1998 and 2001, respectively. Since then, she worked as an assistant lecturer in the Department of Computer Science, University of Babylon. In 2013, she received her Ph.D. in computer science from the University of Babylon. Currently, she is a professor at the Software Department, University of Babylon. Her current research interests include artificial intelligence, bioinformatics, machine learning,neural networks, deep learning and data mining.

Raaid Alubady received his Ph.D. in information technology from Universiti Utara Malaysia in 2017. He received a bachelor’s degree in Computer Sciences from the University of Babylon-Iraq, a higher diploma in data security from the Iraqi Commission for Computers and Informatics Iraq, and a Master’s degree in Information Technology from UUM, Malaysia. Alubady is a lecturer in the Network Information Department, College of Information Technology, University of Babylon-Iraq. He is a member of IEEE and is actively involved in IEEE activities. In addition, he is a member of the Internet Society Malaysia Chapter, a member of the Iraqi Association for IT specialists, Iraq, and a reviewer for several international academic journals and conferences. It is currently being attached to the InterNetWorks Research Laboratory (IRL). The current area of research focuses on the future Internet (ICN and NDN), wireless networking/MANET, Internet of Things, routing protocol, and performance analysis.

### Article

#### Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 233-244

Published online September 25, 2022 https://doi.org/10.5391/IJFIS.2022.22.3.233

## An Intelligent Spam Email Filtering Approach Using a Learning Classifier System

Ahmed Al-Ajeli, Eman S. Al-Shamery, and Raaid Alubady

College of Information Technology, University of Babylon, Babylon, Iraq

Correspondence to:Ahmed Al-Ajeli (a.alajeli@uobabylon.edu.iq)

Received: February 11, 2022; Revised: April 19, 2022; Accepted: May 24, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

### Abstract

Emails have become a common modality to exchange messages and information over the internet. Some emails are frequently received from malicious senders, which can lead to several problems. Thus, there is an urgent need to develop reliable and powerful methods for filtering such emails. In this paper, we present a new approach to filter emails to spam and ham based on the text of the email being analyzed. After feature extraction, a supervised classifier system is adopted to generate a rule-based model that captures spam and non-spam patterns used later as an interactive filter. The aim is to obtain a compact model that provides a highly accurate performance using only a few rules. The proposed approach was applied to two datasets in terms of size and features. Experimental results indicated that an accuracy of 99:7% was achieved using a model with only 74 rules and 70 conditions. With this reduced number of rules, the proposed approach presents an efficient solution for filtering spam email using the trained model. In addition, the results of the proposed approach showed better performance than their counterparts.

Keywords: Spam email, Learning classifier systems, Rule-based systems, Rule compaction

### 1. Introduction

Billions of people around the world use emails as an effective, fast, and inexpensive method of exchanging messages, and they are considered an essential communication medium. Saleh et al. [1] showed that nearly five billion email accounts were actively utilized in 2017, and by the end of 2019 this number was expected to increase to more than five and half billion. They also highlighted the fact that there are 270 billion emails exchanged daily, and approximately 57% of these are spam emails. The term “spam email” according to the Text Retrieval Conference (TREC) means an unsolicited and unwanted email that is indiscriminately sent via email accounts [2]. These spam emails may carry different types of malicious content, such as the proliferation of unwanted advertisements, phishing, and fraud schemes.

Spam emails have a set of effects that can be listed as follows: (i) disturbing individual users, (ii) unreliable emails, (iii) wasting computational resources allocated for file servers, (iv) network bandwidth ill-usage, (v) loss of work productivity, (vi) increasing the probability of spreading viruses such as Trojan horses and worms, and (vii) increasing financial losses by denial of service attacks, phishing attacks, and directory harvesting attacks [3]. Furthermore, with the increasing popularity of emails, the problem of spam email filtering has become crucial for social stability and public security [4]. Hence, an effective and smart solution to address the problem of filtering spam emails is urgently required.

To this end, several approaches have been proposed to address the problem in question (see [5,6] for a systematic review). Nevertheless, thousands of spam emails are still encountered by email users every day. This is a result of the changing nature of emails, which leads researchers to explore new methods from the artificial intelligence field, in particular, machine learning methods [7]. Recently, some studies have also proposed the application of machine-learning techniques to address the detection problems of unwanted scenarios [8,9].

The focus of this study is the use of machine learning techniques to address the filtering problem of spam emails. Learning classifier system (LCS) approaches have been investigated to address this problem. The LCS model is a rule-based system in which each rule covers part of the state space [10]. Evolutionary algorithms such as a genetic algorithm (GA) are applied to evolve each rule and guide the search to obtain the optimal set of rules based on the classification accuracy. The problem of spam email filtering was first addressed using the LCS model in [11]. This study proposes an approach that extends the extended classifier system (XCS) model (reinforcement learning model) [12,13] of the LCS by presenting a new scheme for the representation of rules. The aim was to address the problem of sparseness in feature vectors encoded by the term frequency-inverse document frequency (TF-IDF). However, the size of the resulting learning model has not been considered and studied. In other words, the number of rules and the conditions in each rule need to be considered, as this could be very problematic, especially when running the model online.

The aim of the present study is to construct a rule-based machine learning approach for the detection of spam emails. This approach produces a filter (model) with the minimum size (a few rules), while maintaining the same performance. The supervised classifier system (UCS) [1416] was adopted to generate a learning model that can then be used to filter spam emails. The UCS model was specifically designed for supervised learning, as is the case in the current problem. Two datasets of different sizes and features were considered to evaluate the performance of the proposed approach. Both datasets have a set of emails in the form of text files that are cleaned and transformed into feature vectors capturing binary values. Each dataset is divided into two subsets: training and testing. The training subset was passed to the UCS algorithm to generate an optimal rule-based learning model. The resulting model can contain a considerable number of redundant rules. In this step, a rule compaction strategy was applied to reduce the number of redundant rules, and several strategies have been developed [17] to this end. In this study, a strategy called quick rule compaction (QRC) is adopted [18]. This provides an approach that is different from the previous works in that it builds a rule-based machine learning model that is human-readable (if-then-action form) and capable of making a decision in near real-time.

The remainder of this paper is organized as follows. A brief summary of related work is provided in Section 2. Sections 3 and 4 provide formal descriptions of the spam email filtering problem and learning classifier systems, respectively. In Section 5, details of the proposed approach are presented. The experimental results and discussion are presented in Section 6. The conclusions are presented in Section 7.

### 2. Related Work

In this section, the most popular techniques to address the problem of spam email filtering are briefly reviewed. According to Kakade et al. [19], spam filtering techniques can be classified into four categories: machine learning-based techniques, content-based techniques, list or word-based techniques, and hybrid techniques. In the scientific community, machine learning-based techniques are often trusted and adopted with a powerful mathematical background, which motivates the popularity and success of such techniques.

A simple classifier model is the naïve Bayes, which is based on Bayes’ theorem with a strong independence hypothesis [20]. It is a probability-based classifier that calculates the separation probabilities for specific instances. The probability set is computed by determining the harmonic and frequency values of the dataset being analyzed. Then, the class probability closest to the back end is chosen by the classifier. This algorithm is also a multilayered classifier that works efficiently in a supervised learning paradigm. However, the accuracy of this classifier depends on the type of dataset being handled, in addition to the misspelled words present in emails. Consequently, a decrease in the accuracy and efficiency of the classification is observed. Further work has been conducted to mitigate the weaknesses of the naïve Bayes algorithm [4,21,22].

The support vector machine (SVM) technique is also applied in email classification [23]. It is a supervised learning approach and is ranked as the best “off-the-shelf” method recently used in the spam email filtering problem. In the machine learning community, it has also become one of the most sought-after classifiers, because it provides better generalization, requires fewer samples for training, and can handle high-dimensional datasets with the help of kernel-based decision machines. However, it is not suitable for large datasets, and its performance may decrease when the number of features exceeds the number of training instances.

Further improvements in the performance and efficiency of SVM techniques have been achieved by applying the sequential minimal optimization (SMO) algorithm for spam email filtering [24]. SMO is a modified form for solving the quadratic programming problem of the standard SVM learning algorithm. The function of this modified form is to improve efficiency and optimize the learning model. Moreover, the proposed approach incorporates a hybrid feature selection algorithm to reduce dimensionality and increase scalability. In addition, a comparison of the performance of the standard SVM approach and several previous approaches is presented. This comparison shows that the proposed approach outperformed these approaches.

The use of clustering for analyzing spam emails was reported in [25,26]. The authors propose a framework to address the problem of spam email filtering, based on the use of a categorical clustering tree algorithm for both groups, and classifies large datasets and data streams of spam emails into homogeneous campaigns by structural similarity. Two modes of operation, batch and dynamic, were applied to handle the data. While the aim of grouping spam emails is to identify the campaigns of similar emails, classification is used to assign labels to certain emails (e.g., malware, phishing advertisement, etc.).

Furthermore, a spam filter that mainly adopts deep learning techniques has been proposed [27]. This filter integrates three main components: (a) feature representation based on N-gram TF-IDF to handle high dimensionality, (b) a modified balancing algorithm, and (c) a multilayer perceptron neural network model. The authors demonstrated the performance of the proposed approach using four datasets. In addition, a comparison with state-of-the-art anti-spam filters is provided. The results of the comparison show a better performance in terms of the filtering accuracy of the proposed approach. Similarly, another study based on a convolutional neural network and long short-term memory architecture was proposed [28]. This architecture was supported by adding semantic information to the representation of words. The experimental results were obtained using two datasets (SMS and Twitter) to measure the performance in terms of the F1-score and accuracy. In addition, several variations have been developed to improve the performance of deep-learning techniques for spam email detection [29,30].

### 3. Spam Email Filtering

In this section, the problem of spam email filtering is described formally. Consider a set of emails E = {e1, e2, …, en} partitioned into two sets Es (the set of spam emails) and Eh (the set of ham emails). For each instance eiE there exists an associated label l ∈ {0, 1}, where 0 and 1 denote ham and spam classes, respectively. The goal is to build a filter such tha, for any unseen received email eE, a label (also called a target class) is given to that email. In other words, the problem of email spam filtering is reduced to a binary classification problem, in which only two classes exist.

### Definition 1

Given two sets E = EhEs and E′ of emails such that EE′, filter F is a mapping constructed from E such that F : E′ → {0, 1}. Subsequently, by receiving an email eE′, this filter identifies two target classes as follows:

• F(e) = 0 if e is ham

• F(e) = 1 if e is spam

In this study, the focus to filter a spam email is on the textual content (body) of the email itself, i.e., the only information considered is contained in the body of the email being analyzed. In other words, the source from which emails come is not considered in order to distinguish between emails.

### 4. Learning Classifier Systems

In this section, we briefly discuss LCS. According to Sigaud and Wilson [10], an LCS is defined as a rule-based system in which a set of optimal rules is automatically built. The LCS comprises a population of rules called classifiers evolving by an adaptive algorithm, such as a GA, to discover the optimal rules. In this sense, the LCS can be considered a learning system through which a set of optimal rules is learned. The main source of the data from which these rules are built is called the environment, which in our case is the training set of emails being analyzed.

The core of the LCS is the rule that represents the building blocks of the resulting model. These rules comprise two parts: the condition part (corresponding to the features) and the action part (corresponding to the class). Rules can be interpreted in a human-readable form as ifconditionthenaction. The rules in this form are similar to the structure of the instances in the training set, except for the symbol #, i.e., do not care, and are included in the condition part of the rules. This symbol provides a generalization between the corresponding features in the dataset and the predicted class. It is noteworthy that there might be more than one condition labelled with the symbol #.

UCS uses incremental learning, in which an instance of the dataset is selected in each iteration. We assume that P represents the population of rules. From this set, a new set M with all the rules in P whose conditions match the selected instance is created. Matching is the process of identifying the given instance to which it belongs. This is achieved by checking the specific values in the rule condition (0 or 1) for the corresponding features of the current instance. A match is made if and only if each of these specific values and their corresponding features are equal. Note that symbol # can match any value, and the action part of the rule is excluded from the matching process. Next, a further set C is built to hold the rules of M with an action matched to the label of the instance being considered. Initially, set P is empty. In this case, an operation called covering is activated, and a new rule is then created and added to P. Creation is accomplished by replacing the specific features of the current instance with the symbol # at probability P#. Another case that requires activation of covering arises when no rule in M has an action matched to the label of the instance.

The UCS builds an optimal set of rules using accuracy-based fitness to guide the search for these optimal rules. Each rule is associated with a set of parameters, the most important of which are: accuracy, fitness, experience, and numerosity. The quality of a rule is measured based on its accuracy and fitness. Numerosity is defined as the number of copies a rule has in the population, and experience is the number of times a rule participates in M. These parameters were updated after each learning iteration. To discover a new rule, the GA selects two rules as a parent (based on their fitness) from set C, and produces two new offspring. These new offspring are added to population P. Previously, they were checked for subsumption of its parent. The goal of the subsumption mechanism is to determine whether the parent is more general than its offspring. If this is not the case, the two offsprings are added to the population.

Table 1 provides an example of the two rules presented in this approach. Each rule has as a condition part 12 entries with the values 0, 1 or #. The action part contained 0 or 1. We assume that the input vector is (0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1), which matches rule 1, but not rule 2.

### 5. Proposed Method

An outline of the proposed approach is presented in Figure 1. Three main components are: pre-learning, UCS, and post-learning.

### 5.1 Pre-processing

Pre-processing is the first step in preparing the textual data of the emails and providing them to the next steps. In this step, all unnecessary content in the process of filtering the emails was removed. To this end, we used the natural language toolkit (NLTK 3.6.2) of Python to implement the following steps through which an email is processed:

• 1. Word tokenization: Sentences comprising emails are segmented into their basic words each of which is called a token.

• 2. Removal of stopwords: Words that have no meaning such as “a”, “the” and “of” are filtered. Moreover, numbers and punctuation are deleted. In addition, white spaces such as tabs, newlines, and repeated spaces are all reduced to a single space.

• 3. Lemmatization: By lemmatization, word affixes are removed to return the word to its base form. For example, the words “crash”, “crashes” and “crashing” are replaced by “crash”.

• 4. Case conversions: All text in emails is converted to the lowercase form.

Applying these steps results in clean text that is appropriate for the following steps.

### 5.2 Feature Extraction

Features have an important impact on data mining. Extracting appropriate features helps solve several problems, including dimensional reduction, training convergence, overfitting avoidance, and classification accuracy. The purpose of this process was to obtain a set of features for each instance in the dataset. An instance here represents an email in which the features are a reflection of the vocabulary (textual content) of the email being analyzed. The result of this process is a vector of features entered into the LCS model to determine whether a given email is ham or spam.

In this study, binary values were used to represent feature vectors, which can be determined as follows. Let W = {w1, w2, …, wk} be a set that contains all unique words appearing in the emails of the dataset. The frequency of each word in set W is identified. We assume that n represents the number of the most frequent words in W. Then, given email bi, an n-dimensional feature vector xi = {x1, x2, …, xk} is built, where the feature xi is equal to 1 if the corresponding word wiW is held in bi, and 0 otherwise.

### 5.3 Training and Testing of the Learning Model

In this study, the UCS algorithm is adopted as a learning algorithm, in which one training instance of the dataset being considered is used in each iteration. Eight phases are used in the learning cycle of the UCS. Phases 1, 2, 3, 4, 5, 6, 8, and 9 describe the learning part of the cycle, whereas phase 7 is for rule discovery (see Figure 1). These phases are repeated until a user-predefined number, say maxIterations, is reached.

One training instance from the dataset was received in each iteration (phase 1). This training instance is checked for a match in set P (phase 2) to form set M (phase 3). A ternary alphabet is adopted to represent the rules, i.e., symbols 0, 1, and # are used. Hence, setM holds every rule with conditions that match the features in the current training instance. Further, the set C which has all rules of M with the same label (class) as the training instance, is created (phase 4). At this moment, if set C is empty, covering will be invoked to produce a matching rule that has the correct class, and then, it is added to both sets M and C. The covering operation scans the features of the instance and randomly inserts the symbol # with probability P# (phase 5). Another case in which the invocation of covering is required is at the initial iteration, as the UCS starts with zero rules in the set P. Parameters such as the accuracy and fitness associated with each rule in C are updated in the next step (phase 6). In this study, for rule i, the accuracy and fitness are calculated as follows:

$Accuracyi=|Ci||Mi|,$$Fitnessi=Accuracyi.$

For rule discovery (phase 7), the GA is applied to produce two new rules, which are added to the population P in each learning iteration. Essentially, the GA selects two rules (parents) using a ternary tournament selection for rule fitness. These rules are reproduced by a uniform crossover to provide two new rules (offspring). The output of the GA is then passed to subsumption (phase 8) to test whether both parents are more general than the offspring, and whether their |P| > θsub and Accuracy > acc0 hold. If so, the offspring will not be added to P, and the parent’s numerosity (a parameter representing the number of copies a rule has in P) is increased; otherwise, the offspring is added. The last step of the learning cycle is deletion (phase 9). This step is activated if the number of rules in P exceeds the user-defined number maxPopSize. Basically, a rule using binary tournament selection is selected, and either the numerosity of that rule is decreased, or the rule is entirely removed from P.

The pseudocode for implementing the details of the UCS is given in Algorithm 1. Given a set of training instances in which each element is a pair of feature vectors xi and its label yi, a rule-based model captured in set P is produced. Initially, set P is empty, and it grows by looping within the first loop (step 2). The first rule appears in set P by computing steps 5–7 by activating the covering. In addition, the covering is triggered again (step 17) when every rule in set M has an action a different from the label of the current training instance. Computing the fitness of rule p in C is conducted in steps 26–28. The number of rules in C (pcorrect) is divided by the number of rules in M (pexperience). GA, subsumption, and deletion were executed at the end of the UCS algorithm. Given set C, runGA creates offspring G, which are then passed to runSubsumption to either be added to P, or to increase the numerosity of the parent.

After the learning step ended, a rule-compaction strategy was applied. As described previously, the aim of applying this strategy is to remove redundant rules from P, and obtain a compact set of accurate rules Pc to be used as a prediction model. Algorithm 2 captures details of the QRC strategy adopted in this study. Given the set of rules P and training instances of the dataset, Algorithm 1 yields Pc. The algorithm starts by sorting the rules in decreasing order of fitness. The first rule from the resulting set is then selected, and all instances that match this rule are found (steps 5–10). Next, the same rule is moved to set Pc (steps 12–13), and all of its matched instances are removed from the training set (steps 9 and 15). This cycle is repeated until the training set is empty.

In the testing mode, a new input instance is given, and the Pc model is employed to predict the associated class as follows. The match set M is first created, and a vote is determined for each class given by the rules of M. It is assumed that M0 (resp. M1) is the subset of all rules in M corresponding to ham (resp. spam). In addition, we assume that vote0 and vote1 correspond to the ham and spam classes, respectively. Then, for the rules in M

$votei=∑j∈MiFitnessj×numerosityj i∈{0,1}$

is used to determine the values of votes. If vote0 > vote1, label 0 (ham) is associated with the input instance; if vote0 < vote1, label 1 (spam) is assigned. If vote0 = vote1, the label is randomly selected.

We conclude this section by commenting on the computation/ memory requirements using the present approach. The training stage (offline step) incurs high computational cost. The search process to obtain the optimal set of rules is slow compared to techniques based on a single local solution. However, the resulting model (rule set) can provide classification decisions in real time. This is an important outcome obtained by applying the UCS approach.

### 6. Results and Discussion

This section presents the experimental results obtained using the proposed approach. Two datasets with different features and sizes were used. The first dataset (http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html) contains data from approximately 960 emails. This dataset is balanced, in which spam and ham emails are equally distributed with 700 and 260 emails, respectively, for training and testing. The other dataset (https://www.cs.cmu.edu/~enron/) is called Enron4, and contains information about 6000 emails. The emails in this dataset are unbalanced, and are partitioned into 1500 spam emails and 4500 ham emails. This dataset was balanced by applying the method described in [31], in which the majority class was reduced to have the same number of instances as the minority class. Likewise, this dataset was split into 2250 emails for training and 750 for testing. Note that both datasets have emails labeled with the class to which they belong.

In this study, parameters of the UCS are set as follows: maxIterations = 200000, P# = 0.95, Pc = 0.8, Pm = 0.001, θsub = 20 and acc0 = 0.999. Performance experiments were conducted with maxPopSize set to 20000 for the Ling-Spam dataset and to 10000 for the Enron4 dataset. In addition, n (the number of most frequent words) is given the value 503 (resp. 112) for the Ling-Spam (Enron4). Note that the parameters of the rules created by covering are initially configured as pexperience = pcorrect = 1 accuracy = 1, fitness = 1 and numersity = 1. Similarly, when rules are formed by GA, their parameters are initialized. To avoid random circumstances and to validate the results, these experiments were repeated 10 times, and the average was considered as the final result.

Then, the performance of the proposed approach was evaluated by applying different criteria, such as accuracy, precision, recall, and F-measure. Figure 2 illustrates the classification accuracy over a range of iterations. Figure 2(a) shows that the results obtained from the analysis of Ling-Spam indicate that the accuracy increases with an increase in the number of iterations. The maximum value of 0.9661 was obtained when the number of iterations was set to 200000. Furthermore, Figure 2(b) shows that higher performance can be achieved using Enron4, where the accuracy reaches 0.99 starting from 150000 iterations. This result is expected, because the size of the Enron4 dataset is larger than that of the Ling-Spam dataset.

The measurement of performance in terms of precision, recall, and F-measure is depicted in Figure 3. The goal is to study the ratio of true positives to true negatives. In the spam email filtering problem, the most interesting case arose when a given email was classified as ham, while it was spam. From Figure 3(a), it can be observed that an increase in the number of iterations leads to an increase in the values of the three measures under consideration with a slight fluctuation. The results show that the highest performance is recorded for recall with a value of 0.9753, i.e., a higher true positive ratio. The results in Figure 3(b) provide a more stable picture and an obvious improvement in the performance obtained earlier, i.e., precision, recall, and F-measure measures reach their peak performance at a value of 0.997 with 150000 iterations. This is because working with a larger dataset results in a higher probability of speeding up the learning process.

The size of the model in terms of the number of rules and features in each rule is illustrated in Figure 4(b). For both datasets, it can be seen that with more iterations, the UCS requires fewer rules, as shown in Figure 4(a). However, the results obtained for Enron4 were better than those obtained for Ling-Spam. The best number of rules is obtained in the case of the Enron4 dataset, which is only 74. Although the Enron4 dataset contained more e-mails than the Ling-Spam dataset, there were fewer rules. This result is expected, because a large dataset gives the LCS the opportunity to produce a more compact set of general rules.

Table 2 summarizes the results of applying the QRC strategy to both the datasets. In each case, the number of rules before and after the application of the QRC was listed. A significant reduction in the number of rules formed by the UCS is achieved; for example, the number of rules was reduced from 4294 to 74 in the case of Enron4. Note that all these rules are %100 accurate, i.e., the set of instances covered by a given rule is correctly classified.

However, a similar relationship is observed between the number of features and the number of iterations, as depicted in Figure 4(b). Again, the UCS yields better results for Enron4 than for Ling-Spam, and these results gradually improve when the number of iterations increases. UCS implicitly applies feature selection by presenting the symbol #. In other words, a given feature in the training set can be ignored if the corresponding condition over all rules in P contains symbol #. For instance, assume that Pc = {p1, p2, p3} where p1 = (1,#, 0,#, 1), p2 = (1,#, 1, 0, 1) and p3 = (0,#, 1, 1, #). Since the second condition of each rule pi in Pc contains #, the corresponding feature of each instance in the dataset is ignored, because it does not affect the classification decision. Thus, we can see that in the case of Enron4, the UCS starts with 112 features and ends with 70 features, i.e., 42 features were ignored by the UCS during the learning phase.

Finally, a comparison between the performance of the proposed approach and most relevant approaches is presented in Table 3. The Enron4 dataset was used for comparison. Using four evaluation measures, the results indicate that the UCS-based approach provides higher performance than the other approaches. A value of approximately 0.99 is obtained by the present approach compared to 0.95 obtained by the other approaches in the best cases. The reason why the UCS performs better could stem from its ability to divide the problem into sub-problems that are easy to solve compared with other approaches that rely on a single solution. This enables the UCS to produce a global solution that cannot be guaranteed by other approaches.

### 7. Conclusion

In this study, a new approach for filtering spam emails is presented. This approach uses the UCS algorithm to create a learning model for filtering. The learning model was captured in the form of a set of rules generated by the UCS. The proposed approach comprises three main components. A pre-processing plus feature extraction component is first implemented for refining, normalizing, and representing the email data. Then, the UCS algorithm is applied in the learning process that incorporates the GA as an optimization method to discover new rules. The use of rules presents a human-readable, form and provides an easy output to interpret and understand. However, a considerable problem arises because of the large number of rules generated by the UCS in addition to the large number of conditions in each rule. Thus, in the final stage, the QRC strategy is adopted to reduce the number of rules, while maintaining the performance of the generated learning model. From the experimental results, it can be observed that selecting the appropriate parameters, such as the number of learning iterations and the size of the rule population, has a major impact on the performance, stability, and generalization of the proposed approach. Moreover, the proposed approach yields promising performance compared with the most popular existing approaches. This performance was also supported by a reduction in the size of the learning model. In particular, a reduction rate of 98.27% is achieved with respect to the number of rules due to the QRC strategy. Additionally, a reduction rate of 37.5% was obtained for the number of conditions. This was a result of the implicit ability of the UCS algorithm to supply a form of feature selection. In future work, exploring the more relevant features of instances in the datasets is required to disregard remaining instances, thus reducing the number of conditions on the rules.

### Fig 1.

Figure 1.

Architecture of the proposed approach.

The International Journal of Fuzzy Logic and Intelligent Systems 2022; 22: 233-244https://doi.org/10.5391/IJFIS.2022.22.3.233

### Fig 2.

Figure 2.

Experimental results based on accuracy: (a) Ling-Spam dataset, (b) Enron4 dataset.

The International Journal of Fuzzy Logic and Intelligent Systems 2022; 22: 233-244https://doi.org/10.5391/IJFIS.2022.22.3.233

### Fig 3.

Figure 3.

Evaluation results based on different measures: (a) Ling-Spam dataset, (b) Enron4 dataset.

The International Journal of Fuzzy Logic and Intelligent Systems 2022; 22: 233-244https://doi.org/10.5391/IJFIS.2022.22.3.233

### Fig 4.

Figure 4.

Size of the model: (a) number of rules, (b) number of features.

The International Journal of Fuzzy Logic and Intelligent Systems 2022; 22: 233-244https://doi.org/10.5391/IJFIS.2022.22.3.233

Algorithm 1. UCS..

 Input:T = {(x1, y1), (x2, y2), …, (xm, ym)}, a set of training data. Output:P = {(c1, a1), …, (cr, ar)}, a set of rules. 1: InitialiseP ← ∅︀ 2: for alli = 1, …, maxIterationdo 3: M ← ∅︀, C ← ∅︀ 4: j ← i modm 5: ifP = ∅︀ then 6: Rc ← covering(tj) 7: P ← P ∪ {Rc} 8: else 9: for allp ∈ P s.t. p = (c, a) do 10: ifmatch(x, c) then 11: pexperience ← pexperience + 1 12: M ← M ∪ {p} 13: end if 14: end for 15: end if 16: if ∄p ∈ M s.t. p = (c, a) and a = ythen 17: Rc ← covering(tj) 18: P ← P ∪ {Rc} 19: end if 20: for allp ∈ P s.t. p = (c, a) do 21: ifa = ythen 22: pcorrect ← pcorrect + 1 23: C ← C ∪ {p} 24: end if 25: end for 26: for allp ∈ Cdo 27: pfitness←pcorrectpexperience 28: end for 29: G ← runGA(C) 30: runSubsumption(G) 31: while |P| > maxPopSizedo 32: Find the worst rule, pw 33: ifpnumerosityw>1then 34: pnumerosityw←pnumerosityw-1 35: else 36: Delete pw from P 37: end if 38: end while 39: end for

Algorithm 2. QRC..

 Input:T = {x1, x2, …, xm}, a set of training data; and P = {p1, …, pr} is a set of rules. Output:Pc = {p1, …, pf }, a set of compact rules where f ≤ r. 1: Sort P in a descending order of fitness. 2: Let Tnew ← ∅︀ 3: whileT ≠ ∅︀ do 4: MatchCount ← 0 5: for allx ∈ Tdo 6: ifmatch(x, p1) then 7: MatchCount ← MatchCount + 1 8: else 9: Tnew ← Tnew ∪ {x} 10: end if 11: end for 12: ifMatchCount > 0 then 13: Pc ← Pc ∪ {p1} 14: end if 15: T ← Tnew 16: P ← P \ {p1} 17: end while

Table 1. Some examples of rules produced by the UCS.

No.ConditionAction
1# # # 1 # # # 0 # # 0 #0
2# # 0 # 0 # # 1 # # 1 #1

Table 2. Results of rule compaction.

IterationsLing-SpamEnron4

beforeafterbeforeafter
5000019791277737094

10000019738155676682

15000019737104567480

2000001950383429474

Table 3. Evaluation of the Enron4 dataset.

Learning algorithmPrecisionRecallF-measureAccuracy
Bayes net0.8800.8430.8380.8426
Naïve Bayes0.8930.8680.8660.868
Logistic function0.9510.9510.9510.951
SVM0.9390.9330.9330.933
UCS0.9970.99760.99730.9973

Bold font indicates best results..

### References

1. Saleh, AJ, Karim, A, Shanmugam, B, Azam, S, Kannoorpatti, K, Jonkman, M, and Boer, FD (2019). An intelligent spam detection model based on artificial immune system. Information. 10. article no 209
2. Cormack, GV (2008). Email spam filtering: a systematic review. Foundations and Trends in Information Retrieval. 1, 335-455. https://doi.org/10.1561/1500000006
3. Bhowmick, A, and Hazarika, SM. (2016) . E-mail spam filtering: a review of techniques and trends. Available: https://arxiv.org/abs/1606.01042
4. Ning, B, Junwei, W, and Feng, H (2019). Spam message classification based on the Naïve Bayes classification algorithm. IAENG International Journal of Computer Science. 46, 46-53.
5. Karim, A, Azam, S, Shanmugam, B, Kannoorpatti, K, and Alazab, M (2019). A comprehensive survey for intelligent spam email detection. IEEE Access. 7, 168261-168295. https://doi.org/10.1109/ACCESS.2019.2954791
6. Shaukat, K, Luo, S, Varadharajan, V, Hameed, IA, and Xu, M (2020). A survey on machine learning techniques for cyber security in the last decade. IEEE Access. 8, 222310-222354. https://doi.org/10.1109/ACCESS.2020.3041951
7. Dada, EG, Bassi, JS, Chiroma, H, Adetunmbi, AO, and Ajibuwa, OE (2019). Machine learning for email spam filtering: review, approaches and open research problems. Heliyon. 5. article no. e01802
8. Ye, X, Hong, SS, and Han, MM (2020). Feature engineering method using double-layer hidden Markov model for insider threat detection. International Journal of Fuzzy Logic and Intelligent Systems. 20, 17-25. https://doi.org/10.5391/IJFIS.2020.20.1.17
9. Wicaksana, K, and Cahyani, DE (2021). Modification of a density-based spatial clustering algorithm for applications with noise for data reduction in intrusion detection systems. International Journal of Fuzzy Logic and Intelligent Systems. 21, 189-203. https://doi.org/10.5391/IJFIS.2021.21.2.189
10. Sigaud, O, and Wilson, SW (2007). Learning classifier systems: a survey. Soft Computing. 11, 1065-1078. https://doi.org/10.1007/s00500-007-0164-0
11. Arif, MH, Li, J, Iqbal, M, and Liu, K (2018). Sentiment analysis and spam detection in short informal text using learning classifier systems. Soft Computing. 22, 7281-7291. https://doi.org/10.1007/s00500-017-2729-x
12. Butz, MV, and Wilson, SW (2002). An algorithmic description of XCS. Soft Computing. 6, 144-153. https://doi.org/10.1007/s005000100111
13. Roozegar, M, Mahjoob, M, Esfandyari, M, and Panahi, MS (2016). XCS-based reinforcement learning algorithm for motion planning of a spherical mobile robot. Applied Intelligence. 45, 736-746. https://doi.org/10.1007/s10489-016-0788-9
14. Bernado-Mansilla, E, and Garrell-Guiu, JM (2003). Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary computation. 11, 209-238. https://doi.org/10.1162/106365603322365289
15. Orriols-Puig, A, and Bernado-Mansilla, E. (2006) . UCS Revisiting description, fitness sharing, and comparison with XCS. Learning Classifier Systems, 96-116. https://doi.org/10.1007/978-3-540-88138-4_6
16. Urbanowicz, RJ, and Browne, WN (2017). Introduction to Learning Classifier Systems. Heidelberg, Germany: Springer
17. Wilson, SW (2001). Compact rulesets from XCSI. Advanced in Learning Classifier Systems. Heidelberg, Germany: Springer, pp. 197-208 https://doi.org/10.1007/3-540-48104-4_12
18. Tan, J, Moore, J, and Urbanowicz, R (2013). Rapid rule compaction strategies for global knowledge discovery in a supervised learning classifier system. ECAL 2013: The Twelfth European Conference on Artificial Life. Cambridge, MA: MIT Press, pp. 110-117 https://doi.org/10.7551/978-0-262-31709-2-ch017
19. Kakade, AG, Kharat, PK, Gupta, AK, and Batra, T . Spam filtering techniques and MapReduce with SVM: a study., Proceedings of 2014 Asia-Pacific Conference on Computer Aided System Engineering (APCASE), 2014, South Kuta, Indonesia, Array, pp.59-64. https://doi.org/10.1109/APCASE.2014.6924472
20. Sharma, P, and Bhardwaj, U (2018). Machine learning based spam e-mail detection. International Journal of Intelligent Engineering and Systems. 11, 1-10.
21. Renuka, DK, Visalakshi, P, and Sankar, T (2015). Improving e-mail spam classification using ant colony optimization algorithm. International Journal of Computer Applications. 2, 22-26.
22. Peng, W, Huang, L, Jia, J, and Ingram, E . Enhancing the naive bayes spam filter through intelligent text modification detection., Proceedings of 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), 2018, New York, NY, Array, pp.849-854. https://doi.org/10.1109/TrustCom/BigDataSE.2018.00122
23. Singh, M, Pamula, R, and Shekhar, SK . Email spam classification by support vector machine., Proceedings of 2018 International Conference on Computing, Power and Communication Technologies (GUCON), 2018, Greater Noida, India, Array, pp.878-882. https://doi.org/10.1109/GUCON.2018.8674973
24. Al-Ajeli, A, Alubady, R, and Al-Shamery, ES (2020). Improving spam email detection using hybrid feature selection and sequential minimal optimization. Indonesian Journal of Electrical Engineering and Computer Science. 19, 535-542. https://doi.org/10.11591/ijeecs.v19.i1.pp535-542
25. Sheikhalishahi, M, Saracino, A, Mejri, M, Tawbi, N, and Martinelli, F (2015). Fast and effective clustering of spam emails based on structural similarity. Foundations and Practice of Security. Cham, Switzerland: Springer, pp. 195-211 https://doi.org/10.1007/978-3-319-30303-1_12
26. Sheikhalishahi, M, Saracino, A, Martinelli, F, La Mejri, Marra A, and Tawbi, N (2020). Digital waste disposal: an automated framework for analysis of spam emails. International Journal of Information Security. 19, 499-522. https://doi.org/10.1007/s10207-019-00470-x
27. Barushka, A, and Hajek, P (2018). Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Applied Intelligence. 48, 3538-3556. https://doi.org/10.1007/s10489-018-1161-y
28. Jain, G, Sharma, M, and Agarwal, B (2019). Spam detection in social media using convolutional and long short term memory neural network. Annals of Mathematics and Artificial Intelligence. 85, 21-44. https://doi.org/10.1007/s10472-018-9612-z
29. Yaseen, Q (2021). Spam email detection using deep learning techniques. Procedia Computer Science. 184, 853-858. https://doi.org/10.1016/j.procs.2021.03.107
30. Srinivasan, S, Ravi, V, Alazab, M, Ketha, S, Al-Zoubi, AM, and Kotti Padannayil, S (2021). Spam emails detection based on distributed word embedding with deep learning. Machine Intelligence and Big Data Analytics for Cyber-security Applications. Cham, Switzerland: Springer, pp. 161-189 https://doi.org/10.1007/978-3-030-57024-8_7
31. Orriols-Puig, A, and Bernado-Mansilla, E (2003). The class imbalance problem in UCS classifier system: a preliminary study. Learning Classifier Systems. Heidelberg, Germany: Springer, pp. 161-180 https://doi.org/10.1007/978-3-540-71231-2_12