International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(2): 117-129
Published online June 25, 2023
https://doi.org/10.5391/IJFIS.2023.23.2.117
© The Korean Institute of Intelligent Systems
Peddarapu Rama Krishna and Pothuraju Rajarajeswari
Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India
Correspondence to: Peddarapu Rama Krishna (peddarapuramakrishna@gmail.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Bioinformatics has emerged as a promising field with innovative applications in various biological domains. Microarray data analysis has become the preferred technology for estimating gene expression and diagnosing diseases. However, processing and computing the vast amount of information contained in microarray data pose significant challenges. This study focuses on developing an effective feature selection model for disease diagnosis using microarray datasets. The proposed approach, mutual fuzzy swarm optimization (MFSO), uses mutual information (MI) values to compute features in a microarray gene dataset. The computed MI values are then applied to a fuzzy expert system (FES) for sample classification. To improve classification accuracy, fuzzy logic is iteratively estimated for unclassified instances in the microarray sample. The feature selection model employs a particle swarm optimization model to extract microarray sample features based on the derived MI. The swarm optimization movement is guided by an objective function. The expert system integrates fuzzy logic with human expert knowledge to perform medical diagnoses, with a specific focus on the selected features. The proposed MFSO model utilizes if-then rules to estimate the membership values of the microarray dataset features. Through simulation analysis, the performance of the proposed MFSO model is evaluated using sensitivity, specificity, and ROC values for five different microarray datasets. The results demonstrate improved performance compared to existing methods. In conclusion, this study presents a novel approach for effective feature selection in a microarray dataset for disease diagnosis. The proposed MFSO model integrates MI computation, fuzzy logic, and expert knowledge to achieve improved classification accuracy. The simulation analysis validates the effectiveness of the proposed model, highlighting its superior performance compared to existing methods.
Keywords: Microarray, Fuzzy expert system (FES), Particle swarm optimization, Bioinformatics, Feature selection
Bioinformatics is an interdisciplinary field that encompasses statistics, biology, computer science, and management. It involves modeling and analyzing data related to biological sequences, genome content, and arrangement to predict macromolecule structures. With the advancement of genomic data, bioinformatics has gained prominence in biological and medical research, leading to a growing number of published accounts over the years. Bioinformatics combines information science and computational biology, which exhibit interrelated characteristics. It is an established discipline that draws on computing, engineering, statistics, and mathematics to support information management in healthcare.
Microarray data are widely regarded as effective tools for examining DNA expression in large-scale gene sequences. Microarray technology facilitates the interaction between gene expression and biology, enabling genome-wide analysis. It involves the parallel processing of gene profiles through a hybridization model. The evaluation of DNA expression at the genome level allows for the potential application of microarrays in the biomedical community. Researchers have reported the use of microarray data in pharmaceutical drug development for selection, assessment, and quality control, as well as in disease diagnosis and monitoring to identify adverse effects of therapeutic interventions. Tumors can be classified using microarray technology based on molecular characteristics, providing an easier alternative to traditional diagnostic schemes.
Microarray techniques have been utilized in the study of carcinogenesis, drug discovery, and the subclassification of cancer cells. Accurate diagnosis of cancer has been demonstrated through the effective classification of microarray data. Gene expression profiles of cancer tissues have been used to assess cancer status using gene expression classifier profile databases. Microarray data are typically high-dimensional, which poses challenges in terms of noise measurement and data analysis due to differences in sample quantities. Moreover, the imbalance in label class samples further complicates the analysis. Reducing the dimensionality of microarray data using computer-based systems can mitigate these challenges. However, the analysis of high-dimensional, small-sample, and skewed data incurs higher computation costs and can result in poor learning and classification processes. Therefore, microarray data can significantly impact disease identification and estimation. The microarray classification process involves two steps: gene selection and the construction of a classification model to generate accurate predictions from the data.
Bioinformatics has emerged as a promising and innovative field with applications in various biological domains. In particular, the analysis of microarray data has gained significant attention in disease diagnosis based on gene expression patterns. However, processing and computing large volumes of information contained in microarray data present substantial challenges. This study aims to address the need for an effective feature selection model for disease diagnosis using microarray datasets, thereby contributing to the field. The specific objectives of this study include constructing a feature selection model and evaluating its performance in accurately diagnosing diseases. The research gap lies in the development of a feature selection model capable of effectively handling the challenges associated with microarray data. While previous studies have explored various feature selection methods, there is a need for an approach that can optimize the selection process to improve disease diagnosis accuracy using microarray datasets. This study aims to fill this gap by proposing a novel approach, the mutual fuzzy swarm optimization (MFSO) model, and evaluating its performance against existing methods. By utilizing mutual information (MI) computation, fuzzy logic, and expert knowledge, the MFSO model aims to enhance classification accuracy and improve disease diagnosis performance. A simulation analysis using multiple microarray datasets is conducted to compare the proposed model with existing methods and demonstrate its superiority.
This study aims to address the research gap in effective feature selection for disease diagnosis using microarray datasets. The proposed MFSO model introduces novel approaches and techniques to improve accuracy and performance. Evaluating the model against existing methods will provide evidence of its effectiveness and contribute to the advancement of bioinformatics for disease diagnosis.
The contributions of this study are as follows:
• Development of an effective feature selection model: This study proposes the MFSO model as a novel approach for feature selection in disease diagnosis using microarray datasets. The MFSO model integrates MI computation, fuzzy logic, and expert knowledge to optimize the selection process and improve the classification accuracy. By developing this model, this study contributes to the advancement of feature selection techniques specifically tailored for microarray data analysis.
• Improved accuracy in disease diagnosis: The proposed MFSO model aims to enhance the accuracy of disease diagnosis using microarray datasets. Through the integration of MI, fuzzy logic, and expert knowledge, the model provides an effective approach for identifying relevant features and classifying samples. Achieving higher accuracy rates improves the reliability and effectiveness of disease diagnosis using microarray data.
• Comparison with existing methods: This study conducts a thorough evaluation of the proposed MFSO model by comparing its performance with that of existing feature selection methods. By benchmarking the MFSO model against established approaches, this study provides evidence of its superiority and demonstrates its potential for practical applications in bioinformatics. The comparison identifies the strengths and advantages of the proposed model over existing methods, contributing to the existing body of knowledge.
• Addressing the research gap: The study focuses on the need for an effective feature selection model for disease diagnosis using microarray datasets, addressing a specific research gap in bioinformatics. By proposing a novel approach, this study fills an important gap in the literature and offers a valuable contribution to bioinformatics research.
Overall, the study contributes to the field of bioinformatics by introducing a novel feature selection model, improving the accuracy of disease diagnosis, providing a comparative analysis with existing methods, and addressing research gaps. These contributions have implications for both the theoretical understanding and practical application of bioinformatics in disease diagnosis.
Microarray technology is used to compute the genome probe, generating data points in a high-dimensional space based on the genome size. However, microarray data classification often involves large sample sizes.
In [12], an optimization model was presented for feature selection and classification of medical data. The developed model utilizes the centroid mutation-based Search and Rescue (cmSAR) optimization algorithm, which incorporates a k-nearest neighbor (kNN) classifier for disease classification.
The cmSAR model selects features from an optimal group by estimating two classes: convergence and local searches. It employs a fuzzy logic system that considers multivalue fuzzy set logic for the mutation operations. Statistical results demonstrated the effectiveness of the metaheuristics-based optimization model for classification. The disease dataset used in the study comprised 15 disease data features extracted from the UCI dataset. Additionally, the cmSAR model uses the CEC-C06 2019 objective function and performs classification based on the Friedman and Bonferroni–Dunn tests. The examination results confirmed that this cmSAR model achieves improved performance for medical datasets.
In another study [13], phonocardiogram (PCG) signal analysis was conducted by applying the decomposition of the intrinsic mode function (IMF) Hilbert-Huang transform. The analysis employed the mel-frequency cepstral coefficient for feature extraction and the Hilbert-Huang transform for signal classification. Classifiers such as kNN, multilayer perceptron, support vector machine (SVM), and deep neural network (DNN) were employed for classification. This PCG model uses multiple classification classes, including healthy, aortic stenosis, mitral stenosis, mitral regurgitation, and mitral valve prolapse, and employs a five-fold cross-validation approach. The DNN model demonstrated higher precision, recall, F1-score, and accuracy values of 98.9%, 98.7%, 98.8%, and 98.9%, respectively.
In [14], a four-step hybrid ensemble feature selection algorithm was examined using cross-validation. The deployed model utilizes a filter method with an ensemble-weighted score based on ranking features using a sequential forward selection model to create a wrapper-based optimal feature subset. The optimal subset features were sampled to perform the classification. An experimental analysis was performed on 20 benchmark datasets based on the dimensionalities. The hybrid approach exhibited a higher dataset dimensionality. The proposed hybrid approach combines the feature selection algorithm with classifiers such as naive Bayes, radial basis function, SVM, kNN, and random forest to perform classification. The experimental results demonstrated that the proposed methodology exhibits higher accuracy, specificity, sensitivity, AUC, and F1-score for the selected features. The results were competitive with state-of-the-art methods.
In [15], a framework for disease diagnosis and classification was constructed. The developed GENEmops model performs gene selection using a multiobjective player selection strategy-based hybrid population search (MOPS-HPS). The model uses a multifiltering adaptive parameter to perform gene selection. Using a wrapper-based scheme, a rotational blending operator is employed to transmute the binary search space through stochastic search phase estimation to perform classification. The effectiveness of the developed GENEmops model was evaluated on 16 biological datasets and its performance was compared to that of eight binary and eight multi-class models. GENEmops exhibited improved intelligent computational multiclass features with enhanced efficiency.
Neighborhood estimation based on the Euclidean distance was performed to evaluate minority classes. High-dimensional microarray features were evaluated based on the Euclidean distance settings, assuming equal weights for the different features. An effective SMOTE-weighted Minkowski distance computation model was developed for estimating samples from the minority class [16].
The model features were evaluated in a relevant classification task, performing attributes selection with a threshold value of weights. To assess the predictive performance of the model, an experimental analysis was conducted on the 42-imbalance classes. The analysis aimed to compare the performance with the traditional SMOTE approach in both low- and high-dimensional settings, while ensuring that complexity was not affected.
A binary shuffled frog leaping algorithm (BSFLA) metaheuristic model was implemented in [17] for gene selection. The gene subset was computed based on optimal features, considering 20 subsets of genes extracted from the original dataset. The gene subset was evaluated using the kNN classifier model and compared with an artificial neural network (ANN) and SVM. The comparative analysis showed improved performance through particle swarm optimization.
In [18], a relief attribute evaluation was deployed with the light gradient boosting machine (GBM) and DNN model for classifying non-cancer and cancer cells using 20 machine-learning techniques. The analysis was performed on 111 patients with 22,278 features using Python and Weka software. The evaluation metrics used were accuracy, precision, recall, and AUC, utilizing decision trees, naive Bayes, AdaBoost, quantitative descriptive analysis, and DNN models. The classification network demonstrated exceptional performance by achieving 100% accuracy on the microarray dataset, effectively distinguishing between non-cancerous and cancerous cells.
A framework for feature selection was developed in [11] to increase feature discrimination and stability. Features were selected using aggregation strategies within the sampled dataset. These strategies were combined with a multiclass model to achieve classification accuracy. The stability features were examined using binary and multiclass metrics. The simulation results indicated that this method has a higher stability score compared to other classifiers.
In [19], a feature-level ensemble model was constructed for microarray classification using feature subspace partitioning (referred to as fuzzy logic and adaptive evolutionary optimizer-based feature selection for optimization of fuzzy inference systems). The ensemble classification model consists of three steps: generating subsets through partitioning, feature selection, and aggregation of the ranking function. The developed framework model uses a microarray dataset, considering parameters such as accuracy, stability, and runtime.
The proposed methodology introduces the MFSO model to estimate MI for optimizing variables in microarray datasets. The MFSO model computes the optimization of microarray variables to extract features from the dataset.
(1) Fuzzy logic: Fuzzy logic is a mathematical framework that addresses uncertainty and imprecision by allowing the representation and manipulation of vague or fuzzy concepts. It extends classical binary logic by introducing the concept of partial truth, where membership functions (MFs) quantify the degree to which an element belongs to a particular set. In this study, fuzzy logic is utilized to estimate the membership values of features in the microarray dataset using if-then rules.
The proposed MFSO model incorporates fuzzy logic by iteratively estimating the membership values for unclassified instances in the microarray sample. By utilizing if-then rules and membership estimation, the model integrates human expert knowledge and fuzzy reasoning to improve classification accuracy and enhance disease diagnosis performance.
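The idea of partial membership can be made concrete with a small sketch. The triangular shape, the three linguistic labels, and the breakpoints below are illustrative assumptions, not the membership functions used in the article; the expression value is assumed to be scaled to [0, 1].

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic labels for a gene expression value scaled to [0, 1].
# Each tuple is (left foot, peak, right foot) of a triangular MF.
LABELS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def fuzzify(x):
    """Return the degree of membership of x in every linguistic label."""
    return {name: triangular(x, *abc) for name, abc in LABELS.items()}


if __name__ == "__main__":
    print(fuzzify(0.25))
```

A value of 0.25 belongs partly to "low" and partly to "medium", which is the kind of graded membership an if-then rule can then reason over.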
(2) Particle swarm optimization (PSO): PSO is a metaheuristic optimization algorithm inspired by the behaviors of bird flocks and fish schools. It is used to solve optimization problems by simulating the social behavior of particles in a multi-dimensional search space. Each particle represents a potential solution and its movements are influenced by its own best position (pbest) and the global best position (gbest) found by the swarm.
In the proposed MFSO model, PSO is employed to extract microarray sample features based on the derived MI. The swarm optimization movement is guided by an objective function that aims to optimize the feature selection process. By leveraging the PSO, the model explores the search space and identifies the most relevant and informative features for disease diagnosis in microarray datasets.
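The pbest/gbest mechanism described above can be sketched with a textbook PSO loop. This is a generic minimizer under assumed settings (inertia `w`, acceleration constants `c1` and `c2`, swarm size, a toy sphere objective); it is not the MFSO objective function or the article's parameter choice.

```python
import random


def pso(objective, dim, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # global best so far

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward pbest + pull toward gbest.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    best, value = pso(sphere, dim=3)
    print(best, value)
```

For feature selection, the objective would instead score a candidate feature subset; the update rule stays the same.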
Fuzzy logic and PSO are selected for this study because of their ability to handle uncertainty, optimize feature selection, and integrate expert knowledge. By incorporating these methodologies into the proposed MFSO model, this study aims to leverage their strengths and benefits to enhance the accuracy and effectiveness of disease diagnosis using microarray datasets.
MI is a measure that captures the mutual dependence between two variables. The proposed MFSO model consists of a stochastic system with a variable input X and an output Y. The data are represented as a discrete variable X with possible values denoted as
Here, the probability value of
In
The information is estimated by computing the average uncertainty value considering the random variable
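For reference, the quantities named in this section have standard information-theoretic definitions. The block below states them in an assumed notation ($p(x_i)$ for the marginal probability of the $i$-th value of $X$, $p(x_i, y_j)$ for the joint probability, base-2 logarithms); it is a restatement of the textbook forms, and the article's own equations and symbols may differ.

```latex
% Entropy of the discrete variable X (average uncertainty)
H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)

% Conditional entropy of the output Y given the input X
H(Y \mid X) = -\sum_{i} \sum_{j} p(x_i, y_j) \log_2 p(y_j \mid x_i)

% Mutual information as the reduction in uncertainty about Y after observing X
I(X; Y) = H(Y) - H(Y \mid X)
        = \sum_{i} \sum_{j} p(x_i, y_j) \log_2 \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}
```

MI is zero when $X$ and $Y$ are independent and reaches $H(Y)$ when $X$ fully determines $Y$.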
The steps of the MFSO are as follows:
1. Compute the gene expression sample ′
2. Evaluate each input gene ′
a. Estimate the initial entropy value using
b. Compute the conditional entropy using
c. Evaluate the MI between the gene expressions using
3. Based on the estimated MI values, store and sort the genes.
4. For every gene 1, 2, ...,
a. Partition the sample population ′
b. Use the optimization model to measure the fuzzy set values for training and testing with variables
c. Classify the sample based on testing sample features
For every unclassified sample
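Steps 1 to 3 above (entropy, conditional entropy, MI, then sorting the genes) can be sketched as follows. The median-based binarization of expression values, the function names, and the toy data are assumptions made for illustration; the article does not state how expression values are discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(x, y):
    """H(Y | X): entropy of y within each group of equal x, weighted by group size."""
    n = len(x)
    h = 0.0
    for value, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        h += (count / n) * entropy(subset)
    return h


def mutual_information(x, y):
    """I(X; Y) = H(Y) - H(Y | X)."""
    return entropy(y) - conditional_entropy(x, y)


def discretize(values):
    """Assumed discretization: 1 if the value is at or above the median, else 0."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    return [int(v >= median) for v in values]


def rank_genes(expression, labels):
    """Score every gene by MI with the class labels and sort in descending order."""
    scores = {gene: mutual_information(discretize(vals), labels)
              for gene, vals in expression.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data: two genes measured on four samples.
    expression = {"g1": [0.1, 0.2, 0.9, 0.8], "g2": [0.5, 0.1, 0.4, 0.2]}
    labels = [0, 0, 1, 1]
    print(rank_genes(expression, labels))
```

In the toy data, `g1` separates the two classes perfectly and ranks first, while `g2` carries no information about the label.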
The microarray data for the gene expression elements are computed based on the fuzzy expert system (FES). With the FES, the MFs are aggregated using the if-then rules to derive decisions regarding the data. The prediction class is indicated using discrete sample labels with nonlinear mapping of the input and output spaces. The FES microarray data architecture is illustrated in Figure 2.
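A minimal sketch of how if-then rules can aggregate membership values into a class decision is given below. It assumes min for the AND of rule premises and max for combining rules that point to the same class, which is one common convention; the rule base, gene names, and membership values are hypothetical and not taken from the article.

```python
def rule_strength(memberships, antecedent):
    """Firing strength of one rule: AND of its clauses, taken as the minimum."""
    return min(memberships[gene][label] for gene, label in antecedent)


def classify(memberships, rules):
    """Aggregate rule strengths per class (max) and return the winning class.

    Returns (None, scores) when no rule fires, i.e. the sample stays unclassified.
    """
    scores = {}
    for antecedent, cls in rules:
        strength = rule_strength(memberships, antecedent)
        scores[cls] = max(scores.get(cls, 0.0), strength)
    if not scores or max(scores.values()) == 0.0:
        return None, scores
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical rule base: IF g1 is high AND g2 is low THEN disease, etc.
    rules = [
        ((("g1", "high"), ("g2", "low")), "disease"),
        ((("g1", "low"),), "healthy"),
    ]
    sample = {"g1": {"low": 0.2, "high": 0.8}, "g2": {"low": 0.6, "high": 0.4}}
    print(classify(sample, rules))
```

Samples for which every rule has zero strength come back as `None`, which mirrors the unclassified instances that the model revisits iteratively.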
The FES model consists of the constructed if-then linguistic general form “IF A THEN B.” The proposition of variables A and B is estimated with the rule set B for premises and consequences. The uncertainty between the variables is evaluated based on the linguistic variable and fuzzy set value if-then. The FL model is crucial for accumulating and summarizing data to derive decision-relevant information. The universal fuzzy set
In
Here, the gene input value is represented by
The relationships between the variables are computed using the Cartesian product universal set model. The universal set X is evaluated with the fuzzy set A with ordered pairs based on
In this case, the input variable x in A is denoted as
The MFSO PSO model uses food sources for the phases through a random search model to derive the decision-making process for the selection of food sources. The PSO model used to select features from the gene data is illustrated in Figure 3.
The MFSO model computes the optimal rule-set model with PSO. The estimation of the variable is based on the optimal rule set with combinational optimization features in the genes. Particle swarm features are computed based on the tuned MF for the allocated points uniformly in the gene expression range. Tuning with the membership model is evaluated using continuous optimization features in the dataset. The optimal variables eliminate the overall relationship between gene features, as presented in Figure 4.
The PSO model initializes the parameter values by computing the particles in parts that are within the permissible range. A particle swarm within a permissible range is evaluated and effectively processed using the constructed path. The path explored in the optimization model features is evaluated based on a combination of discrete values to derive the decision values based on the objective function. The objective function features are estimated using the objective function model. If the process is yes, the optimal path is selected, and the variables are computed with an updated pheromone and a permissible range set using the iteration process. Each iteration value is evaluated for a random particle swarm to derive the optimum solution.
The particle swarm between variables is used to compute the shortest optimal path for the computation of the edges to estimate the optimal path. To minimize the interaction between the swarms in the region, the movement values are based on global and local functions to derive the optimal path in the region. The flow of the PSO model is presented in Figure 5.
The MFSO model is computed based on the initial variable matrix to obtain the optimal solution for gene features. The gene features considered in this model are blood cells, pancreatic islets, liver cells, and neurons. The initial random solution space primary matrix is computed using
In
The PSO model demonstrated enhanced estimation of the swarm path movement through the utilization of a self-estimation model derived using
MFSO is performed using the iteration value denoted as
Based on the PSO algorithm, a new swarm movement is computed using
The swarm path
In
Therefore, the maximal fitness function
The feature computation is performed using the greedy selection model, considering both old and new fitness values. The relative features of the variables are evaluated with global and local feature estimations using a fuzzy set model. Through feature estimation, the optimal cost for the features is determined to improve processing speed. The termination condition is set to achieve the maximum number of iterations, allowing the evaluation of the output and derivation of the best solution features through the iteration process. The pseudocode representation of the proposed MFSO model is given in Algorithm 1.
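The greedy selection and the iteration-count termination described above can be sketched as a small loop. The `propose` and `fitness` callables, and the toy problem in the usage example, are placeholders introduced for illustration; this is not the article's Algorithm 1.

```python
def greedy_select(current, candidate, fitness):
    """Keep the candidate only if its fitness is strictly higher than the current one."""
    return candidate if fitness(candidate) > fitness(current) else current


def optimize(initial, propose, fitness, max_iter):
    """Run greedy selection until the maximum number of iterations is reached."""
    best = initial
    for iteration in range(max_iter):
        candidate = propose(best, iteration)   # new solution derived from the old one
        best = greedy_select(best, candidate, fitness)
    return best


if __name__ == "__main__":
    # Toy problem: maximize -(x - 3)^2 by proposing x + 1 at every iteration.
    result = optimize(
        initial=0,
        propose=lambda x, it: x + 1,
        fitness=lambda x: -(x - 3) ** 2,
        max_iter=10,
    )
    print(result)
```

Because a worse candidate is never accepted, the best fitness is non-decreasing across iterations, and the loop ends after the fixed iteration budget.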
The MFSO model utilizes a fuzzy-set model to classify a microarray dataset consisting of five different datasets: type 2 diabetes (T2D), colon cancer, prostate cancer, insulin sensitive resistance (ISR), and leukemia. The dataset comprises a two-class gene expression profile that is publicly accessible. Table 1 provides an overview of the microarray dataset used for feature selection.
The results of the proposed MFSO model are obtained through 30 independent trials to compute unique values and estimate the control parameters and convergence values. The PSO model feature is estimated based on the optimization function that considers swarm movement, as shown in Table 2.
Table 11. Comparison of the mean number of membership functions for each dataset

Dataset | BCGA | PSO | GSA | HCA | MFSO
---|---|---|---|---|---
T2D | 0.11 | 0.9 | 2.3 | 2.13 | 4.08
ISR | 0.6 | 1.3 | 2.2 | 2.28 | 3.98
Colon | 0.1 | 1.1 | 2.1 | 2.23 | 3.45
Leukemia | 0.2 | 1.3 | 2.1 | 2.46 | 3.06
Prostate | 0.4 | 1.2 | 2.3 | 2.69 | 4.67
The MFSO model utilizes the fuzzy set rule to define or generate a specific set of rules for eliminating irrelevant genes in deriving outcome measures. The convergence learning values for the proposed MFSO model are listed in Table 3.
The performance of the proposed MFSO model is evaluated using a conventional optimization model in the task of feature extraction. The optimization models considered for the analysis are the binary coded genetic algorithm (BCGA), genetic swarm optimization (GSA), real coded genetic algorithm (RCGA), and PSO. It is observed that the proposed MFSO model increased the fitness function value for an iteration value between 40 and 70. The results indicate that the optimal number of iterations for the microarray dataset is 140. The label features for the microarray dataset used to compute the average and worst fitness values increase significantly, as presented in Table 4.
The gene id (GID) generates the MF estimation, as presented in Table 4. The execution MF increases with the simultaneous estimation of the fine-tuned gene number and the MF through the mapped linguistic label in the self-evident variable’s computation features. The MFSO model uses a fuzzy set rule assessment scheme to interpret the feature set in a microarray dataset. The sample microarray dataset features are used to estimate the optimal rule set value based on the MFSO model, as indicated in Table 5.
Among the provided rule datasets, T2D Rule set 3 demonstrates the highest accuracy, which can be attributed to several factors.
• The value of Ncorrect: T2D Rule set 3 exhibits a relatively high number of correctly classified instances (Ncorrect = 13). This indicates the effectiveness of the rules in accurately predicting the target variable or classifying instances.
• Value of coverage: Although T2D Rule set 3 has a lower coverage value (coverage = 47.32), it focuses on specific patterns or conditions in the data, resulting in higher accuracy. By targeting these specific instances or patterns, this rule set achieves improved accuracy in classification.
• Rule set complexity: T2D Rule set 3 may have a higher complexity than other rule sets, either in terms of the number of rules or the complexity of individual rules. This increased complexity enables a more fine-grained and accurate classification of instances, leading to higher overall accuracy.
• Training data characteristics: The training data used to generate T2D Rule set 3 possesses distinct patterns or characteristics that are captured well by the rules in this particular rule set. These patterns are likely indicative of the target variable, contributing to a higher accuracy in prediction.
• Notably, the accuracy of a rule set can vary depending on the specific dataset and problem domain. The aforementioned factors collectively contribute to the higher accuracy observed in T2D Rule set 3 within the given dataset.
Additionally, the interpretability variables of the proposed MFSO model are computed for the estimated features. The MFSO model utilizes estimated microarray feature selection, which is comparatively presented in Table 6.
A comparative analysis is conducted to assess the convergence values of the proposed MFSO model in comparison to existing GA and HCA models. The results indicate that the proposed MFSO model significantly outperforms the conventional GA and HCA models. The generated fuzzy set model enabled the identification of the minimal number of genes required. The results indicate that the proposed MFSO model achieves higher interpretability than the GA and HCA models. The generalized computations of the dataset features across different datasets are presented in Table 7.
Furthermore, the comparative analysis of the generalization ability reveals that the MFSO model exhibits a minimal number of incorrectly classified samples for the dataset. Thus, the proposed MFSO model significantly enhances the performance of existing methods for disease diagnosis. Table 8 provides a comparison of the sensitivity and specificity of the MFSO model.
The comparative analysis of specificity and sensitivity values for different datasets using the proposed MFSO model demonstrates higher values compared to existing classification models (Figure 6). Additionally, the analysis of false-positive and false-negative values shows significantly lower values compared to existing classifier models. Table 9 presents the mean coverage values for the different datasets.
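Sensitivity and specificity follow directly from the counts of true and false positives and negatives. The sketch below uses the standard definitions; the label encoding (1 for the positive class) and the toy predictions are assumptions, not results from the article.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True-positive rate: share of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """True-negative rate: share of actual negatives that were correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


if __name__ == "__main__":
    # Hypothetical labels and predictions for eight samples.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    print(sensitivity(tp, fn), specificity(tn, fp))
```

Lower false-negative counts raise sensitivity and lower false-positive counts raise specificity, which is why the two pairs of measures move together in the comparison.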
The mean coverage values for different datasets indicate that the MFSO model achieves a higher mean coverage value. Specifically, for the T2D dataset, the mean coverage is estimated to be 35, which is substantially higher than that of the existing model. Table 10 provides a comparative estimation of variables with different features. The MFs estimated using the proposed MFSO model are listed in Table 11.
The results reveal that the proposed MFSO model yields more variables than the conventional classifier model. The feature selection estimation with the proposed MFSO model is effective, and the computation of MFs for the features yields higher values compared to the existing model.
Several related studies have proposed different feature-selection and classification models. For instance, Houssein et al. [12] presented an optimization model for medical data feature selection and classification using a centroid mutation model with an integrated rescue optimization algorithm. Arslan and Karhan [13] focused on categorizing PCG signal analysis using the decomposition of the IMF using the Hilbert-Huang transform and employed various classification techniques. In [14], the authors proposed a hybrid ensemble feature selection algorithm that combines the filter method with ensemble weighted scores and a wrapper-based optimal feature subset. Agarwalla and Mukhopadhyay [15] constructed a framework for disease diagnosis and classification using a hybrid population search based on a multi-objective player selection strategy. Maldonado et al. [16] performed neighborhood estimation based on the Euclidean distance to evaluate minority classes. Finally, Dash et al. [17] implemented the BSFLA metaheuristic algorithm for optimizing the feature selection process.
Bioinformatics provides microarray data for processing and detecting various diseases. This study introduces an MI-based feature selection model for microarray data. The proposed MFSO model uses a fuzzy-set model to determine the feature set using if-then rules. The MFSO model uses MF estimation with swarm-based objective function estimation. The proposed MFSO model exhibits improved performance compared to the existing model, with a classification accuracy of 99%.
No potential conflict of interest relevant to this article was reported.
Keywords: Microarray, Fuzzy expert system (FES), Particle swarm optimization, Bioinformatics, Feature selection
Bioinformatics is an interdisciplinary field that encompasses statistics, biology, computer science, and management. It involves modeling and analyzing data related to biological sequences, genome content, and arrangement to predict macromolecule structures. With the advancement of genomic data, bioinformatics has gained prominence in biological and medical research, and the number of reported studies has increased over the years. Bioinformatics combines information science and computational biology, which exhibit interrelated characteristics. It draws on computers and on the expertise of engineers, statisticians, and mathematicians to support information management in healthcare.
Microarray data are widely regarded as effective tools for examining DNA expression in large-scale gene sequences. Microarray technology facilitates the interaction between gene expression and biology, enabling genome-wide analysis. It involves the parallel processing of gene profiles through a hybridization model. The evaluation of eDNA expression at the genome level allows for the potential application of microarrays in the biomedical community. Researchers have reported the use of microarray data in pharmaceutical drug development for selection, assessment, and quality control, as well as in disease diagnosis and monitoring to derive adverse effects from therapeutic interventions. Tumors can be classified using microarray technology based on molecular characteristics, providing an easier alternative to traditional diagnostic schemes.
Microarray techniques have been utilized in the study of carcinogenesis, drug discovery, and the subclassification of cancer cells. Accurate diagnosis of cancer has been demonstrated through the effective classification of microarray data. Gene expression profiles of cancer tissues have been used to assess cancer status using gene expression classifier profile databases. Microarray gene expression data are typically high-dimensional, which poses challenges in terms of noise measurement and data analysis due to differences in sample quantities. Moreover, the imbalance in label class samples further complicates the analysis. Reducing the dimensionality of microarray data using computer-based systems can mitigate these challenges. However, the analysis of high-dimensional, low-sample-size, and skewed data incurs higher computation costs and can result in poor learning and classification processes. Therefore, the way microarray data are processed can significantly impact disease identification and estimation. The microarray classification process involves two steps: gene selection and the construction of a classification model to generate accurate predictions from the data.
Bioinformatics has emerged as a promising and innovative field with applications in various biological domains. In particular, the analysis of microarray data has gained significant attention in disease diagnosis based on gene expression patterns. However, processing and computing large volumes of information contained in microarray data present substantial challenges. This study aims to address the need for an effective feature selection model for disease diagnosis using microarray datasets. The specific objectives of this study include constructing a feature selection model and evaluating its performance in accurately diagnosing diseases. The research gap lies in the development of a feature selection model capable of effectively handling the challenges associated with microarray data. While previous studies have explored various feature selection methods, there is a need for an approach that can optimize the selection process to improve disease diagnosis accuracy using microarray datasets. This study aims to fill this gap by proposing a novel approach, the mutual fuzzy swarm optimization (MFSO) model, and evaluating its performance against existing methods. By utilizing mutual information (MI) computation, fuzzy logic, and expert knowledge, the MFSO model aims to enhance classification accuracy and improve disease diagnosis performance. A simulation analysis using multiple microarray datasets is conducted to compare the proposed model with existing methods.
This study aims to address the research gap in effective feature selection for disease diagnosis using microarray datasets. The proposed MFSO model introduces novel approaches and techniques to improve accuracy and performance. Evaluating the model against existing methods will provide evidence of its effectiveness and contribute to the advancement of bioinformatics for disease diagnosis.
The contributions of this study are as follows:
• Development of an effective feature selection model: This study proposes the MFSO model as a novel approach for feature selection in disease diagnosis using microarray datasets. The MFSO model integrates MI computation, fuzzy logic, and expert knowledge to optimize the selection process and improve the classification accuracy. By developing this model, this study contributes to the advancement of feature selection techniques specifically tailored for microarray data analysis.
• Improved accuracy in disease diagnosis: The proposed MFSO model aims to enhance the accuracy of disease diagnosis using microarray datasets. Through the integration of MI, fuzzy logic, and expert knowledge, the model provides an effective approach for identifying relevant features and classifying samples. Achieving higher accuracy rates improves the reliability and effectiveness of disease diagnosis using microarray data.
• Comparison with existing methods: This study conducts a thorough evaluation of the proposed MFSO model by comparing its performance with that of existing feature selection methods. By benchmarking the MFSO model against established approaches, this study provides evidence of its performance and demonstrates its potential for practical applications in bioinformatics. The comparison identifies the strengths and advantages of the proposed model over existing methods, contributing to the existing body of knowledge.
• Addressing the research gap: The study focuses on the need for an effective feature selection model for disease diagnosis using microarray datasets, addressing a specific research gap in bioinformatics. By proposing a novel approach, this study fills an important gap in the literature and offers a valuable contribution to bioinformatics research.
In summary, the study contributes to the field of bioinformatics by introducing a novel feature selection model, improving the accuracy of disease diagnosis, providing a comparative analysis with existing methods, and addressing research gaps. These contributions have implications for both the theoretical understanding and practical application of bioinformatics in disease diagnosis.
Microarray technology is used to compute the genome probe, generating data points in a high-dimensional space based on the genome size. However, microarray data classification often involves a very large number of genes relative to the number of available samples.
Houssein et al. [12] presented an optimization model for feature selection and classification of medical data. The developed model utilizes the centroid mutation-based Search and Rescue (cmSAR) optimization algorithm, which incorporates a k-nearest neighbor (kNN) classifier for disease classification.
The cmSAR model selects features from an optimal group by estimating two classes: convergence and local searches. It employs a fuzzy logic system that considers multivalue fuzzy set logic for the mutation operations. Statistical results demonstrated the effectiveness of the metaheuristics-based optimization model for classification. The disease dataset used in the study comprised 15 disease data features extracted from the UCI dataset. Additionally, the cmSAR model uses the CEC-C06 2019 objective function, and its classification results were assessed with the Friedman and Bonferroni–Dunn tests. The examination results confirmed that the cmSAR model achieves improved performance for medical datasets.
In another study [13], phonocardiogram (PCG) signal analysis was conducted by decomposing the signal into intrinsic mode functions (IMFs) with the Hilbert-Huang transform. The analysis employed the mel-frequency cepstral coefficient for feature extraction together with the Hilbert-Huang transform. Classifiers such as the kNN, multilayer perceptron, support vector machine (SVM), and deep neural network (DNN) were employed for classification. This PCG model uses multiple classes, including healthy, aortic stenosis, mitral stenosis, mitral regurgitation, and mitral valve prolapse, and employs a five-fold cross-validation approach. The DNN model demonstrated the highest precision, recall, F1-score, and accuracy values of 98.9%, 98.7%, 98.8%, and 98.9%, respectively.
In [14], a four-step hybrid ensemble feature selection algorithm was examined using cross-validation. The deployed model utilizes a filter method with an ensemble-weighted score to rank features, followed by a sequential forward selection model to create a wrapper-based optimal feature subset. The optimal subset features were sampled to perform the classification. An experimental analysis was performed on 20 benchmark datasets of varying dimensionality, and the hybrid approach was reported to handle datasets of higher dimensionality. The hybrid approach was combined with classifiers such as naive Bayes, radial basis function, SVM, kNN, and random forest to perform classification. The experimental results demonstrated that the methodology exhibits higher accuracy, specificity, sensitivity, AUC, and F1-score for the selected features, and that it is competitive with state-of-the-art methods.
In [15], a framework for disease diagnosis and classification was constructed. The developed GENEmops model performs gene selection using a multiobjective player selection strategy-based hybrid population search (MOPS-HPS). The model uses a multifiltering adaptive parameter to perform gene selection. In a wrapper-based scheme, a rotational blending operator is employed to transmute the binary search space through stochastic search phase estimation to perform classification. The effectiveness of the developed GENEmops model was evaluated on 16 biological datasets, eight binary and eight multi-class, and its performance was compared to that of existing models. GENEmops exhibited improved multiclass performance with enhanced efficiency.
Neighborhood estimation based on the Euclidean distance was performed to evaluate minority classes. High-dimensional microarray features were evaluated based on the Euclidean distance settings, assuming equal weights for the different features. An effective SMOTE-weighted Minkowski distance computation model was developed for estimating samples from the minority class [16].
The model features were evaluated in a relevant classification task, performing attribute selection with a threshold value on the weights. To assess the predictive performance of the model, an experimental analysis was conducted on 42 class-imbalanced datasets. The analysis aimed to compare the performance with the traditional SMOTE approach in both low- and high-dimensional settings, while ensuring that complexity was not affected.
A binary shuffled frog leaping algorithm (BSFLA) metaheuristic model was implemented in [17] for gene selection. The gene subset was computed based on optimal features, considering 20 subsets of genes extracted from the original dataset. The gene subset was evaluated using the kNN classifier model and compared with an artificial neural network (ANN) and SVM. The comparative analysis showed improved performance relative to particle swarm optimization.
In [18], a relief attribute evaluation was deployed with the light gradient boosting machine (GBM) and DNN model for classifying non-cancer and cancer cells using 20 machine-learning techniques. The analysis was performed on 111 patients with 22,278 features using Python and Weka software. The evaluation metrics used were accuracy, precision, recall, and AUC, utilizing decision trees, naive Bayes, AdaBoost, quadratic discriminant analysis, and DNN models. The classification network was reported to achieve 100% accuracy on the microarray dataset, distinguishing between non-cancerous and cancerous cells.
A framework for feature selection was developed in [11] to increase feature discrimination and stability. Features were selected using aggregation strategies within the sampled dataset. These strategies were combined with a multiclass model to achieve classification accuracy. The stability features were examined using binary and multiclass metrics. The simulation results indicated that this method has a higher stability score compared to other classifiers.
In [19], a feature-level ensemble model was constructed for microarray classification using feature subspace partitioning (referred to as fuzzy logic and adaptive evolutionary optimizer-based feature selection for optimization of fuzzy inference systems). The ensemble classification model consists of three steps: generating subsets through partitioning, feature selection, and aggregation of the ranking function. The developed framework model uses a microarray dataset, considering parameters such as accuracy, stability, and runtime.
The proposed methodology introduces the MFSO model to estimate MI for optimizing variables in microarray datasets. The MFSO model computes the optimization of microarray variables to extract features from the dataset.
(1) Fuzzy logic: Fuzzy logic is a mathematical framework that addresses uncertainty and imprecision by allowing the representation and manipulation of vague or fuzzy concepts. It extends classical binary logic by introducing the concept of partial truth, where membership functions (MFs) quantify the degree to which an element belongs to a particular set. In this study, fuzzy logic is utilized to estimate the membership values of features in the microarray dataset using if-then rules.
The proposed MFSO model incorporates fuzzy logic by iteratively estimating the membership values for unclassified instances in the microarray sample. By utilizing if-then rules and membership estimation, the model integrates human expert knowledge and fuzzy reasoning to improve classification accuracy and enhance disease diagnosis performance.
(2) Particle swarm optimization (PSO): PSO is a metaheuristic optimization algorithm inspired by the behaviors of bird flocks and fish schools. It is used to solve optimization problems by simulating the social behavior of particles in a multi-dimensional search space. Each particle represents a potential solution and its movements are influenced by its own best position (pbest) and the global best position (gbest) found by the swarm.
In the proposed MFSO model, PSO is employed to extract microarray sample features based on the derived MI. The swarm optimization movement is guided by an objective function that aims to optimize the feature selection process. By leveraging the PSO, the model explores the search space and identifies the most relevant and informative features for disease diagnosis in microarray datasets.
Fuzzy logic and PSO are selected for this study because of their ability to handle uncertainty, optimize feature selection, and integrate expert knowledge. By incorporating these methodologies into the proposed MFSO model, this study aims to leverage their strengths and benefits to enhance the accuracy and effectiveness of disease diagnosis using microarray datasets.
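For reference, the pbest and gbest mechanism described above is usually written as the canonical PSO update below. This is the standard textbook form with an inertia weight, shown under the assumption that the MFSO swarm follows it; the symbols $w$, $c_1$, $c_2$, $r_1$, and $r_2$ are notation chosen here and are not taken from the paper.

```latex
\begin{align}
v_i^{t+1} &= w\,v_i^{t} + c_1 r_1 \left(\mathrm{pbest}_i - x_i^{t}\right) + c_2 r_2 \left(\mathrm{gbest} - x_i^{t}\right), \\
x_i^{t+1} &= x_i^{t} + v_i^{t+1},
\end{align}
```

Here $x_i^{t}$ and $v_i^{t}$ are the position and velocity of particle $i$ at iteration $t$, $w$ is the inertia weight, $c_1$ and $c_2$ are acceleration coefficients, and $r_1, r_2$ are random numbers drawn uniformly from $[0, 1]$.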
MI is a measure that captures the mutual dependence between two variables. The proposed MFSO model consists of a stochastic system with a variable input X and an output Y. The data are represented as a discrete variable X with possible values denoted as
Here, the probability value of
In
The information is estimated by computing the average uncertainty value considering the random variable
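For reference, the quantities named in this subsection are conventionally defined as follows. This is a sketch that assumes the model uses the usual Shannon entropy over discrete variables; the indices $i$, $j$ and the counts $n$, $m$ are notation introduced here for illustration.

```latex
\begin{align}
H(Y) &= -\sum_{j=1}^{m} P(y_j)\,\log_2 P(y_j), \\
H(Y \mid X) &= -\sum_{i=1}^{n} P(x_i) \sum_{j=1}^{m} P(y_j \mid x_i)\,\log_2 P(y_j \mid x_i), \\
I(Y; X) &= H(Y) - H(Y \mid X).
\end{align}
```

Under these definitions, $I(Y; X)$ is zero when the gene expression $X$ carries no information about the class $Y$ and equals $H(Y)$ when $X$ determines $Y$ completely.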
The steps of the MFSO are as follows:
1. Compute the gene expression sample ′
2. Evaluate each input gene ′
a. Estimate the initial entropy value using
b. Compute the conditional entropy using
c. Evaluate the MI between the gene expressions using
3. Based on the estimated MI values, store and sort the genes.
4. For every gene 1, 2, ...,
a. Partition the sample population ′
b. Use the optimization model to measure the fuzzy set values for training and testing with variables
c. Classify the sample based on testing sample features
For every unclassified sample
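Steps 1 to 3 above (entropy, conditional entropy, MI, then sorting the genes) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the equal-width discretization, the number of bins, and all function and gene names are assumptions made here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(values, labels):
    """I(Y; X) = H(Y) - H(Y | X) for discrete values X and class labels Y."""
    n = len(values)
    h_cond = 0.0
    for v, count in Counter(values).items():
        subset = [y for x, y in zip(values, labels) if x == v]
        h_cond += (count / n) * entropy(subset)
    return entropy(labels) - h_cond


def discretize(expr, n_bins=3):
    """Equal-width binning of continuous expression values (an assumption)."""
    lo, hi = min(expr), max(expr)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constant genes
    return [min(int((e - lo) / width), n_bins - 1) for e in expr]


def rank_genes(expression, labels, n_bins=3):
    """Return gene names sorted by decreasing MI with the class labels."""
    scores = {
        gene: mutual_information(discretize(vals, n_bins), labels)
        for gene, vals in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: gene_a separates the classes, gene_b does not.
toy_expression = {
    "gene_a": [0.1, 0.2, 0.9, 1.0],
    "gene_b": [0.5, 0.4, 0.5, 0.4],
}
toy_labels = ["normal", "normal", "tumor", "tumor"]
ranking = rank_genes(toy_expression, toy_labels)
```

In this toy example the informative gene is ranked first, which is the ordering that step 3 relies on before the fuzzy classification in step 4.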
The microarray data for the gene expression elements are computed based on the fuzzy expert system (FES), whose architecture is illustrated in Figure 2. With the FES, the MFs are aggregated using the if-then rules to derive decisions regarding the data. The prediction class is indicated using discrete sample labels with nonlinear mapping of the input and output spaces.
The FES model consists of the constructed if-then linguistic general form “IF A THEN B.” The proposition of variables A and B is estimated with the rule set B for premises and consequences. The uncertainty between the variables is evaluated based on the linguistic variable and fuzzy set value if-then. The FL model is crucial for accumulating and summarizing data to derive decision-relevant information. The universal fuzzy set
In
Here, the gene input value is represented by
The relationships between the variables are computed using the Cartesian product universal set model. The universal set X is evaluated with the fuzzy set A with ordered pairs based on
In this case, the input variable x in A is denoted as
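The idea of a fuzzy set as ordered pairs of a value and its membership degree, combined through if-then rules, can be illustrated with a short Python sketch. The triangular membership shapes, the normalized expression range [0, 1], the use of the minimum for a conjunction, and all names below are assumptions made for illustration; the paper does not specify them in the text above.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b.

    Setting a == b or b == c gives a left or right shoulder.
    """
    if x <= a or x >= c:
        return 1.0 if x == b else 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Linguistic labels over a normalized expression range [0, 1] (assumed).
LABELS = {
    "low": (0.0, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.0),
}


def fuzzify(x):
    """Map a normalized expression value to its degree in each label."""
    return {name: triangular(x, *params) for name, params in LABELS.items()}


def rule_strength(sample, antecedents):
    """Firing strength of 'IF gene1 is L1 AND gene2 is L2 ... THEN class'.

    The AND is taken as the minimum of the memberships (a common choice).
    """
    return min(fuzzify(sample[gene])[label] for gene, label in antecedents)


memberships = fuzzify(0.25)
strength = rule_strength(
    {"gene_1": 0.25, "gene_2": 1.0},
    [("gene_1", "low"), ("gene_2", "high")],
)
```

A sample is then assigned to the class of the rule with the largest firing strength; samples for which no rule fires remain unclassified and can be revisited in a later iteration.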
The MFSO PSO model uses food sources for the phases through a random search model to derive the decision-making process for the selection of food sources. The PSO model used to select features from the gene data is illustrated in Figure 3.
The MFSO model computes the optimal rule-set model with PSO. The estimation of the variable is based on the optimal rule set with combinational optimization features in the genes. Particle swarm features are computed based on the tuned MF for the allocated points uniformly in the gene expression range. Tuning with the membership model is evaluated using continuous optimization features in the dataset. The optimal variables eliminate the overall relationship between gene features, as presented in Figure 4.
The PSO model initializes the parameter values by computing the particles in parts that are within the permissible range. A particle swarm within a permissible range is evaluated and effectively processed using the constructed path. The path explored in the optimization model features is evaluated based on a combination of discrete values to derive the decision values based on the objective function. The objective function features are estimated using the objective function model. If the process is yes, the optimal path is selected, and the variables are computed with an updated pheromone and a permissible range set using the iteration process. Each iteration value is evaluated for a random particle swarm to derive the optimum solution.
The particle swarm between variables is used to compute the shortest optimal path for the computation of the edges to estimate the optimal path. To minimize the interaction between the swarms in the region, the movement values are based on global and local functions to derive the optimal path in the region. The flow of the PSO model is presented in Figure 5.
The MFSO model is computed based on the initial variable matrix to obtain the optimal solution for gene features. The gene features considered in this model are blood cells, pancreatic islets, liver cells, and neurons. The initial random solution space primary matrix is computed using
In
The PSO model demonstrated enhanced estimation of the swarm path movement through the utilization of a self-estimation model derived using
MFSO is performed using the iteration value denoted as
Based on the PSO algorithm, a new swarm movement is computed using
The swarm path
In
Therefore, the maximal fitness function
The feature computation is performed using the greedy selection model, considering both old and new fitness values. The relative features of the variables are evaluated with global and local feature estimations using a fuzzy set model. Through feature estimation, the optimal cost for the features is determined to improve processing speed. The termination condition is set to achieve the maximum number of iterations, allowing the evaluation of the output and derivation of the best solution features through the iteration process. The pseudocode representation of the proposed MFSO model is as Algorithm 1.
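The loop described above (swarm movement guided by an objective function, greedy replacement of the old best by a better new solution, and termination at a maximum number of iterations) can be sketched as a binary PSO for feature selection. This is a generic sketch under stated assumptions, not the authors' MFSO: the sigmoid transfer function, the parameter values, and the toy fitness built from hypothetical MI scores are all illustrative choices.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iter=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize fitness(mask) over 0/1 feature masks with a binary PSO."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):  # termination: maximum number of iterations
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: velocity becomes a probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            # Greedy selection: keep the new solution only if it is better.
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Hypothetical MI scores for five genes; the penalty discourages large subsets.
mi_scores = [0.9, 0.05, 0.8, 0.02, 0.01]


def toy_fitness(mask):
    selected = sum(m for m, s in zip(mi_scores, mask) if s)
    return selected - 0.1 * sum(mask)


best_mask, best_fit = binary_pso(toy_fitness, len(mi_scores))
```

In a real pipeline the toy fitness would be replaced by the classification performance of the fuzzy rule set built on the selected genes, which is what couples the swarm search to the FES.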
The MFSO model utilizes a fuzzy-set model to classify a microarray dataset consisting of five different datasets: type 2 diabetes (T2D), colon cancer, prostate cancer, insulin sensitive resistance (ISR), and leukemia. The dataset comprises a two-class gene expression profile that is publicly accessible. Table 1 provides an overview of the microarray dataset used for feature selection.
The results of the proposed MFSO model are obtained through 30 independent trials to compute unique values and estimate the control parameters and convergence values. The PSO model feature is estimated based on the optimization function that considers swarm movement, as shown in Table 2.
Table 11. Comparison of membership function (mean number of membership functions per optimization method).
Dataset | BCGA | PSO | GSA | HCA | MFSO |
---|---|---|---|---|---|
T2D | 0.11 | 0.9 | 2.3 | 2.13 | 4.08 |
ISR | 0.6 | 1.3 | 2.2 | 2.28 | 3.98 |
Colon | 0.1 | 1.1 | 2.1 | 2.23 | 3.45 |
Leukemia | 0.2 | 1.3 | 2.1 | 2.46 | 3.06 |
Prostate | 0.4 | 1.2 | 2.3 | 2.69 | 4.67 |
The MFSO model utilizes the fuzzy set rule to define or generate a specific set of rules for eliminating irrelevant genes when deriving outcome measures. The convergence learning values for the proposed MFSO model are listed in Table 3.
The performance of the proposed MFSO model is evaluated against conventional optimization models in the task of feature extraction. The optimization models considered for the analysis are the binary coded genetic algorithm (BCGA), genetic swarm optimization (GSA), real coded genetic algorithm (RCGA), and PSO. It is observed that the proposed MFSO model increased the fitness function value for an iteration value between 40 and 70. The results indicate that the optimal number of iterations for the microarray dataset is 140. The average and worst fitness values computed from the label features of the microarray dataset increase significantly, as presented in Table 4.
The gene id (GID) generates the MF estimation, as presented in Table 4. The execution MF increases with the simultaneous estimation of the fine-tuned gene number and the MF through the mapped linguistic label in the self-evident variable’s computation features. The MFSO model uses a fuzzy set rule assessment scheme to interpret the feature set in a microarray dataset. The sample microarray dataset features are used to estimate the optimal rule set value based on the MFSO model, as indicated in Table 5.
Among the provided rule datasets, T2D Rule set 3 demonstrates the highest accuracy, which can be attributed to several factors.
• The value of Ncorrect: T2D Rule set 3 exhibits a relatively high number of correctly classified instances (Ncorrect = 13). This indicates the effectiveness of the rules in accurately predicting the target variable or classifying instances.
• Value of coverage: Although T2D Rule set 3 has a lower coverage value (coverage = 47.32), it focuses on specific patterns or conditions in the data, resulting in higher accuracy. By targeting these specific instances or patterns, this rule set achieves improved accuracy in classification.
• Rule set complexity: T2D Rule set 3 may have a higher complexity than other rule sets, either in terms of the number of rules or the complexity of individual rules. This increased complexity enables a more fine-grained and accurate classification of instances, leading to higher overall accuracy.
• Training data characteristics: The training data used to generate T2D Rule set 3 possesses distinct patterns or characteristics that are captured well by the rules in this particular rule set. These patterns are likely indicative of the target variable, contributing to a higher accuracy in prediction.
Notably, the accuracy of a rule set can vary depending on the specific dataset and problem domain. The aforementioned factors collectively contribute to the higher accuracy observed in T2D Rule set 3 within the given dataset.
Additionally, the interpretability variables of the proposed MFSO model are computed for the estimated features. The MFSO model utilizes estimated microarray feature selection, which is comparatively presented in Table 6.
A comparative analysis is conducted to assess the convergence values of the proposed MFSO model in comparison to existing GA and HCA models. The results indicate that the proposed MFSO model significantly outperforms the conventional GA and HCA models. The generated fuzzy set model enabled the identification of the minimal number of genes required. The results indicate that the proposed MFSO model achieves higher interpretability than the GA and HCA models. The generalized computations of the dataset features across different datasets are presented in Table 7.
Furthermore, the comparative analysis of the generalization ability reveals that the MFSO model exhibits a minimal number of incorrectly classified samples for the dataset. Thus, the proposed MFSO model significantly enhances the performance of existing methods for disease diagnosis. Table 8 provides a comparison of the sensitivity and specificity of the MFSO model.
The comparative analysis of specificity and sensitivity values for different datasets using the proposed MFSO model demonstrates higher values compared to existing classification models (Figure 6). Additionally, the analysis of false-positive and false-negative values shows significantly lower values compared to existing classifier models. Table 9 presents the mean coverage values for the different datasets.
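The sensitivity, specificity, false-positive, and false-negative values discussed here follow from the confusion matrix of a two-class problem. The snippet below is a minimal sketch of that computation; the label names and the toy predictions are made up for illustration and are not results from the paper.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Toy labels (illustrative only).
y_true = ["tumor", "tumor", "tumor", "normal", "normal"]
y_pred = ["tumor", "tumor", "normal", "normal", "tumor"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="tumor")
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

With few samples per class, as is typical for microarray data, a single misclassified sample shifts both rates noticeably, which is why they are usually reported together with the raw false-positive and false-negative counts.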
The mean coverage values for different datasets indicate that the MFSO model achieves a higher mean coverage value. Specifically, for the T2D dataset, the mean coverage is estimated to be 35, which is substantially higher than that of the existing model. Table 10 provides a comparative estimation of variables with different features. The MFs estimated using the proposed MFSO model are listed in Table 11.
The results reveal that the proposed MFSO model achieves more variables compared to the features of the conventional classifier model. The feature selection estimation with the proposed MFSO model is effective, and the computation of MFs for the features yields higher values compared to the existing model.
Figure 1. Flow diagram showing MI computation in the MFSO.
Figure 2. Design of the fuzzy system.
Figure 3. Flow chart of the PSO model.
Figure 4. Representation of solution variable in the hybrid ant stem algorithm.
Figure 5. Flow chart of the PSO.
Figure 6. Comparison of (a) sensitivity, (b) specificity, (c) false positive, and (d) false negative.
Algorithm 1. MFSO feature selection in microarray.
Begin |
for |
Compute the fitness value of swarm optimization P using an objective function |
Estimate the fitness function based on the swarm objective function |
Repeat the fuzzy set function |
Calculate the swarm movement in the objects: |
Estimate the fitness value of the particle swarm function |
Compare the optimal value |
Perform the differentiation process for the MF |
End for |
return |
End |
Table 1. Dataset features.
Dataset | Sample count | Gene | Labels | Class samples |
---|---|---|---|---|
ISR | 11 | 6,783 | IS | 55 |
 | | | IR | 55 |
T2D | 33 | 21,793 | DM2 | 16 |
 | | | NGT | 16 |
Colon cancer | 59 | 1,974 | Tumor | 38 |
 | | | Normal | 23 |
Leukemia | 69 | 6,943 | ALL | 43 |
 | | | AML | 29 |
Prostate | 106 | 11,678 | Tumor | 49 |
 | | | Normal | 53 |
IS, insulin sensitivity; IR, insulin resistance; DM, diabetes mellitus; NGT, normoglycaemia; ALL, acute lymphocytic leukemia; AML, acute myeloid leukemia.
Table 2. Estimation of features using particle swarm optimization.
Parameters | Value |
---|---|
Delay factor | 0.5 |
Number of swarm | 75 |
Minimal value | 0.02 |
Error factor | 1 |
Csize | 27 |
Table 3. Interpretability of the MFSO based on rules.
Dataset | Rule set | Ncorrect | Ncovers | Coverage | Accuracy (%) |
---|---|---|---|---|---|
ISR | Rule 1 | 5 | 6 | 78 | 94 |
ISR | Rule 2 | 6 | 8 | 91 | 90 |
T2D | Rule 1 | 14 | 18 | 83 | 91 |
T2D | Rule 2 | 9 | 11 | 86 | 89 |
T2D | Rule 3 | 4 | 17 | 88 | 93 |
T2D | Rule 4 | 5 | 9 | 72 | 93 |
Table 4. Gene rule set with the MFSO.
Dataset | Gene ID | #MF | Labels |
---|---|---|---|
T2D | NM_021131.1 | 1 | medium |
T2D | NM_0223491.1 | 2 | low, medium |
T2D | AW291218 | 1 | low |
T2D | BC000229.1 | 1 | low |
T2D | AL523575 | 1 | high |
T2D | NM_005260.2 | 1 | high |
ISR | D32129_f_at | 2 | low, medium |
ISR | Z83805_at | 1 | medium |
ISR | Y09615_at | 2 | low, high |
ISR | X07730_at | 2 | medium, high |
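To make the low, medium, and high labels in Table 4 concrete, the following is a minimal sketch of how an if-then rule could be evaluated against membership functions. It assumes triangular membership functions on expression values normalized to [0, 1] and uses the minimum as the fuzzy AND; the breakpoints in `TERMS`, the gene keys `gene_a` and `gene_b`, and the rule itself are hypothetical and do not come from the paper.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a <= b <= c)."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Assumed linguistic terms on expression values normalized to [0, 1].
TERMS = {
    "low": (0.0, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.0),
}


def membership(value, term):
    """Degree to which a normalized expression value belongs to a linguistic term."""
    return triangular(value, *TERMS[term])


def firing_strength(rule, sample):
    """Degree to which a sample satisfies every antecedent (min as fuzzy AND)."""
    return min(membership(sample[gene], term) for gene, term in rule["if"].items())


# Hypothetical rule: IF gene_a is medium AND gene_b is low THEN class "positive".
rule = {"if": {"gene_a": "medium", "gene_b": "low"}, "then": "positive"}
sample = {"gene_a": 0.5, "gene_b": 0.25}
print(rule["then"], firing_strength(rule, sample))
```

A sample whose firing strength is zero for every rule would be left unclassified, which is one plausible reading of the unclassified column reported in the LOOCV results.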
Table 5. Rule set with the MFSO.
Dataset | Rule set | Value of Ncovers | Value of Ncorrect | Value of Coverage | Accuracy of Rule (%) |
---|---|---|---|---|---|
ISR | 1 | 8 | 6 | 77 | 94.56 |
ISR | 2 | 10 | 8 | 90 | 98.73 |
T2D | 1 | 21 | 17 | 74.63 | 97.64 |
T2D | 2 | 17 | 11 | 69.84 | 98.74 |
T2D | 3 | 21 | 13 | 47.32 | 99.04 |
T2D | 4 | 9 | 9 | 38.94 | 98.74 |
Table 6. Comparison of interpretability based on datasets.
Dataset | Mean coverage (GA) | Mean coverage (HCA) | Mean coverage (MFSO) | Number of variables (GA) | Number of variables (HCA) | Number of variables (MFSO) | Membership function (GA) | Membership function (HCA) | Membership function (MFSO) |
---|---|---|---|---|---|---|---|---|---|
T2D | 21.3 | 25.83 | 36.85 | 2.9 | 3.98 | 16.84 | 0.45 | 3.78 | 22.74 |
ISR | 25.8 | 228.93 | 39.43 | 3.2 | 4.94 | 19.32 | 0.83 | 3.74 | 29.83 |
Table 7. Generalization ability.
Dataset | Approach | LOOCV correct | LOOCV incorrect | LOOCV unclassified |
---|---|---|---|---|
ISR | GA | 84.3 | 5.4 | 10.3 |
ISR | PSO | 87.8 | 5.9 | 6.3 |
ISR | HCA | 92.4 | 4.3 | 3.3 |
ISR | MFSO | 98.2 | 0.8 | 0.8 |
T2D | GA | 44.11 | 29.41 | 26.48 |
T2D | PSO | 85.3 | 8.82 | 5.88 |
T2D | HCA | 94.11 | 2.8 | 3.09 |
T2D | MFSO | 98.4 | 0.7 | 1.1 |
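The correct, incorrect, and unclassified split in Table 7 can be illustrated with a minimal leave-one-out cross-validation (LOOCV) sketch. It assumes the three columns are percentages of held-out samples and substitutes a toy nearest-centroid classifier with a reject option for the fuzzy expert system; `fit_centroids`, `predict_with_reject`, the `margin` value, and the data are all hypothetical.

```python
def fit_centroids(xs, ys):
    """Toy 'training': the mean of a single feature per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_reject(centroids, x, margin=0.2):
    """Nearest centroid, or None when the two closest classes are within margin."""
    ranked = sorted((abs(x - c), y) for y, c in centroids.items())
    if len(ranked) > 1 and ranked[1][0] - ranked[0][0] < margin:
        return None  # ambiguous sample: left unclassified
    return ranked[0][1]


def loocv(xs, ys, fit, predict):
    """Hold out each sample once; return percentages per outcome."""
    counts = {"correct": 0, "incorrect": 0, "unclassified": 0}
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        pred = predict(model, xs[i])
        if pred is None:
            counts["unclassified"] += 1
        elif pred == ys[i]:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    n = len(xs)
    return {k: 100.0 * v / n for k, v in counts.items()}


# Toy data (not from the paper): one feature, two classes, one borderline sample.
xs = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 2.0]
ys = [0, 0, 0, 1, 1, 1, 0]
print(loocv(xs, ys, fit_centroids, predict_with_reject))
```

Keeping `fit` and `predict` as parameters means any rule-based or fuzzy classifier that returns `None` when no rule fires can be evaluated with the same loop.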
Table 8. Comparison of classification performance.
Dataset | Approach | # IG | Sensitivity | Specificity | False positive | False negative |
---|---|---|---|---|---|---|
ISR | MI-PSO | 9 | 0.985 | 0.835 | 0.038 | 0.068 |
ISR | MI-HCA | 5 | 0.993 | 0.797 | 0.184 | 0.296 |
ISR | MI-MFSO | 5 | 0.986 | 0.825 | 0.059 | 0.083 |
T2D | MI-PSO | 10 | 0.983 | 0.873 | 0.048 | 0.63 |
T2D | MI-HCA | 6 | 0.993 | 0.926 | 0.041 | 0.783 |
T2D | MI-MFSO | 6 | 0.997 | 0.948 | 0.028 | 0.73 |
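For reference, the textbook definitions of the four measures named in Table 8 can be sketched as follows. This is an illustration of the standard confusion-matrix formulas only; the paper does not state how its false positive and false negative columns were computed, so this sketch is not claimed to reproduce the tabulated values. The function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, and false positive/negative rates from labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    nan = float("nan")
    return {
        # Fraction of actual positives that were detected.
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        # Fraction of actual negatives that were correctly rejected.
        "specificity": tn / (tn + fp) if tn + fp else nan,
        # Complements of the two rates above.
        "false_positive_rate": fp / (fp + tn) if fp + tn else nan,
        "false_negative_rate": fn / (fn + tp) if fn + tp else nan,
    }


# Toy labels (not from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Under these definitions the false positive rate equals one minus the specificity, and the false negative rate equals one minus the sensitivity.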
Table 9. Comparison of mean coverage.
Dataset | BCGA | PSO | GSA | HCA | MFSO |
---|---|---|---|---|---|
T2D | 26.4 | 23.6 | 24.6 | 23.73 | 36.82 |
ISR | 23.7 | 27.2 | 27.3 | 28.42 | 41.75 |
Colon | 34.67 | 37.93 | 34.9 | 31.35 | 40.94 |
Leukemia | 32.82 | 39.1 | 37.1 | 33.94 | 44.88 |
Prostate | 28.63 | 26.4 | 28.4 | 36.72 | 53.94 |
Table 10. Comparison of the number of variables (#V).
Dataset | BCGA | PSO | GSA | HCA | MFSO |
---|---|---|---|---|---|
T2D | 4.2 | 3.8 | 4.3 | 5.38 | 14.63 |
ISR | 3.8 | 3.9 | 4.9 | 5.16 | 13.94 |
Colon | 4.2 | 5.3 | 5.1 | 5.53 | 10.94 |
Leukemia | 4.9 | 5.2 | 5.9 | 5.49 | 9.84 |
Prostate | 5.1 | 4.1 | 5.6 | 5.28 | 10.35 |
Table 11. Comparison of the mean number of membership functions.
Dataset | BCGA | PSO | GSA | HCA | MFSO |
---|---|---|---|---|---|
T2D | 0.11 | 0.9 | 2.3 | 2.13 | 4.08 |
ISR | 0.6 | 1.3 | 2.2 | 2.28 | 3.98 |
Colon | 0.1 | 1.1 | 2.1 | 2.23 | 3.45 |
Leukemia | 0.2 | 1.3 | 2.1 | 2.46 | 3.06 |
Prostate | 0.4 | 1.2 | 2.3 | 2.69 | 4.67 |