## Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 325-338

Published online September 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.3.325

© The Korean Institute of Intelligent Systems

## Ensemble Rumor Text Classification Model Applied to Different Tweet Features

Amit Kumar Sharma1,2, Rakshith Alaham Gangeya1, Harshit Kumar1, Sandeep Chaurasia1, and Devesh Kumar Srivastava3

1Department of Computer Science and Engineering, Manipal University Jaipur, India
2Department of Computer Science and Engineering, ICFAI Tech School, ICFAI University, Jaipur, India
3Department of Information Technology, Manipal University Jaipur, India

Correspondence to :
Sandeep Chaurasia (sandeep.chaurasia@jaipur.manipal.edu)

Received: April 21, 2022; Revised: July 11, 2022; Accepted: August 11, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Today, social media has evolved into user-friendly and useful platforms for spreading messages and receiving information about different activities. Social media users are exposed to billions of messages daily, and checking the credibility of such information is a challenging task. False information or rumors are spread to misguide people. Previous approaches relied on users or third parties to flag suspect information, but this was highly inefficient and redundant. Moreover, previous studies have focused on rumor classification using state-of-the-art and deep learning methods with different tweet features. This analysis focused on combining three different feature models of tweets and suggested an ensemble model for classifying tweets as rumor or non-rumor. In this study, the experimental results of four different rumor and non-rumor classification models, all based on neural networks, were compared. The strength of this research is that the experiments were performed on different tweet features, such as word vectors, user metadata features, reaction features, and ensemble model probabilistic features, and the results were classified using layered neural network architectures. The correlations of the different features determine the importance and selection of useful features for experimental purposes. These findings suggest that the ensemble model performed well and provided better validation accuracy and better results for unseen data.

Keywords: Rumor, Word embedding, User metadata features, Reaction features, Ensemble probabilistic features, Feature scaling, Correlation, Bi-LSTM, Neural network

A rumor is described as the circulation of false messages of events, objects, and issues to a group of people. Social media is the easiest platform for propagating rumors or false information in the public [14]. Twitter has evolved as a significant source of information sharing across all social media platforms where, on average, millions of tweets are shared per day, and each user can tweet up to 140-character long messages or share files, such as pictures or videos. These tweets sometimes contain relevant information, such as public announcements by governing organizations and sharing of events, personal experiences, beliefs, and thoughts. Camouflaged among these data, rumors are widely spreading and destroying the credibility of this information. This problem is elevated owing to crowdsourcing, where a rumor can be backed by many users because of their lack of knowledge regarding the matter and beliefs. To address this issue, much work has been done in the research community to find a prominent solution. This research attempts to understand the root causes of rumor generation on Twitter and which features may help classify rumors. In previous studies, machine learning methods, such as logistic regression, decision tree, support vector machine (SVM), naive Bayes, and k-nearest neighbor, have been widely used for text categorization [1,5]. Word-embedding approaches have been widely used for word vector generation [5]. In this context, deep learning methods including recurrent neural networks (RNN), long short-term memory (LSTM), bidirectional LSTMs (Bi-LSTM), gated recurrent units, and convolutional neural networks are also being applied in the field of text analysis [6]. Previous research has focused on word vector features, and recently, researchers have focused on other features of tweets. In this manner, user metadata and reaction (sentiment) features are used to obtain accurate results. 
User metadata have numerical and categorical features, which are important for identifying rumors. These features have different value ranges and must therefore be scaled. Log transformation and standardization are widely used for feature scaling. The correlation between the features or attributes helps identify which features are useful for classification. In this study, four models that use different features of tweet data and classify tweets as rumor or non-rumor using neural network architectures were analyzed. All the models were developed as supervised learning tasks on publicly available Twitter datasets, from which tweet IDs and labels were gathered; the tweets were then collected in JSON format through the Twitter API using the respective tweet IDs.

Previous studies focused mainly on sequential text features for classifying rumors and non-rumors. They left open research questions, such as how to automatically identify rumor text in large unstructured data and what the main focus of research on automatically identifying rumor texts should be. This study addresses these questions by considering reaction and user metadata features along with the text features or attributes of the tweets. Previous research was based on word-embedding features of the text, and some researchers suggested sentiment features of the post; in this study, other features were also used as input to the neural network architecture. Because the metadata features, reaction features, and sentences cannot be combined directly, we trained a separate model for each feature type; we then combined the probabilistic weights of the different models and trained a final model on the same dataset.

The primary goal of this research is to identify the tweet attributes or features, such as tweet text, user information, and other tweet metadata, by which the veracity of a tweet can be determined, and to quantify the numerical importance of these features or attributes. A further goal was to develop an automated method for accurately identifying rumors and to compare it with existing models that employ a variety of combined features from tweets, including word embedding, reaction, and metadata features. This work focuses on finding attribute importance for classification, whereas previous work aimed to find better techniques for classification.

The objective of this study was to identify important features to distinguish rumor-spreading tweets from authentic tweets. For this purpose and because the dataset comprised various data formats, four models were developed. (1) Bi-LSTM model for tweet text, which was mapped to the target class to check how much tweet text can classify tweets as rumors. (2) Neural model for tweet metadata trained with attributes such as user information (e.g., user account age and user follower count) and tweet information (such as count, retweet count, share count, and tweet text sentiment), which was also mapped to the target class to check how much tweet metadata can classify tweets as rumors. (3) Bi-LSTM model for comments on tweets; for each tweet, the first 20 comments were collected and used to train the model to understand the impact of comments on tweet rumor classification. (4) Ensemble model to merge the above models and train them with the target class to determine whether combining models trained on different parts of tweets has an effect.

The remainder of this paper is organized as follows: Section 2 contains related works. The conceptual underpinning of this study is presented in Section 3, which also provides a conceptual background for areas related to model development. Section 4 defines the loss function, correlation functions, weight normalization techniques, and other methods used in model development. It includes the mathematical equations of the techniques and the justification of their use in our model development. Finally, the proposed alternative methods of rumor classification are also presented in Section 4. The experimental evaluation is presented in Section 5, and an analysis of the results is presented in Section 6. Section 7 concludes the paper by outlining the future goals of the research.

In previous studies, models for rumor detection using state-of-the-art deep learning approaches with different text features, metadata features, and user profile features were proposed. In [1], a future research direction on rumor detection that was data-oriented, feature-oriented, model-oriented, and application-oriented was suggested. To learn hidden representations, researchers [6] suggested a model based on RNNs. Owing to extra hidden layers, experimental results on datasets from two real-world microblog platforms showed that the RNN outperformed state-of-the-art rumor detection models that used handcrafted features and that the RNN-based method detected rumors more quickly and accurately than existing techniques. A unique machine learning-based approach for the automatic detection of users distributing rumor information was proposed in [3]. The study used the concept of believability and the retweet network using word embedding to ensure user trust. In another study [7], researchers combined a feature engineering and deep learning model and developed a hierarchical social attention model that led to the improvement of rumor detection. In [8], a hybrid SVM classifier was proposed based on a graph kernel that captures high-order propagation patterns as well as semantic information, such as topics and sentiments. A deep attention model based on RNNs was presented in [1] to learn hidden representations of consecutive posts for spotting rumors. The proposed model used soft attention to probe recurrence to extract different features with a specific emphasis while simultaneously producing hidden representations that capture contextual fluctuations of relevant posts over time. Another study [9] investigated whether historical data-based knowledge could potentially aid in the detection of freshly formed rumors. 
We found and applied patterns from prior labeled data to assist in detecting emerging rumors because a disputed factual claim arouses specific behaviors, such as curiosity, skepticism, and astonishment. In [10], experiments were conducted to craft new features representing rumor diffusion and use them to correlate the different features. In the early detection of rumors, a study [10] proposed a probabilistic model based on eminent features, such as user and linguistic features. In [11], a method for early rumor identification that used convolutional neural networks to learn the hidden representations of individual rumor-related tweets to gain insights about their believability was proposed. Another study [12] presented a unique rumor detection approach that learned from the sequential reporting of news on social media to detect rumors in fresh stories. When compared with a state-of-the-art rumor detection system, the results show that the proposed classifier outperformed the state-of-the-art classifier. Furthermore, researchers [6] presented the propagation tree kernel, a kernel-based technique that captures high-order patterns distinguishing different types of rumors by comparing their propagation tree architectures. State-of-the-art rumor detection algorithms were outperformed by the proposed kernel-based methodology, which detected rumors faster and more accurately. Previous research has shown the importance of different social media features to help obtain more accurate results of rumor classification. A study [13] suggested a statistical and automatic approach based on a wide range of different attributes. In this approach, user profile attributes are used to develop a supervised learning model for rumor detection.

### 3.1 Data Pre-processing

Real-world databases are highly susceptible to noisy, missing, incomplete, and inconsistent data owing to their large size and origin from various heterogeneous sources. Low-quality data result in low-quality mining outcomes. Data cleansing, integration, transformation, and reduction are just a few of the data pre-processing techniques accessible. In this research, many tweets contain noise and inconsistencies, for example,

• Many tweet texts and comments were of different origins and contained numerical data, emojis, and shortened words, which needed to be replaced by the correct words.

• Most metadata were skewed.

• Text was vectorized using word embedding.

### 3.2 Features of Tweets

A tweet has different types of features that are used for information extraction. In previous studies, tweet text features using word-embedding methods were used to generate word vectors and analyze the results. In the present study, the focus was on tweet user features [7], tweet metadata features, and sentimental features (follower count, reply count, retweet count, verified, favorite count, number of symbols, profile image URL, user mentions, number of hashtags, default profile image, number of URLs, polarity, text length, location, post age, status count, friend count, account age, favorite weighted polarity, retweet-weighted polarity, and favorite retweet-weighted polarity) [4,14].

### 3.3 Word Embedding

In tweet text analysis, a word-embedding model is used to find the vectors of tweets, and these vectors are the input for text classification methods. In previous studies, TF-IDF weighting and the word2vec, GloVe, and fastText word-embedding models have been widely used for text classification. In this study, the fastText word-embedding model was used for word vector generation [5,15,17]. This word-embedding method is an extension of the word2vec model; it represents each word as a bag of character n-grams instead of learning vectors for words directly.
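To make the character n-gram idea concrete, the following is a minimal sketch of how a word can be split into n-grams with boundary markers. The function name `char_ngrams` and the default range of 3 to 6 characters are assumptions for illustration; the sketch omits the hashing and vector lookup that the actual fastText library performs.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` with '<' and '>' boundary markers."""
    marked = f"<{word}>"  # the markers let prefixes and suffixes be told apart
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


# Trigrams only, to keep the output short.
trigrams = char_ngrams("where", 3, 3)
print(trigrams)
```

Because a word vector is built from its n-grams, a misspelled or unseen word in a tweet still shares most of its n-grams with known words, which is one reason this family of embeddings suits noisy tweet text.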

### 3.4 Data Scaling

For the non-categorical skewed features, the log-transform method and the standard scaler method were used. The log-transform method reduces skewness by bringing the data closer to a normal distribution, which makes prediction more reliable. The standard scaler method standardizes a feature by subtracting the mean and dividing all values by the standard deviation to obtain unit variance.
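The two scaling steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `log1p` (that is, log(1 + x)) is used so that zero counts are handled, which the text does not specify, and the follower counts are made-up values.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to compress the long right tail of count features."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mean) / std for v in values]


# Hypothetical, heavily skewed follower counts.
follower_counts = [12, 150, 980, 25_000, 1_200_000]
scaled = standardize(log_transform(follower_counts))
print(scaled)
```

After both steps the feature has zero mean and unit variance, so a single account with millions of followers no longer dominates the input to the network.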

### 3.5 Correlation Methods

To find the correlation between user metadata features or attributes, the point-biserial form of the Pearson correlation was used. It measures the correlation between a continuous variable and a binary variable (Eq. 1).

$$r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{pq},$$

where $M_1$ represents the mean of the positive binary variable group, $M_0$ represents the mean of the group with a negative binary variable, $s_n$ represents the standard deviation, $p$ represents the proportion of cases in the “0” group, $q$ represents the proportion of cases in the “1” group, and the resultant correlation coefficient ranges between −1 and 1. If $r_{pb} > 0$, it signifies a positive correlation between the features; if $r_{pb} < 0$, it signifies a negative correlation.
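As a rough illustration of Eq. 1, the sketch below computes the point-biserial coefficient for one continuous feature against a binary label and checks it against the ordinary Pearson correlation. The feature values and labels are made up, and the standard deviation is taken as the population standard deviation (dividing by n), which is the form for which the two quantities coincide.

```python
import math


def point_biserial(values, labels):
    """Point-biserial correlation between a continuous feature and 0/1 labels."""
    n = len(values)
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    mean = sum(values) / n
    s_n = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    p = len(group0) / n  # proportion of cases in the "0" group
    q = len(group1) / n  # proportion of cases in the "1" group
    return (m1 - m0) / s_n * math.sqrt(p * q)


def pearson(xs, ys):
    """Plain Pearson correlation, used here only as a cross-check."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical scaled feature values and rumor (1) / non-rumor (0) labels.
feature = [2.1, 3.4, 1.0, 5.6, 4.8, 0.7]
label = [0, 1, 0, 1, 1, 0]
r_pb = point_biserial(feature, label)
r = pearson(feature, label)
print(r_pb, r)
```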

### 3.6 Activation Function in Layered Architecture

The activation function of a layer determines the output of the layer given a set of inputs. The rectified linear unit (ReLU) and sigmoid activation functions were used in the implemented neural network models. The ReLU function returns its input directly if the input is positive; otherwise, it returns 0, so the output range is [0, ∞) (Eq. 2).

$$y = \begin{cases} 0 & \text{for } x \le 0, \\ \alpha x & \text{for } x > 0, \end{cases}$$

where $\alpha = 1$ for the standard ReLU.

The sigmoid function is a real-valued, bounded, and differentiable function that is defined for all real input values, and its output range is (0, 1) (Eq. 3).

$$y = \frac{1}{1 + e^{-x}}.$$
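Both activation functions (Eqs. 2 and 3) are short enough to write out directly. The sketch below is illustrative only; the `alpha` parameter mirrors the slope in Eq. 2 and defaults to 1, and the sigmoid is written in a form that avoids overflow for large negative inputs.

```python
import math


def relu(x, alpha=1.0):
    """Return alpha * x for positive inputs and 0 otherwise."""
    return alpha * x if x > 0 else 0.0


def sigmoid(x):
    """Logistic function, split by sign so that exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


print(relu(-2.0), relu(3.0))
print(sigmoid(0.0), sigmoid(2.0))
```

In the models described later, ReLU is used in the hidden layers and the sigmoid in the final one-node layer, where its output can be read as the probability that a tweet is a rumor.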

### 3.7 Weight Initialization

Weight initialization is the procedure of setting the initial weights of a neural network, and it was used in the implemented neural network models. Two weight initialization methods were used: He-normal and Glorot-normal. In He-normal, the weights are initialized by considering the size of the previous layer, which helps gradient descent converge faster and more reliably. It draws weights from a normal distribution centered on 0 with a standard deviation calculated from the formula below. For the ReLU function,

$$Y = w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots + w_n x_n, \qquad \sigma = \sqrt{\frac{2}{\text{fan}_{in}}},$$

where $Y$ is the output of the layer before applying the activation function, $(w_1, w_2, w_3, \ldots, w_n)$ are the weights of the respective nodes $(x_1, x_2, x_3, \ldots, x_n)$, and $\text{fan}_{in}$ is the number of input units.

In Glorot-normal, both $\text{fan}_{in}$ and $\text{fan}_{out}$ are used to determine the initial weights of the nodes. It draws weights from a normal distribution centered on 0 with a standard deviation calculated from the formula below:

$$\sigma = \sqrt{\frac{2}{\text{fan}_{in} + \text{fan}_{out}}},$$

where $\text{fan}_{in}$ is the number of input units and $\text{fan}_{out}$ is the number of output units. The Glorot-normal initializer was used with sigmoid activation functions in the implemented neural networks.
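The two standard deviations can be compared with a few lines of Python. This is a sketch under assumptions: the layer sizes (64 inputs, 32 or 1 outputs) are chosen for illustration, the helper names are hypothetical, and weights are drawn from a plain normal distribution, whereas deep learning libraries may use a truncated variant.

```python
import math
import random


def he_normal_std(fan_in):
    """Standard deviation for He-normal initialization."""
    return math.sqrt(2.0 / fan_in)


def glorot_normal_std(fan_in, fan_out):
    """Standard deviation for Glorot-normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_weights(fan_in, fan_out, std, seed=0):
    """Draw a fan_in x fan_out weight matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


he_std = he_normal_std(64)             # e.g. a ReLU hidden layer with 64 inputs
glorot_std = glorot_normal_std(64, 1)  # e.g. a sigmoid output layer with 1 node
weights = init_weights(64, 32, he_std)
print(he_std, glorot_std)
```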

### 3.8 Batch Normalization

Batch normalization is a technique that transforms the data so that the mean becomes 0 and the standard deviation becomes 1. Even when the input to the first layer of the neural network is already normalized, the outputs of the activation functions may no longer be on the same scale, resulting in internal covariate shift in the data. To avoid this problem, batch normalization was applied to each input. Batch normalization is a two-step process in which the input is first normalized and then rescaled and offset. In this method, we first computed the mean of the inputs $h_i$ from layer $h$ (Eq. 7):

$$\mu = \frac{1}{m}\sum_{i=1}^{m} h_i,$$

where $h_i$ are the inputs from layer $h$, and $m$ is the number of neurons in layer $h$. Second, the standard deviation of the hidden activations is calculated using Eq. 8.

$$\sigma = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h_i - \mu\right)^2}.$$

Subsequently, the mean is subtracted from each input, and the result is divided by the sum of the standard deviation and the smoothing term $\epsilon$ to obtain the normalized data (Eq. 9).

$$h_{i(\text{norm})} = \frac{h_i - \mu}{\sigma + \epsilon}.$$

Finally, the rescaling parameter $\gamma$ and shifting parameter $\beta$ were incorporated (Eq. 10).

$$h_i = \gamma\, h_{i(\text{norm})} + \beta.$$
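Eqs. 7 to 10 can be traced step by step in a small sketch. It follows the form given in the text, dividing by the sum of the standard deviation and the smoothing term; many library implementations instead add the smoothing term to the variance inside the square root. The activation values and the value of `eps` are assumptions for illustration.

```python
import math


def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the activations h, then rescale by gamma and shift by beta."""
    m = len(h)
    mu = sum(h) / m                                        # Eq. 7
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / m)   # Eq. 8
    normalized = [(x - mu) / (sigma + eps) for x in h]     # Eq. 9
    return [gamma * x + beta for x in normalized]          # Eq. 10


# Hypothetical activations of one layer.
activations = [0.5, 2.0, -1.5, 3.0]
out = batch_norm(activations)
shifted = batch_norm(activations, gamma=2.0, beta=1.0)
print(out)
print(shifted)
```

With the default `gamma` and `beta` the output has zero mean; with learned values the network can restore whatever scale and offset suit the next layer.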

### 3.9 Loss Function

The loss function measures the error between the predicted and desired outputs. The primary loss function used for classification problems is cross-entropy [2,17]. In this study, we categorized tweets into rumor or non-rumor data; hence, binary cross-entropy was used (Eq. 11).

$$\text{Loss} = -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y}),$$

where $y$ is the desired output (the true label), and $\hat{y}$ is the predicted output.
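A minimal sketch of the binary cross-entropy in Eq. 11, averaged over a batch, is given below. The clipping constant `eps` is an assumption added to keep the logarithm finite when a prediction is exactly 0 or 1; the labels and predictions are made up.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
        total += -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)
    return total / len(y_true)


# One rumor (1) and one non-rumor (0), predicted well and predicted badly.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss is small when the predicted probability agrees with the label and grows sharply for confident mistakes, which is what drives the sigmoid output toward the correct class during training.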

### 3.10 Bi-LSTM Deep Classification

A Bi-LSTM is a sequence-processing model that consists of two LSTMs: one that processes the input in the forward direction and one that processes it in the backward direction. Bi-LSTMs effectively increase the amount of information available to the network and provide the algorithm with more context. This strategy lets the model take into account which words come before and after a given word in a sentence. It is an important method in text classification that has been used for sequential learning [7,18].

### 3.11 Datasets

In this research, the PHEME, Twitter15, and Twitter16 datasets were used to evaluate all models. The total number of tweets in these datasets is listed in Tables 1 and 2.

A collection of rumors and non-rumors reported on Twitter during breaking news events can be found in the PHEME dataset. It includes rumors about nine different events, and each rumor is annotated with its veracity value, which may be true, false, or unverified (Table 2).

The data originated from the PHEME, Twitter15, and Twitter16 datasets; the tweets were re-fetched over a period of 2 days to obtain the latest metadata values. No specific keyword was used for crawling; the data were fetched from Twitter using the tweet IDs present in the original datasets.

### 4. Proposed Models for Rumor Classification

Previous research has suggested text classification approaches that use different text and metadata features of tweets for rumor detection. In this study, we analyzed the results using four different models of tweet classification, all of which use different features. The first model used word vector features, the second used user and metadata features, the third used tweet reaction features, and the fourth was an ensemble model that combined all three models and extracted the probabilities of each model on complete datasets for tweet text classification. The loss and accuracy of these models were calculated using deep learning classification approaches. The class target in this research was the authenticity of tweets classified into whether a tweet was a rumor or not.

### 4.1 Rumor Text Classification using Word Vectors Features

In this model, word vectors were used for text sentence classification. The text was preprocessed by removing hyperlinks, emojis, hashtags, mentions, stop words, and numbers, replacing the contracted words with their expanded forms, and replacing the abbreviated words with their true forms. After these processes, tokenization and fastText word embedding were applied to create the embedding matrix. The generated vectors were input in the deep learning Bi-LSTM sequential methods. The effectiveness of this analysis was that the data input was applied to multiple hidden layers (Figure 1).
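The cleaning steps described above might look roughly like the following sketch. The contraction map and stop-word list are tiny illustrative stand-ins, not the resources used in the study, the example tweet is invented, and the expansion of abbreviations is left out.

```python
import re

# Illustrative stand-ins; a real pipeline would use much larger resources.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "and", "in", "it"}


def clean_tweet(text):
    """Lowercase, expand contractions, strip noise, and drop stop words."""
    text = text.lower().replace("\u2019", "'")   # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", " ", text)    # hyperlinks
    text = re.sub(r"[@#]\w+", " ", text)         # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)        # numbers, emojis, punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean_tweet(
    "BREAKING: it's confirmed, 3 hostages freed in Sydney! "
    "#sydneysiege @abcnews https://t.co/xyz"
)
print(tokens)
```

The resulting token list is what would then be mapped to integer indices and looked up in the fastText embedding matrix.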

### 4.2 Rumor Text Classification using User Metadata Features

In this model, tweet user metadata features were used for sentence classification. The Twitter features were pre-processed by applying the log-transformation method for data transformation and determining the correlation between the features using the Pearson correlation method. The generated features were input into the sequential TensorFlow model. The effectiveness of this analysis was that the data input was applied to multiple hidden layers (Figure 2).

### 4.3 Rumor Text Classification Using Reaction Features

In this model, tweet reaction features were used for sentence classification. In the first step, the tweet texts were cleaned in the same way as for text classification using word vectors, and then the tweet data were lemmatized. In the second step, the polarity of each tweet was computed, and weighted polarities were obtained by multiplying the polarity by the retweet count, the favorite count, and the sum of the favorite and retweet counts. After performing feature selection, it was observed that favorite-weighted polarity, retweet-weighted polarity, and favorite retweet-weighted polarity had a similar negative correlation with the label. Finally, in reaction-based feature selection, the favorite count, sentiment (polarity), and number of hashtags were used for classification. The generated features were input into the sequential TensorFlow model. The effectiveness of this analysis was that the data input was applied to multiple hidden layers (Figure 3).
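The three weighted polarities are simple products, as the sketch below shows. It assumes the polarity has already been computed by some sentiment tool and lies in [-1, 1]; the text does not name the tool, and the function name and example counts are hypothetical.

```python
def weighted_polarities(polarity, retweet_count, favorite_count):
    """Scale a tweet's sentiment polarity by its engagement counts."""
    return {
        "retweet_weighted_polarity": polarity * retweet_count,
        "favorite_weighted_polarity": polarity * favorite_count,
        "favorite_retweet_weighted_polarity": polarity
        * (favorite_count + retweet_count),
    }


# A hypothetical negative tweet with 40 retweets and 10 favorites.
features = weighted_polarities(polarity=-0.5, retweet_count=40, favorite_count=10)
print(features)
```

Because all three derived features are multiples of the same polarity, they are strongly related to one another, which is consistent with the observation that they show a similar correlation with the label.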

### 4.4 Rumor Text Classification Using Ensemble Probabilistic Features

The last three models were combined to predict the probabilistic weight of each model on the complete dataset and treated as separate features to train the model. These weights determine the extent to which the rumor classification task is dependent on the models, which in turn provide the weight required to understand the importance of different tweet attributes. These probabilistic features were input into a sequential TensorFlow model for rumor and non-rumor classification. The effectiveness of this analysis was that the data input was applied to multiple hidden layers (Figure 4).
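The stacking idea can be illustrated with a deliberately small sketch. Each tweet is represented by the rumor probabilities predicted by the three base models, and a meta-classifier is trained on those three numbers. Note the substitution: the study feeds these features into a layered neural network, whereas the sketch below uses a single logistic unit trained by gradient descent to stay short. All probabilities, labels, and helper names are made up for illustration.

```python
import math


def stack_features(p_text, p_meta, p_reaction):
    """One row per tweet: the three base-model rumor probabilities."""
    return [list(row) for row in zip(p_text, p_meta, p_reaction)]


def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mean_bce(weights, bias, rows, labels):
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict(weights, bias, x)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(rows)


def train_meta_classifier(rows, labels, lr=0.5, epochs=500):
    """Logistic regression by batch gradient descent (stand-in for the network)."""
    weights, bias, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical base-model outputs for six tweets and their true labels.
p_text = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
p_meta = [0.7, 0.6, 0.4, 0.3, 0.4, 0.6]
p_reaction = [0.5, 0.6, 0.5, 0.4, 0.5, 0.5]
labels = [1, 1, 0, 0, 1, 0]

rows = stack_features(p_text, p_meta, p_reaction)
initial_loss = mean_bce([0.0, 0.0, 0.0], 0.0, rows, labels)
weights, bias = train_meta_classifier(rows, labels)
final_loss = mean_bce(weights, bias, rows, labels)
print(weights, bias, initial_loss, final_loss)
```

In this simplified setting, the learned weights give a rough indication of how much the final decision leans on each base model, which parallels the way the ensemble is used to judge the importance of the text, metadata, and reaction features.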

Experimental Setup: In this research, the PHEME [14,15], Twitter15 [17], and Twitter16 [17] datasets were used to evaluate all the models. A total of 6,915 tweets were chosen from the datasets for this experiment, with 85% of the tweets used to train the model and 15% used for testing. These tweets were labeled as rumor or non-rumor, with 3,815 tweets in the non-rumor category and 2,281 in the rumor category; the remaining 819 tweets had an unverified label and were dropped from the experiment. The rumors were labeled as true rumors (1,309) and false rumors (972); true rumors were found to contain correct information at a later time but initially started spreading as rumors. False rumors were identified as incorrect information or deliberate misinformation after extensive verification, either by a third-party organization such as Politifact or Gossipcop or through flagging by users. Tweets labeled unverified were tweets whose veracity is still questionable and still being verified.

### Experiment 1

This experiment was built on a Twitter text categorization model that was applied to word vectors using deep learning methodologies. The model consisted of a dropout layer with a value of 0.5, four Bi-LSTM layers with 64 nodes, one Bi-LSTM layer with 32 nodes, and a dense layer with 32 nodes and the ReLU activation function, followed by a batch normalization layer, a dropout layer with a value of 0.5, and finally a dense layer with 1 node and a sigmoid activation function. The model was compiled with the Adam optimizer, and binary cross-entropy was used as the loss (Figure 5).

### Result 1

This model was trained using various parameters taken from various layers of the model (Table 3). Bi-LSTM deep learning models were used to assess the results. We calculated the loss values and the model training and testing accuracies after 150 epochs. The results indicate that the model was well trained, with an accuracy of 88.6% and a testing data accuracy of 80.1% (Table 4).

### Experiment 2

This experiment was based on a Twitter rumor text classification model using neural network-based approaches applied to Twitter user metadata features. A total of 27 different features were used for experimental purposes (Table 5). First, log transformation was applied for data transformation. Second, we computed the correlation between these features using the Pearson correlation method and selected 18 important features (Table 6). These features were then analyzed using the sequential TensorFlow model of the neural network. The first layer was a dense layer with 64 nodes, the ReLU activation function, and L2 regularization with a value of 0.01. It was followed by a dropout layer with a value of 0.3, a dense layer with 64 nodes, the ReLU activation function, and L2 regularization with a value of 0.01, and then another dropout layer with a value of 0.3. Finally, a dense layer with a sigmoid activation function and the glorot uniform weight initializer was used for training the model. The model was compiled with the Adam optimizer, and binary cross-entropy was used as the loss (Figure 6).

### Result 2

This model was trained using various parameters retrieved from various layers of the model (Table 7). Sequential TensorFlow models were used to assess results. We calculated the loss values and the model training and testing accuracies after 150 epochs. The results show that the model was well trained, with an accuracy of 77.5% and a testing data accuracy of 74.5% (Table 4).

### Experiment 3

This experiment was based on a Twitter rumor text classification model using neural network-based approaches applied to Twitter reaction features. In the first step, the tweet texts were cleaned in the same way as for text classification using word vectors, and then the tweet data were lemmatized. In the second step, we computed the polarity of the tweets and the weighted polarities by multiplying the polarity by the retweet count, the favorite count, and the sum of the favorite and retweet counts (Table 8). After performing feature selection, it was observed that favorite-weighted polarity, retweet-weighted polarity, and favorite retweet-weighted polarity had a similar negative correlation with the label. Finally, in reaction-based feature selection, the favorite count, sentiment (polarity), and number of hashtags were used for classification (Table 9). The generated features were input into the sequential TensorFlow model. The model consisted of a dense layer with 32 nodes and a ReLU activation function, a dropout layer with a value of 0.3, a dense layer with 16 nodes and a ReLU activation function, a dropout layer with a value of 0.2, a dense layer with 16 nodes and a ReLU activation function, and a final dense layer with a sigmoid activation function. The model was compiled with the Adam optimizer, and binary cross-entropy was used as the loss (Figure 7).

### Result 3

This model was trained using various parameters retrieved from various layers of the model (Table 10). Sequential TensorFlow models were used to assess results. We calculated the loss values and the model training and testing accuracies after 150 epochs. We found that the model was well trained, with an accuracy of 63.5% and a testing data accuracy of 63.6% (Table 4).

### Experiment 4

This experiment predicted the probabilities from each of the last three models; these predicted probabilities were treated as separate features to train the model. This experiment was based on a Twitter rumor text classification model using sequential TensorFlow neural network-based approaches applied to Twitter ensemble probabilistic weights of the three proposed models. Ensemble probabilistic weight features were input into the sequential TensorFlow model. In this process, the model was trained with a dense layer with 64 nodes, a ReLU activation function, a dropout layer with a value of 0.3, a dense layer with 64 nodes, a ReLU activation function, a dropout layer with a value of 0.3, a dense layer with 1 node, a sigmoid activation function, and a glorot uniform weight initializer. This process was compiled with the Adam optimizer, and the loss was compiled with binary cross-entropy (Figure 8).

### Result 4

This model was trained using various parameters taken from various layers of the model (Table 11). Sequential TensorFlow models were used to assess results. We calculated the loss values and the model training and testing accuracies after 150 epochs. From the results, the model was well trained with an accuracy of 88.9% and testing data accuracy of 82.5% (Table 4).

### Experiment 5

This experiment was based on the prediction of unseen test data. A total of 1,038 unseen tweets were selected from the PHEME, Twitter15, and Twitter16 datasets. The results were evaluated with all the models, and their accuracy was determined (Figure 9).

This study compared the results of four different rumor classification models. The results of the first experiment with rumor text classification using the word-embedding model suggested the importance of sequential text, with 80.1% validation accuracy. Experiments 2 and 3 showed the importance of user metadata and reaction features of the tweet with validation accuracies of 74.5% and 63.6%, respectively. Finally, Experiment 4 suggested the importance of all combined probabilistic features of the different models. These ensemble features provided a validation accuracy of 82.5%, which was higher than those of all the other models (Figure 9). In Experiment 5, we used unseen data and obtained the most accurate results from the ensemble probabilistic features model (Figure 10).

All models were combined and trained on the dataset to obtain their final weights. These weights indicate the extent to which the rumor classification task depends on each model, which in turn shows the importance of the different tweet attributes. The main contributions of this research are as follows: (1) proposing a feature extraction mechanism and preprocessing techniques to make data more relevant; (2) constructing a hypothesis based on metadata in the user and tweet datasets and mapping the relationship to the problem; (3) correlating derived and extracted features to actualize attribute relationships and reduce dimensionality for effective resource utilization; (4) applying ensemble strategies for merging models as well as training and assessing the final model for a general answer; (5) formulating the relevance of feature variables by understanding their weight distribution; (6) defining loss metrics for the dichotomous rumor data categorization problem; and (7) suggesting that the accuracy of the ensemble model can be improved through live training and manual weight adjustment.

To understand the importance of the user metadata features in an ensemble model, the built-in feature importance function of the decision tree classifier was used to obtain the feature importance scores (Figures 11 and 12).

This study compared four different models using different features of tweet data. The ensemble probabilistic features rumor classification model provided better training accuracy, validation accuracy, and loss values than the individual feature models. Furthermore, we tested the models on unseen data and concluded that the ensemble model overestimates the importance of the tweet text model because of the limited availability of data. This study demonstrated which features play a significant role in identifying tweets as rumors or non-rumors. Future work on improving these models will include live training of ensemble feature models using real-time data to increase the accuracy of the model.

Fig. 1.

Rumor text classification using word vector features.

Fig. 2.

Rumor text classification using user metadata features.

Fig. 3.

Rumor text classification using reaction features.

Fig. 4.

Rumor text classification using ensemble features.

Fig. 5.

Architecture for rumor text classification using word vectors.

Fig. 6.

Fig. 7.

Architecture for rumor text classification using Twitter reaction features.

Fig. 8.

Architecture for rumor text classification using ensemble probabilistic features.

Fig. 9.

Accuracy analysis of unseen data.

Fig. 10.

Training accuracy and validation accuracy comparisons of all models.

Fig. 11.

Fig. 12.

Feature importance analysis of the ensemble model.

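Table 1.

| Statistic | Twitter15 | Twitter16 |
|---|---|---|
| # of users | 276,663 | 173,487 |
| # of source tweets | 1,490 | 818 |
| # of non-rumors | 374 | 205 |
| # of false rumors | 370 | 205 |
| # of true rumors | 372 | 205 |
| # of unverified rumors | 374 | 203 |
| Avg. time length (hr) | 1,337 | 848 |
| Avg. # of posts | 223 | 251 |
| Max # of posts | 1,768 | 2,765 |
| Min # of posts | 55 | 81 |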

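Table 2. Statistics of PHEME dataset.

| Event | Threads | Tweets | Rumors | Non-rumors | True | False | Unverified |
|---|---|---|---|---|---|---|---|
| Charlie Hebdo | 2,079 | 38,268 | 458 | 1,621 | 193 | 116 | 149 |
| Sydney siege | 1,221 | 23,996 | 522 | 699 | 382 | 86 | 54 |
| Ferguson | 1,143 | 24,175 | 284 | 859 | 10 | 8 | 266 |
| Ottawa shooting | 890 | 12,284 | 470 | 420 | 329 | 72 | 69 |
| Germanwings crash | 469 | 4,489 | 238 | 231 | 94 | 111 | 33 |
| Putin missing | 238 | 835 | 126 | 112 | 0 | 9 | 117 |
| Prince Toronto | 233 | 902 | 229 | 4 | 0 | 222 | 7 |
| Gurlitt | 138 | 179 | 61 | 77 | 59 | 0 | 2 |
| Ebola Essien | 14 | 226 | 14 | 0 | 0 | 14 | 0 |
| Total | 6,425 | 105,354 | 2,402 | 4,023 | 1,067 | 638 | 697 |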

Table 3. Parameters received from different layers of rumor text classification model using word vectors.

| Layer (type) | Output shape | Param # |
|---|---|---|
| embedding (Embedding) | (None, 93, 300) | 2,724,600 |
| dropout (Dropout) | (None, 93, 300) | 0 |
| bidirectional (Bidirectional LSTM) | (None, 93, 128) | 186,880 |
| bidirectional_1 (Bidirectional LSTM) | (None, 93, 128) | 98,816 |
| bidirectional_2 (Bidirectional LSTM) | (None, 93, 128) | 98,816 |
| bidirectional_3 (Bidirectional LSTM) | (None, 93, 128) | 98,816 |
| bidirectional_4 (Bidirectional LSTM) | (None, 64) | 41,216 |
| dense (Dense) | (None, 32) | 2,080 |
| batch_normalization (BatchNormalization) | (None, 32) | 128 |
| dropout_1 (Dropout) | (None, 32) | 0 |

Total params: 6,317,985
Trainable params: 526,721
Non-trainable params: 5,791,264

Table 4. Loss and accuracy comparisons of different rumor text classification models.

| Model | Training loss | Validation loss | Training accuracy (%) | Validation accuracy (%) |
|---|---|---|---|---|
| Rumor text classification using word vector features | 0.274 | 0.449 | 88.6 | 80.1 |
| Rumor text classification using user metadata features | 0.469 | 0.550 | 77.5 | 74.5 |
| Rumor text classification using reaction features | 0.644 | 0.647 | 63.5 | 63.6 |
| Rumor text classification using ensemble probabilistic features | 0.270 | 0.430 | 88.9 | 82.5 |

Table 5. Tweet user metadata features.

| Numerical features | Categorical features |
|---|---|
| Retweet count | Verified |
| Favorite count | Is_quote_status |
| No. of symbols | Profile_image_url |
| No. of user mentions | profile_background_image_url |
| No. of hashtags | Default profile image |
| No. of URLs | Default profile |
| Polarity | Profile_use_background_image |
| Text length | Has_location |
| Post age | Has_url |
| Status count | |
| Friends count | |
| Favorites count of user | |
| Listed count | |
| Account age | |
| Screen name length | |
| The time gap between user-created time and tweeted time | |

Table 6. Important features of tweet user metadata extracted after applying the standard scaler method.

| Numerical features | Categorical features |
|---|---|
| Text length | Profile_use_background_image |
| No. of URLs | Profile_background_image_url |
| Post age | Default_profile_image |
| Status count | Default_profile |
| Listed count | Verified |
| No. of symbols | |
| Posted_in | |
| No. of user mentions | |
| Polarity | |
| Favorites count of user | |
| Favorites count of a tweet | |
| Screen name length | |
| No. of hashtags | |

Table 7. Parameters received from different layers of rumor text classification model using tweet user metadata features.

| Layer (type) | Output shape | Param # |
|---|---|---|
| dense_1 (Dense) | (None, 64) | 1,216 |
| dropout_1 (Dropout) | (None, 64) | 0 |
| dense_2 (Dense) | (None, 64) | 4,160 |
| dropout_2 (Dropout) | (None, 64) | 0 |
| dense_3 (Dense) | (None, 1) | 65 |

Total params: 5,441
Trainable params: 5,441
Non-trainable params: 0

Table. 8.

Retweet count
Favorite count
Sentiments
No. of hashtags
Favorite weighted polarity
Retweet-weighted polarity
Favorite retweet-weighted polarity

Table. 9.

Table 9. Useful Twitter reaction features after applied feature correlation method.

Favorite count
Sentiments (Polarity)
No. of hashtags

Table 10. Parameters received from different layers of rumor text classification model using tweet reaction features.

| Layer (type) | Output shape | Param # |
|---|---|---|
| dense_1 (Dense) | (4598, 32) | 256 |
| dropout_1 (Dropout) | (4598, 32) | 0 |
| dense_2 (Dense) | (4598, 16) | 528 |
| dropout_2 (Dropout) | (4598, 16) | 0 |
| dense_3 (Dense) | (4598, 16) | 272 |
| dense_4 (Dense) | (4598, 1) | 17 |

Total params: 1,073
Trainable params: 1,073
Non-trainable params: 0

Table 11. Parameters received from different layers of rumor text classification model using ensemble probabilistic features.

| Layer (type) | Output shape | Param # |
|---|---|---|
| dense_1 (Dense) | (None, 64) | 256 |
| dropout_1 (Dropout) | (None, 64) | 0 |
| dense_2 (Dense) | (None, 64) | 4,160 |
| dropout_2 (Dropout) | (None, 64) | 0 |
| dense_8 (Dense) | (None, 1) | 65 |

Total params: 4,481
Trainable params: 4,481
Non-trainable params: 0

1. Shu, K, Sliva, A, Wang, S, Tang, J, and Liu, H (2017). Fake news detection on social media: a data mining perspective. ACM SIGKDD Explorations Newsletter. 19, 22-36. https://doi.org/10.1145/3137597.3137600
2. Chen, T, Li, X, Yin, H, and Zhang, J (2018). Call attention to rumors: deep attention based recurrent neural networks for early rumor detection. Trends and Applications in Knowledge Discovery and Data Mining. Cham, Switzerland: Springer, pp. 40-52 https://doi.org/10.1007/978-3-030-04503-6_4
3. Rath, B, Gao, W, Ma, J, and Srivastava, J (2017). From retweet to believability: utilizing trust to identify rumor spreaders on Twitter. Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Sydney, Australia, pp. 179-186. https://doi.org/10.1145/3110025.3110121
4. Ma, J, Gao, W, and Wong, KF (2017). Detect rumors in microblog posts using propagation structure via kernel learning. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, pp. 708-717.
5. Putra, YA, and Khodra, ML (2016). Deep learning and distributional semantic model for Indonesian tweet categorization. Proceedings of 2016 International Conference on Data and Software Engineering (ICoDSE), Denpasar, Indonesia, pp. 1-6. https://doi.org/10.1109/ICODSE.2016.7936108
6. Ma, J, Gao, W, Mitra, P, Kwon, S, Jansen, BJ, Wong, KF, and Cha, M (2016). Detecting rumors from microblogs with recurrent neural networks. Proceedings of the 25th International Joint Conference on Artificial Intelligence, New York, NY, pp. 3818-3824.
7. Guo, H, Cao, J, Zhang, Y, Guo, J, and Li, J (2018). Rumor detection with hierarchical social attention network. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, pp. 943-951. https://doi.org/10.1145/3269206.3271709
8. Wu, K, Yang, S, and Zhu, KQ (2015). False rumors detection on Sina Weibo by propagation structures. Proceedings of 2015 IEEE 31st International Conference on Data Engineering, Seoul, Korea, pp. 651-662. https://doi.org/10.1109/ICDE.2015.7113322
9. Wu, L, Li, J, Hu, X, and Liu, H (2017). Gleaning wisdom from the past: early detection of emerging rumors in social media. Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, TX, pp. 99-107. https://doi.org/10.1137/1.9781611974973.12
10. Margolin, DB, Hannak, A, and Weber, I (2018). Political fact-checking on Twitter: when do corrections have an effect?. Political Communication. 35, 196-219. https://doi.org/10.1080/10584609.2017.1334018
11. Nguyen, TN, Li, C, and Niederee, C (2017). On early-stage debunking rumors on twitter: leveraging the wisdom of weak learners. Social Informatics. Cham, Switzerland: Springer, pp. 141-158 https://doi.org/10.1007/978-3-319-67256-4_13
12. Zubiaga, A, Liakata, M, and Procter, R (2016). Learning reporting dynamics during breaking news for rumour detection in social media. Available: https://arxiv.org/abs/1610.07363
13. Castillo, C, Mendoza, M, and Poblete, B (2013). Predicting information credibility in time-sensitive social media. Internet Research. 23, 560-588. https://doi.org/10.1108/IntR-05-2012-0095
14. Fard, AE, Mohammadi, M, Chen, Y, and Van de Walle, B (2019). Computational rumor detection without non-rumor: a one-class classification approach. IEEE Transactions on Computational Social Systems. 6, 830-846. https://doi.org/10.1109/TCSS.2019.2931186
15. Alsaeedi, A, and Al-Sarem, M (2020). Detecting rumors on social media based on a CNN deep learning technique. Arabian Journal for Science and Engineering. 45, 10813-10844. https://doi.org/10.1007/s13369-020-04839-2
16. Alkhodair, SA, Ding, SH, Fung, BC, and Liu, J (2020). Detecting breaking news rumors of emerging topics in social media. Information Processing & Management. 57. article no 102018
17. Zhao, Z, Resnick, P, and Mei, Q (2015). Enquiring minds: early detection of rumors in social media from enquiry posts. Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, pp. 1395-1405. https://doi.org/10.1145/2736277.2741637
18. Asghar, MZ, Habib, A, Habib, A, Khan, A, Ali, R, and Khattak, A (2021). Exploring deep neural networks for rumor detection. Journal of Ambient Intelligence and Humanized Computing. 12, 4315-4333. https://doi.org/10.1007/s12652-019-01527-4

Amit Kumar Sharma received his M. Tech Degree in computer science and engineering with specialization in information security from Central University of Rajasthan, India. He is currently pursuing the Ph.D. degree in the Department of Computer Science & Engineering, Manipal University Jaipur, India. His research interests are in the fields of machine learning, deep learning, and text analysis.

E-mail: amitchandnia@gmail.com

Rakshith Alaham Gangeya is currently pursuing B. Tech in the field of computer science from Manipal University Jaipur, Rajasthan, India. Data science, machine learning, deep learning, web development, blockchain, applied statistics and probability, quantitative analysis, natural language processing, and game theory are some of his research interests.

Harshit Kumar is currently pursuing B. Tech in the field of computer science from Manipal University Jaipur, Rajasthan, India. His research interests include natural language processing, computer vision, blockchain, machine learning, deep learning, cyber security, quantum computing, and cloud computing.

Sandeep Chaurasia (Senior Member, IEEE) is currently a Professor with the Department of CSE, School of Computing and I.T., Manipal University Jaipur, India. He has more than 12 years of rich experience in academics and one year in the industry. He has over 30 papers published in international and national journals, as well as in conference proceedings. His research interests are in the fields of machine learning, soft computing, deep learning, and AI.

Devesh Kumar Srivastava is a Professor in the Department of Information Technology, School of Computing & Information Technology, Manipal University Jaipur, India. His current research interests include software engineering, operating systems, data mining, big data, DBMS, web tech, computer architecture, computer networks, and Oops Technology.

E-mail: devesh988@yahoo.com


### Abstract

Today, social media has evolved into user-friendly and useful platforms to spread messages and receive information about different activities. Social media users encounter billions of messages daily, and checking the credibility of such information is a challenging task. False information or rumors are spread to misguide people. Previous approaches utilized the help of users or third parties to flag suspect information, but this was highly inefficient and redundant. Moreover, previous studies have focused on rumor classification using state-of-the-art and deep learning methods with different tweet features. This analysis focused on combining three different feature models of tweets and suggested an ensemble model for classifying tweets as rumor and non-rumor. In this study, the experimental results of four different rumor and non-rumor classification models, which were based on a neural network model, were compared. The strength of this research is that the experiments were performed on different tweet features, such as word vectors, user metadata features, reaction features, and ensemble model probabilistic features, and the results were classified using layered neural network architectures. The correlations of the different features determine the importance and selection of useful features for experimental purposes. These findings suggest that the ensemble model performed well, and provided better validation accuracy and better results for unseen data.

Keywords: Rumor, Word embedding, User metadata features, Reaction features, Ensemble probabilistic features, Feature scaling, Correlation, Bi-LSTM, Neural network

### 1. Introduction

A rumor is described as the circulation of false messages of events, objects, and issues to a group of people. Social media is the easiest platform for propagating rumors or false information in the public [14]. Twitter has evolved as a significant source of information sharing across all social media platforms where, on average, millions of tweets are shared per day, and each user can tweet up to 140-character long messages or share files, such as pictures or videos. These tweets sometimes contain relevant information, such as public announcements by governing organizations and sharing of events, personal experiences, beliefs, and thoughts. Camouflaged among these data, rumors are widely spreading and destroying the credibility of this information. This problem is elevated owing to crowdsourcing, where a rumor can be backed by many users because of their lack of knowledge regarding the matter and beliefs. To address this issue, much work has been done in the research community to find a prominent solution. This research attempts to understand the root causes of rumor generation on Twitter and which features may help classify rumors. In previous studies, machine learning methods, such as logistic regression, decision tree, support vector machine (SVM), naive Bayes, and k-nearest neighbor, have been widely used for text categorization [1,5]. Word-embedding approaches have been widely used for word vector generation [5]. In this context, deep learning methods including recurrent neural networks (RNN), long short-term memory (LSTM), bidirectional LSTMs (Bi-LSTM), gated recurrent units, and convolutional neural networks are also being applied in the field of text analysis [6]. Previous research has focused on word vector features, and recently, researchers have focused on other features of tweets. In this manner, user metadata and reaction (sentiment) features are used to obtain accurate results. 
User metadata have numerical and categorical features, which are important for identifying rumors. These features take values on different scales and must therefore be scaled. Log transformation and standardization are widely used for feature scaling. The correlation between features or attributes helps identify those that are useful for classification. In this study, four models that use different features of tweet data and classify tweets as rumor or non-rumor using neural network architectures were analyzed. All models were developed as supervised learning tasks on publicly available Twitter datasets, from which tweet IDs and labels were gathered; the tweets were then collected in JSON format through the Twitter API using the respective tweet IDs.

Previous studies mainly focused on text analysis with sequential text features for classifying rumors and non-rumors. They left open research questions, such as how to automatically identify rumor text in large unstructured data and what such research should focus on. This study addresses these questions by considering reaction and user metadata features along with the text features or attributes of tweets. Previous research was based on word-embedding features of the text, and some researchers suggested sentiment features of the post; in this study, additional features were used as input to the neural network architecture. Because the metadata features, reaction features, and sentences cannot be combined directly, we trained a separate model on each set of features; we then combined the probabilistic weights of the different models and trained the ensemble model on the same dataset.

The primary goal of this research is to identify the tweet attributes or features, such as tweet text, user information, and other tweet metadata, by which the veracity of a tweet can be determined, and to quantify the numerical importance of these features. A further goal was to develop an automated method for accurately identifying rumors and to compare it with existing models that employ a variety of combined tweet features, including word embedding, reaction, and metadata features. This work focuses on finding attribute importance for classification, whereas previous work aimed to find better classification techniques.

The objective of this study was to identify important features to distinguish rumor-spreading tweets from authentic tweets. For this purpose and because the dataset comprised various data formats, four models were developed. (1) Bi-LSTM model for tweet text, which was mapped to the target class to check how much tweet text can classify tweets as rumors. (2) Neural model for tweet metadata trained with attributes such as user information (e.g., user account age and user follower count) and tweet information (such as count, retweet count, share count, and tweet text sentiment), which was also mapped to the target class to check how much tweet metadata can classify tweets as rumors. (3) Bi-LSTM model for comments on tweets; for each tweet, the first 20 comments were collected and used to train the model to understand the impact of comments on tweet rumor classification. (4) Ensemble model to merge the above models and train them with the target class to determine whether combining models trained on different parts of tweets has an effect.

The remainder of this paper is organized as follows: Section 2 contains related works. The conceptual underpinning of this study is presented in Section 3, which also provides a conceptual background for areas related to model development. Section 4 defines the loss function, correlation functions, weight normalization techniques, and other methods used in model development. It includes the mathematical equations of the techniques and the justification of their use in our model development. Finally, the proposed alternative methods of rumor classification are also presented in Section 4. The experimental evaluation is presented in Section 5, and an analysis of the results is presented in Section 6. Section 7 concludes the paper by outlining the future goals of the research.

### 2. Related Work

In previous studies, models for rumor detection using state-of-the-art deep learning approaches with different text features, metadata features, and user profile features were proposed. In [1], future research directions on rumor detection that were data-oriented, feature-oriented, model-oriented, and application-oriented were suggested. To learn hidden representations, researchers [6] suggested a model based on RNNs. With additional hidden layers, experimental results on datasets from two real-world microblog platforms showed that the RNN outperformed state-of-the-art rumor detection models that used handcrafted features and that the RNN-based method detected rumors more quickly and accurately than existing techniques. A machine learning-based approach for the automatic detection of users distributing rumor information was proposed in [3]. The study used the concept of believability and the retweet network with word embedding to capture user trust. In another study [7], researchers combined feature engineering with a deep learning model and developed a hierarchical social attention model that improved rumor detection. In [8], a hybrid SVM classifier was proposed based on a graph kernel that captures high-order propagation patterns as well as semantic information, such as topics and sentiments. A deep attention model based on RNNs was presented in [2] to learn hidden representations of consecutive posts for spotting rumors. The proposed model used soft attention in the recurrence to extract different features with a specific emphasis while simultaneously producing hidden representations that capture contextual fluctuations of relevant posts over time. Another study [9] investigated whether historical data-based knowledge could aid in the detection of freshly formed rumors.
The authors found and applied patterns from prior labeled data to assist in detecting emerging rumors because a disputed factual claim arouses specific behaviors, such as curiosity, skepticism, and astonishment. In [10], experiments were conducted to craft new features representing rumor diffusion and use them to correlate the different features. For the early detection of rumors, a study [10] proposed a probabilistic model based on eminent features, such as user and linguistic features. In [11], a method for early rumor identification was proposed that used convolutional neural networks to learn the hidden representations of individual rumor-related tweets and gain insights into their believability. Another study [12] presented a rumor detection approach that learned from the sequential reporting of news on social media to detect rumors in fresh stories; the proposed classifier outperformed a state-of-the-art rumor detection system. Furthermore, researchers [4] presented the propagation tree kernel, a kernel-based technique that captures high-order patterns distinguishing different types of rumors by comparing their propagation tree structures. The proposed kernel-based methodology outperformed state-of-the-art rumor detection algorithms and detected rumors faster and more accurately. Previous research has shown the importance of different social media features in obtaining more accurate rumor classification results. A study [13] suggested a statistical and automatic approach based on a wide range of attributes. In this approach, user profile attributes are used to develop a supervised learning model for rumor detection.

### 3.1 Data Pre-processing

Real-world databases are highly susceptible to noisy, missing, incomplete, and inconsistent data owing to their large size and origin from various heterogeneous sources. Low-quality data result in low-quality mining outcomes. Data cleansing, integration, transformation, and reduction are among the available data pre-processing techniques. In this research, many tweets contained noise and inconsistencies; for example:

• Many tweet texts and comments came from different origins and contained numerical data, emoji, and shortened words, which had to be replaced with the correct words.

• Most metadata were skewed.

• Text was vectorized using word embedding.

### 3.2 Features of Tweets

A tweet has different types of features that are used for information extraction. In previous studies, tweet text features using word-embedding methods were used to generate word vectors and analyze the results. In the present study, the focus was on tweet user features [7], tweet metadata features, and sentimental features (follower count, reply count, retweet count, verified, favorite count, number of symbols, profile image URL, user mentions, number of hashtags, default profile image, number of URLs, polarity, text length, location, post age, status count, friend count, account age, favorite weighted polarity, retweet-weighted polarity, and favorite retweet-weighted polarity) [4,14].

### 3.3 Word Embedding

In tweet text analysis, word-embedding models are used to convert tweets into vectors, which serve as input to text classification methods. In previous studies, TF-IDF, word2vec, GloVe, and fastText representations have been widely used for text classification. In this study, the fastText word-embedding model was used for word vector generation [5,15,17]. fastText is an extension of the word2vec model; instead of learning a vector for each word directly, it represents each word as a bag of character n-grams.
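To illustrate the character n-gram idea behind fastText, the sketch below splits a word into overlapping n-grams after adding boundary markers. It is illustrative only: the function name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters mirrors the commonly cited fastText defaults rather than settings reported in this study.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("rumor", n_min=3, n_max=3)
print(trigrams)
```

In fastText, a word vector is built from the vectors of such n-grams, which lets the model produce vectors for misspelled or previously unseen words, a common situation in tweets.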

### 3.4 Data Scaling

For the non-categorical skewed features, the log-transform and standard scaler methods were used. The log-transform method brings the distribution of skewed data closer to a normal distribution for more reliable prediction. The standard scaler method standardizes a feature by subtracting the mean and dividing all values by the standard deviation to obtain unit variance.
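The two scaling steps can be sketched with the Python standard library as follows. This is a minimal illustration under stated assumptions: the use of `log(1 + x)` (to handle zero counts) and the sample follower counts are our own choices, not values from the study.

```python
import math
from statistics import fmean, pstdev


def log_transform(values):
    """Apply log(1 + x) to compress the long right tail of count features."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


follower_counts = [3, 12, 150, 4200, 98000]  # made-up, heavily skewed counts
scaled = standardize(log_transform(follower_counts))
print(scaled)
```

After both steps, the feature has zero mean and unit variance, so features with very different raw ranges contribute on a comparable scale to the dense layers.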

### 3.5 Correlation Methods

To measure the correlation between user metadata features or attributes, the point-biserial correlation, a special case of the Pearson correlation, was used. It determines the correlation between a continuous variable and a binary variable (Eq. 1).

$r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{pq},$

where $M_1$ represents the mean of the positive binary variable group, $M_0$ represents the mean of the group with a negative binary variable, $s_n$ represents the standard deviation of the continuous variable, $p$ represents the proportion of cases in the “0” group, and $q$ represents the proportion of cases in the “1” group. The resulting correlation coefficient ranges between -1 and 1. If $r_{pb} > 0$, then it signifies a positive correlation between the features.
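The point-biserial coefficient can be computed directly from its definition. The sketch below is illustrative: the function name and the sample data are hypothetical, and the population standard deviation is used for the continuous variable.

```python
import math
from statistics import fmean, pstdev


def point_biserial(values, labels):
    """Point-biserial correlation between a continuous feature and 0/1 labels."""
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    p = len(group0) / len(values)  # proportion of cases in the "0" group
    q = len(group1) / len(values)  # proportion of cases in the "1" group
    s_n = pstdev(values)           # population standard deviation of all values
    return (fmean(group1) - fmean(group0)) / s_n * math.sqrt(p * q)


text_length = [120, 95, 140, 60, 45, 70]  # made-up feature values
is_rumor = [1, 1, 1, 0, 0, 0]             # made-up binary labels
r_pb = point_biserial(text_length, is_rumor)
print(round(r_pb, 4))
```

The result coincides with the ordinary Pearson correlation computed on the same two columns, which is why the point-biserial coefficient is usually described as a special case of Pearson correlation.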

### 3.6 Activation Function in Layered Architecture

The activation function of a layer determines the output of the layer given a set of inputs. The rectified linear unit (ReLU) and sigmoid activation functions were used in the implemented neural network models. The ReLU function returns its input directly if the input is positive; otherwise, it returns 0. Its output range is [0, ∞) (Eq. 2).

$y = \begin{cases} 0 & \text{for } x \le 0 \\ x & \text{for } x > 0 \end{cases}.$

The sigmoid function is a real, bounded, and differentiable function that is defined for all real input values, and its output range is (0, 1) (Eq. 3).

$y = \frac{1}{1 + e^{-x}}.$
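Both activation functions are short enough to write out directly. The sketch below is a plain-Python illustration; the branch on the sign of `x` in `sigmoid` is our own addition to avoid floating-point overflow for large negative inputs.

```python
import math


def relu(x: float) -> float:
    """Return x for positive inputs and 0 otherwise."""
    return x if x > 0 else 0.0


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x)), evaluated in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, rewrite as exp(x) / (1 + exp(x)) so exp never overflows.
    e = math.exp(x)
    return e / (1.0 + e)


print(relu(-2.0), relu(3.5), sigmoid(0.0))
```

In the models described here, ReLU is used in the hidden layers, while the sigmoid in the final single-node layer maps the output to a value that can be read as the probability of the rumor class.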

### 3.7 Weight Initialization

Weight initialization is the procedure for setting the initial weights of a neural network, and it was used in the implemented neural network models. Two weight initialization methods were considered: He-normal and Glorot-normal. In He-normal, the weights are initialized by considering the size of the previous layer, which helps achieve faster and more accurate gradient descent. It draws weights from a normal distribution centered on 0 with the standard deviation given by the formula below. For the ReLU function,

$Y = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n,$

$\sigma = \sqrt{\frac{2}{fan_{in}}},$

where $Y$ is the output of the layer before applying the activation function, $(w_1, w_2, w_3, \dots, w_n)$ are the weights of the respective nodes $(x_1, x_2, x_3, \dots, x_n)$, and $fan_{in}$ is the number of input units.

In Glorot-normal, both $fan_{in}$ and $fan_{out}$ are used to determine the initial weights of the nodes. It draws weights from a normal distribution centered on 0 with the standard deviation given by the formula below:

$\sigma = \sqrt{\frac{2}{fan_{in} + fan_{out}}},$

where $fan_{in}$ is the number of input units and $fan_{out}$ is the number of output units. A Glorot-normal function was used with sigmoid activation functions in the implemented neural networks.
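The two standard deviations can be compared numerically with a short sketch. The helper names and the example layer sizes (64 inputs, 64 or 1 outputs) are illustrative assumptions; library implementations may differ in detail, for example by truncating the normal distribution.

```python
import math
import random


def he_normal_std(fan_in: int) -> float:
    """Standard deviation for He-normal initialization."""
    return math.sqrt(2.0 / fan_in)


def glorot_normal_std(fan_in: int, fan_out: int) -> float:
    """Standard deviation for Glorot-normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_weights(fan_in: int, fan_out: int, std: float, seed: int = 0):
    """Draw a fan_in x fan_out weight matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


# Example: a 64 -> 64 ReLU layer and a 64 -> 1 sigmoid output layer.
hidden_std = he_normal_std(64)
output_std = glorot_normal_std(64, 1)
output_weights = init_weights(64, 1, output_std)
print(hidden_std, output_std)
```

Both schemes shrink the initial weights as layers get wider, which keeps the variance of the pre-activation sum $Y$ roughly stable from layer to layer.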

### 3.8 Batch Normalization

Batch normalization is a technique that transforms the data in such a way that the mean becomes 0 and the standard deviation becomes 1. When the input to the first layer of the neural network is already normalized, the output from the activation functions may no longer be on the same scale, resulting in internal covariate shifts in the data. To avoid this problem, batch normalization was applied to each input. Batch normalization is a two-step process in which the input is normalized, and rescaling and offsetting are performed later. In this method, we first computed the mean using the input from the ith layer (Eq. 7):

$\mu\ (\text{mean})=\frac{1}{m}\sum_{i=1}^{m}h_i,$

where $h_i$ are the inputs from layer $h$, and $m$ is the number of neurons in layer $h$. Second, the standard deviation of the hidden activations is calculated using Eq. 8.

$\sigma\ (\text{standard deviation})=\left[\frac{1}{m}\sum_{i=1}^{m}(h_i-\mu)^2\right]^{1/2}.$

Subsequently, the mean is subtracted from each input, and the result is divided by the sum of the standard deviation and a smoothing term ɛ to obtain the normalized data (Eq. 9).

$h_{i(\text{norm})}=\frac{h_i-\mu}{\sigma+\epsilon}.$

Finally, the rescaling parameter γ and shifting parameter β were incorporated (Eq. 10).

$h_i=\gamma h_{i(\text{norm})}+\beta.$
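The four steps above (mean, standard deviation, normalization, rescaling and shifting) can be sketched as a single function. This is a minimal illustration that follows the equations as written in this section, dividing by the standard deviation plus the smoothing term; many libraries instead add the smoothing term to the variance under the square root. The function name and the default values of `gamma`, `beta`, and `eps` are assumptions.

```python
import math


def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of activations, then rescale and shift them."""
    m = len(h)
    mu = sum(h) / m                                        # mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / m)   # standard deviation
    normalized = [(x - mu) / (sigma + eps) for x in h]     # zero mean, unit scale
    return [gamma * x + beta for x in normalized]          # rescale and shift


activations = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(activations))
print(batch_norm(activations, gamma=2.0, beta=0.5))
```

With `gamma=1.0` and `beta=0.0` the output has approximately zero mean and unit standard deviation; during training these two parameters are learned, so the network can undo the normalization if that helps.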

### 3.9 Loss Function

The loss function measures the error between the predicted and desired outputs. Cross-entropy is the primary loss function used for classification problems [2,17]. In this study, tweets were categorized as rumor or non-rumor; hence, binary cross-entropy was used (Eq. 11).

$\text{Loss}=-y\log(\hat{y})-(1-y)\log(1-\hat{y}),$

where y is the desired output (the true label), and ŷ is the predicted output (the predicted probability of the rumor class).
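A small worked example may make the behavior of binary cross-entropy concrete. The sketch below averages the loss over a batch and clips the predicted probabilities away from 0 and 1 so that the logarithm stays finite; the names `y_true`, `y_pred`, and the clipping constant `eps` are assumptions added for illustration.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for label, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
        total += -label * math.log(prob) - (1 - label) * math.log(1.0 - prob)
    return total / len(y_true)


labels = [1, 0, 1, 0]                  # 1 = rumor, 0 = non-rumor
confident = [0.9, 0.1, 0.8, 0.2]       # mostly correct predictions
uncertain = [0.5, 0.5, 0.5, 0.5]       # uninformative predictions
print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, uncertain))
```

The loss is small when the predicted probability is close to the label and grows quickly for confident mistakes, which is why it is paired with a sigmoid output.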

### 3.10 Bi-LSTM Deep Classification

A Bi-LSTM is a sequence-processing model that consists of two LSTMs: one that reads the input in the forward direction and another that reads it in the backward direction. Bi-LSTMs effectively increase the amount of information available to the network and give the algorithm more context, because the words that come both before and after a given word in a sentence are taken into account. This is an important method in text classification and has been used for sequential learning [7,18].
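The idea of reading a sequence in both directions can be illustrated with a toy example. The sketch below is not an LSTM: a one-unit tanh recurrent cell with fixed, hypothetical weights stands in for each LSTM, so that only the bidirectional wiring is shown.

```python
import math


def run_direction(inputs, w_in=0.5, w_rec=0.5):
    """Run a one-unit tanh recurrent cell over the sequence; return every state."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional(inputs):
    """Pair the forward state with the backward state at each position."""
    forward = run_direction(inputs)
    # Process the reversed sequence, then flip the states back into input order.
    backward = run_direction(inputs[::-1])[::-1]
    return list(zip(forward, backward))


# A signal at the start is visible to later positions only through the forward
# pass; a signal at the end reaches earlier positions only through the backward pass.
print(bidirectional([1.0, 0.0, 0.0]))
print(bidirectional([0.0, 0.0, 1.0]))
```

At the middle position, the forward state is non-zero only in the first sequence and the backward state only in the second, which shows how the combined output carries context from both sides of a word.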

### 3.11 Datasets

In this research, the PHEME, Twitter15, and Twitter16 datasets were used to evaluate all models. The total number of tweets in these datasets is listed in Tables 1 and 2.

The PHEME dataset is a collection of rumors and non-rumors posted on Twitter during breaking news events. It includes rumors about nine different events, and each rumor is annotated with its veracity value, which may be true, false, or unverified (Table 2).

The original data were taken from the PHEME, Twitter15, and Twitter16 datasets, and the tweets were re-fetched over a period of 2 days to obtain the latest metadata values. No specific keyword was used for crawling; the data were fetched from Twitter using the tweet IDs present in the original datasets.

### 4. Proposed Models for Rumor Classification

Previous research has suggested text classification approaches that use different text and metadata features of tweets for rumor detection. In this study, we analyzed the results of four tweet classification models, each of which uses different features. The first model used word vector features, the second used user and metadata features, the third used tweet reaction features, and the fourth was an ensemble model that combined the three models and used the probabilities predicted by each model on the complete dataset as features for tweet classification. The loss and accuracy of these models were calculated using deep learning classification approaches. The classification target in this research was the authenticity of a tweet, that is, whether the tweet was a rumor or not.

### 4.1 Rumor Text Classification using Word Vectors Features

In this model, word vectors were used for text sentence classification. The text was preprocessed by removing hyperlinks, emojis, hashtags, mentions, stop words, and numbers; replacing contracted words with their expanded forms; and replacing abbreviated words with their full forms. After preprocessing, tokenization and fastText word embedding were applied to create the embedding matrix. The generated vectors were fed into the sequential Bi-LSTM deep learning model, where the input data were passed through multiple hidden layers (Figure 1).
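The cleaning steps listed above can be sketched with regular expressions from the Python standard library. The contraction, abbreviation, and stop-word tables below are tiny hypothetical samples, and the regular expressions are assumptions; the lists and rules actually used in the study are not given in the text.

```python
import re

# Hypothetical sample tables; a real pipeline would use much larger lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}
ABBREVIATIONS = {"pls": "please", "ppl": "people", "u": "you"}
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "of", "and", "in", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # digits, punctuation, emojis


def clean_tweet(text):
    """Return the list of cleaned tokens for one tweet."""
    text = text.lower()
    text = URL_RE.sub(" ", text)              # remove hyperlinks
    text = MENTION_HASHTAG_RE.sub(" ", text)  # remove mentions and hashtags
    for short, full in CONTRACTIONS.items():  # expand contractions
        text = text.replace(short, full)
    text = NON_LETTER_RE.sub(" ", text)       # remove numbers and emojis
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean_tweet("pls RT: it's not true!! 12 dead @cnn #news http://t.co/abc"))
```

The resulting tokens would then be mapped to integer indices and looked up in the pretrained embedding matrix before being passed to the Bi-LSTM layers.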

### 4.2 Rumor Text Classification using User Metadata Features

In this model, tweet user metadata features were used for sentence classification. The Twitter features were preprocessed by applying a log transformation, and the correlation between the features was determined using the Pearson correlation method. The generated features were fed into the sequential TensorFlow model, where the input data were passed through multiple hidden layers (Figure 2).
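The two preprocessing steps can be sketched as follows. The sketch assumes a `log(1 + x)` transform so that zero counts remain defined, and it assumes that features are kept when the absolute Pearson correlation with the label exceeds a threshold; the exact transform and selection rule used in the study are not stated, so both are illustrative choices, as are the feature names in the example.

```python
import math


def log_transform(values):
    """Compress heavy-tailed counts with log(1 + x)."""
    return [math.log1p(v) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)


def select_features(columns, labels, threshold):
    """Keep the feature names whose |correlation with the label| >= threshold."""
    selected = []
    for name, values in columns.items():
        if abs(pearson(log_transform(values), labels)) >= threshold:
            selected.append(name)
    return selected


columns = {"status_count": [10, 1000, 20, 5000], "screen_name_length": [5, 5, 5, 5]}
labels = [0, 1, 0, 1]
print(select_features(columns, labels, threshold=0.3))
```

The same `pearson` function can also be applied to pairs of features to detect redundant columns.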

### 4.3 Rumor Text Classification Using Reaction Features

In this model, tweet reaction features were used for sentence classification. In the first step, the tweet texts were cleaned in the same way as for text classification using word vectors, and the tweet data were then lemmatized. In the second step, the polarity of each tweet was computed, and weighted polarities were obtained by multiplying the polarity by the retweet count, the favorite count, and the sum of the favorite and retweet counts. After performing feature selection, it was observed that favorite-weighted polarity, retweet-weighted polarity, and favorite retweet-weighted polarity had a similar negative correlation with the label. Finally, in reaction-based feature selection, the favorite count, sentiment (polarity), and number of hashtags were used for classification. The generated features were fed into the sequential TensorFlow model, where the input data were passed through multiple hidden layers (Figure 3).
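The weighted polarities described in the second step are simple products, as the sketch below shows. The polarity score is treated as an already computed number in [-1, 1]; the sentiment tool that produces it is not named in the text, and the dictionary keys are hypothetical names.

```python
def reaction_features(polarity, retweet_count, favorite_count):
    """Weight a tweet's polarity by its retweet and favorite counts."""
    return {
        "favorite_weighted_polarity": polarity * favorite_count,
        "retweet_weighted_polarity": polarity * retweet_count,
        "favorite_retweet_weighted_polarity": polarity * (favorite_count + retweet_count),
    }


# A mildly negative tweet with 10 retweets and 4 favorites.
print(reaction_features(polarity=-0.5, retweet_count=10, favorite_count=4))
```

Because all three products share the polarity factor and the counts tend to move together, the derived features are closely related to one another, which is consistent with their similar correlation with the label.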

### 4.4 Rumor Text Classification Using Ensemble Probabilistic Features

The three preceding models were combined: the probability predicted by each model on the complete dataset was treated as a separate feature to train the ensemble model. The resulting weights indicate the extent to which the rumor classification task depends on each model, which in turn indicates the importance of the different tweet attributes. These probabilistic features were fed into a sequential TensorFlow model for rumor and non-rumor classification, where the input data were passed through multiple hidden layers (Figure 4).
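The construction of the ensemble features can be sketched as stacking: the three predicted probabilities for each tweet become a three-element feature vector. In the sketch below, a logistic regression trained by gradient descent stands in for the final neural network, and the probability values are made up for illustration; neither is taken from the study.

```python
import math


def stack_probabilities(p_text, p_user, p_reaction):
    """Build one [text, user, reaction] feature row per tweet."""
    return [list(row) for row in zip(p_text, p_user, p_reaction)]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_classifier(features, labels, lr=0.5, epochs=5000):
    """Fit logistic regression weights on the stacked probabilities."""
    weights, bias, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            error = _sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(features, weights, bias):
    """Return 1 (rumor) when the combined probability is at least 0.5."""
    return [int(_sigmoid(bias + sum(w * v for w, v in zip(weights, x))) >= 0.5)
            for x in features]


# Made-up probabilities from the three base models for four tweets.
rows = stack_probabilities([0.9, 0.8, 0.2, 0.1], [0.7, 0.4, 0.6, 0.2], [0.5, 0.6, 0.5, 0.4])
weights, bias = train_meta_classifier(rows, [1, 1, 0, 0])
print(weights, predict(rows, weights, bias))
```

The learned weights give a rough picture of how much the combined decision relies on each base model.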

### 5. Experimental Evaluation

Experimental Setup: In this research, the PHEME [14,15], Twitter15 [17], and Twitter16 [17] datasets were used to evaluate all the models. A total of 6,915 tweets were chosen from the datasets for this experiment, with 85% of the tweets used to train the models and 15% used for testing. These tweets were labeled as rumor or non-rumor: 3,815 tweets were in the non-rumor category, 2,281 were in the rumor category, and 819 had an unverified label; the unverified tweets were dropped from the experiment. The rumors were further labeled as true rumors (1,309) and false rumors (972). True rumors initially spread as rumors but were later found to contain correct information. False rumors were identified as incorrect information or deliberate misinformation, either after an extensive search by a third-party organization such as Politifact or Gossipcop or after users flagged such tweets. Unverified tweets are those whose veracity is still in question and is still being checked.

### Experiment 1

This experiment was built on a tweet text classification model applied to word vectors using deep learning. The architecture consisted of a dropout layer with a rate of 0.5, four Bi-LSTM layers with 64 nodes each, one Bi-LSTM layer with 32 nodes, and a dense layer with 32 nodes and the ReLU activation function. These were followed by a batch normalization layer, a dropout layer with a rate of 0.5, and finally a dense layer with 1 node and a sigmoid activation function. The model was compiled with the Adam optimizer and the binary cross-entropy loss (Figure 5).

### Result 1

This model was trained using the parameters obtained from the different layers of the model (Table 3). Bi-LSTM deep learning models were used to assess the results. We calculated the loss values and the training and testing accuracies of the model after 150 epochs. The results indicate that the model was well trained, with a training accuracy of 88.6% and a testing accuracy of 80.1% (Table 4).
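The Bi-LSTM parameter counts reported in Table 3 can be checked by hand. The sketch below uses the standard count for an LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias vector) and doubles it for the two directions; the function names are hypothetical, and the input sizes (300-dimensional embeddings, then 128-dimensional outputs of the previous bidirectional layer) are read from the output shapes in the table.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)


def batch_norm_params(features):
    """Scale, shift, moving mean, and moving variance per feature."""
    return 4 * features


print(bidirectional_lstm_params(300, 64))  # first Bi-LSTM layer (64 nodes)
print(bidirectional_lstm_params(128, 64))  # second to fourth Bi-LSTM layers
print(bidirectional_lstm_params(128, 32))  # last Bi-LSTM layer (32 nodes)
print(batch_norm_params(32))               # batch normalization layer
```

Each bidirectional layer outputs twice its number of nodes, which is why a 64-node layer feeds 128 values per time step into the next layer.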

### Experiment 2

This experiment was based on a Twitter rumor text classification model using neural network-based approaches applied to Twitter user metadata features. A total of 27 different features were used for experimental purposes (Table 5). First, a log transformation was applied to transform the data. Second, the correlation between these features was computed using the Pearson correlation method, and 18 important features were selected (Table 6). These features were then analyzed using the sequential TensorFlow neural network model. The first layer was a dense layer with 64 nodes, the ReLU activation function, and L2 regularization with a value of 0.01. It was followed by a dropout layer with a rate of 0.3; a dense layer with 64 nodes, the ReLU activation function, and L2 regularization with a value of 0.01; and another dropout layer with a rate of 0.3. The final layer was a dense layer with a sigmoid activation function and the Glorot uniform weight initializer. The model was compiled with the Adam optimizer and the binary cross-entropy loss (Figure 6).

### Result 2

This model was trained using the parameters obtained from the different layers of the model (Table 7). Sequential TensorFlow models were used to assess the results. We calculated the loss values and the training and testing accuracies of the model after 150 epochs. The results show that the model was well trained, with a training accuracy of 77.5% and a testing accuracy of 74.5% (Table 4).

### Experiment 3

This experiment was based on a Twitter rumor text classification model using neural network-based approaches applied to Twitter reaction features. In the first step, the tweet texts were cleaned in the same way as for text classification using word vectors, and the tweet data were then lemmatized. In the second step, we computed the polarity of the tweets and the weighted polarities, obtained by multiplying the polarity by the retweet count, the favorite count, and the sum of the favorite and retweet counts (Table 8). After performing feature selection, it was observed that favorite-weighted polarity, retweet-weighted polarity, and favorite retweet-weighted polarity had a similar negative correlation with the label. Finally, in reaction-based feature selection, the favorite count, sentiment (polarity), and number of hashtags were used for classification (Table 9). The generated features were fed into the sequential TensorFlow model. The model consisted of a dense layer with 32 nodes and the ReLU activation function, a dropout layer with a rate of 0.3, a dense layer with 16 nodes and the ReLU activation function, a dropout layer with a rate of 0.2, a dense layer with 16 nodes and the ReLU activation function, and a final dense layer with a sigmoid activation function. The model was compiled with the Adam optimizer and the binary cross-entropy loss (Figure 7).

### Result 3

This model was trained using the parameters obtained from the different layers of the model (Table 10). Sequential TensorFlow models were used to assess the results. We calculated the loss values and the training and testing accuracies of the model after 150 epochs. We found that the model reached a training accuracy of 63.5% and a testing accuracy of 63.6% (Table 4).

### Experiment 4

In this experiment, the probabilities predicted by each of the three preceding models were treated as separate features to train the ensemble model. The experiment was based on a Twitter rumor text classification model using sequential TensorFlow neural network-based approaches applied to the ensemble probabilistic weights of the three proposed models. These ensemble probabilistic weight features were fed into the sequential TensorFlow model. The model consisted of a dense layer with 64 nodes and the ReLU activation function, a dropout layer with a rate of 0.3, a dense layer with 64 nodes and the ReLU activation function, a dropout layer with a rate of 0.3, and a dense layer with 1 node, a sigmoid activation function, and the Glorot uniform weight initializer. The model was compiled with the Adam optimizer and the binary cross-entropy loss (Figure 8).

### Result 4

This model was trained using the parameters obtained from the different layers of the model (Table 11). Sequential TensorFlow models were used to assess the results. We calculated the loss values and the training and testing accuracies of the model after 150 epochs. The results show that the model was well trained, with a training accuracy of 88.9% and a testing accuracy of 82.5% (Table 4).
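The dense-layer parameter counts reported for the three feed-forward models (Tables 7, 10, and 11) follow from one formula: inputs times units, plus one bias per unit. The sketch below reproduces them; the input sizes are inferred from the first-layer counts (18 selected metadata features and 3 model probabilities, as described in the text, and 7 for the reaction model, which is an assumption based on its reported 256 first-layer parameters). Dropout layers add no parameters and are omitted.

```python
def dense_stack_params(input_dim, layer_units):
    """Parameter count of each dense layer in a feed-forward stack."""
    counts, previous = [], input_dim
    for units in layer_units:
        counts.append(previous * units + units)  # weights + biases
        previous = units
    return counts


models = {
    "user metadata features": (18, [64, 64, 1]),
    "reaction features": (7, [32, 16, 16, 1]),
    "ensemble probabilistic features": (3, [64, 64, 1]),
}
for name, (input_dim, layer_units) in models.items():
    counts = dense_stack_params(input_dim, layer_units)
    print(name, counts, sum(counts))
```

The ensemble model is the smallest of the three in terms of inputs, because it sees only one probability per base model.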

### Experiment 5

This experiment was based on the prediction of unseen test data. A total of 1,038 unseen tweets were selected from the PHEME, Twitter15, and Twitter16 datasets. All the models were evaluated on these tweets, and their accuracies were determined (Figure 9).

### 6. Results Analysis

This study compared the results of four different rumor classification models. The first experiment, rumor text classification using the word-embedding model, showed the importance of sequential text, with a validation accuracy of 80.1%. Experiments 2 and 3 showed the importance of the user metadata and reaction features of the tweet, with validation accuracies of 74.5% and 63.6%, respectively. Finally, Experiment 4 showed the importance of the combined probabilistic features of the different models. These ensemble features provided a validation accuracy of 82.5%, which was higher than those of all the other models (Figure 10). In Experiment 5, we used unseen data and obtained the most accurate results from the ensemble probabilistic features model (Figure 9).

All models were combined and trained on the dataset to obtain their final weights. These weights determined the extent to which the rumor classification task depended on each model, which in turn indicates the importance of the different tweet attributes. The main contributions of this research are as follows: (1) proposing a feature extraction mechanism and preprocessing techniques to make the data more relevant; (2) constructing a hypothesis based on the metadata in the user and tweet datasets and mapping the relationship to the problem; (3) correlating derived and extracted features to identify attribute relationships and reduce dimensionality for effective resource utilization; (4) applying ensemble strategies for merging models, as well as training and assessing the final model for a general answer; (5) formulating the relevance of feature variables by understanding their weight distribution; (6) defining loss metrics for the dichotomous rumor data categorization problem; and (7) noting that the accuracy of the ensemble model can be improved through live training and manual weight adjustment.

To understand the importance of the user metadata features in an ensemble model, the built-in feature importance function of the decision tree classifier was used to obtain the feature importance scores (Figures 11 and 12).

### 7. Conclusion and Future Work

This study compared four different models using different features of tweet data. The ensemble probabilistic features rumor classification model provided better training accuracy, validation accuracy, and loss values than the individual feature models. Furthermore, we tested the models on unseen data and concluded that the ensemble model overestimates the importance of the tweet text model because of the limited availability of data. This study highlighted the features that played a significant role in the task of identifying tweets as rumors or non-rumors. Future work on improving these models will include live training of ensemble feature models using real-time data to increase the accuracy of the model.

Figure 1. Rumor text classification using word vector features.

The International Journal of Fuzzy Logic and Intelligent Systems 2022; 22: 325-338 https://doi.org/10.5391/IJFIS.2022.22.3.325

Figure 2. Rumor text classification using user metadata features.

Figure 3. Rumor text classification using reaction features.

Figure 4. Rumor text classification using ensemble features.

Figure 5. Architecture for rumor text classification using word vectors.

Figure 6.

Figure 7. Architecture for rumor text classification using Twitter reaction features.

Figure 8. Architecture for rumor text classification using ensemble probabilistic features.

Figure 9. Accuracy analysis of unseen data.

Figure 10. Training accuracy and validation accuracy comparisons of all models.

Figure 11.

Figure 12. Feature importance analysis of the ensemble model.

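Table 1.

| Statistic | Twitter15 | Twitter16 |
|---|---|---|
| # of users | 276,663 | 173,487 |
| # of source tweets | 1,490 | 818 |
| # of non-rumors | 374 | 205 |
| # of false rumors | 370 | 205 |
| # of true rumors | 372 | 205 |
| # of unverified rumors | 374 | 203 |
| Avg. time length (hr) | 1,337 | 848 |
| Avg. # of posts | 223 | 251 |
| Max # of posts | 1,768 | 2,765 |
| Min # of posts | 55 | 81 |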

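Table 2. Statistics of PHEME dataset.

| Event | Threads | Tweets | Rumors | Non-rumors | True | False | Unverified |
|---|---|---|---|---|---|---|---|
| Charlie Hebdo | 2,079 | 38,268 | 458 | 1,621 | 193 | 116 | 149 |
| Sydney siege | 1,221 | 23,996 | 522 | 699 | 382 | 86 | 54 |
| Ferguson | 1,143 | 24,175 | 284 | 859 | 10 | 8 | 266 |
| Ottawa shooting | 890 | 12,284 | 470 | 420 | 329 | 72 | 69 |
| Germanwings crash | 469 | 4,489 | 238 | 231 | 94 | 111 | 33 |
| Putin missing | 238 | 835 | 126 | 112 | 0 | 9 | 117 |
| Prince Toronto | 233 | 902 | 229 | 4 | 0 | 222 | 7 |
| Gurlitt | 138 | 179 | 61 | 77 | 59 | 0 | 2 |
| Ebola Essien | 14 | 226 | 14 | 0 | 0 | 14 | 0 |
| Total | 6,425 | 105,354 | 2,402 | 4,023 | 1,067 | 638 | 697 |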

Table 3. Parameters received from different layers of rumor text classification model using word vectors.

| Layer (type) | Output shape | Param # |
|---|---|---|
| embedding (Embedding) | (None, 93, 300) | 2,724,600 |
| dropout (Dropout) | (None, 93, 300) | 0 |
| bidirectional (Bidirectional LSTM) | (None, 93, 128) | 186,880 |
| bidirectional_1 (Bidirectional LSTM) | (None, 93, 128) | 98,816 |
| bidirectional_2 (Bidirectional LSTM) | (None, 93, 128) | 98,816 |
| bidirectional_3 (Bidirectional LSTM) | (None, 93, 128) | 98,816 |
| bidirectional_4 (Bidirectional LSTM) | (None, 64) | 41,216 |
| dense (Dense) | (None, 32) | 2,080 |
| batch_normalization (BatchNormalization) | (None, 32) | 128 |
| dropout_1 (Dropout) | (None, 32) | 0 |
| Total params | | 6,317,985 |
| Trainable params | | 526,721 |
| Non-trainable params | | 5,791,264 |

Table 4. Loss and accuracy comparisons of different rumor text classification models.

| Model | Training loss | Validation loss | Training accuracy (%) | Validation accuracy (%) |
|---|---|---|---|---|
| Rumor text classification using word vector features | 0.274 | 0.449 | 88.6 | 80.1 |
| Rumor text classification using user metadata features | 0.469 | 0.550 | 77.5 | 74.5 |
| Rumor text classification using reaction features | 0.644 | 0.647 | 63.5 | 63.6 |
| Rumor text classification using ensemble probabilistic features | 0.270 | 0.430 | 88.9 | 82.5 |

Table 5.

| Numerical features | Categorical features |
|---|---|
| Retweet count | Verified |
| Favorite count | Is_quote_status |
| No. of symbols | Profile_image_url |
| No. of user mentions | Profile_background_image_url |
| No. of hashtags | Default profile image |
| No. of URLs | Default profile |
| Polarity | Profile_use_background_image |
| Text length | Has_location |
| Post age | Has_url |
| Status count | |
| Friends count | |
| Favorites count of user | |
| Listed count | |
| Account age | |
| Screen name length | |
| The time gap between user-created time and tweeted time | |

Table 6. Important features of tweet user metadata extracted after applying the standard scaler method.

| Numerical features | Categorical features |
|---|---|
| Text length | Profile_use_background_image |
| No. of URLs | Profile_background_image_url |
| Post age | Default_profile_image |
| Status count | Default_profile |
| Listed count | Verified |
| No. of symbols | |
| Posted_in | |
| No. of user mentions | |
| Polarity | |
| Favorites count of user | |
| Favorites count of a tweet | |
| Screen name length | |
| No. of hashtags | |

Table 7. Parameters received from different layers of rumor text classification model using tweet user metadata features.

| Layer (type) | Output shape | Param # |
|---|---|---|
| dense_1 (Dense) | (None, 64) | 1,216 |
| dropout_1 (Dropout) | (None, 64) | 0 |
| dense_2 (Dense) | (None, 64) | 4,160 |
| dropout_2 (Dropout) | (None, 64) | 0 |
| dense_3 (Dense) | (None, 1) | 65 |
| Total params | | 5,441 |
| Trainable params | | 5,441 |
| Non-trainable params | | 0 |

Table 8.

| Reaction features |
|---|
| Retweet count |
| Favorite count |
| Sentiments |
| No. of hashtags |
| Favorite weighted polarity |
| Retweet-weighted polarity |
| Favorite retweet-weighted polarity |

Table 9. Useful Twitter reaction features after applying the feature correlation method.

| Reaction features |
|---|
| Favorite count |
| Sentiments (Polarity) |
| No. of hashtags |

Table 10. Parameters received from different layers of rumor text classification model using tweet reaction features.

| Layer (type) | Output shape | Param # |
|---|---|---|
| dense_1 (Dense) | (4598, 32) | 256 |
| dropout_1 (Dropout) | (4598, 32) | 0 |
| dense_2 (Dense) | (4598, 16) | 528 |
| dropout_2 (Dropout) | (4598, 16) | 0 |
| dense_3 (Dense) | (4598, 16) | 272 |
| dense_4 (Dense) | (4598, 1) | 17 |
| Total params | | 1,073 |
| Trainable params | | 1,073 |
| Non-trainable params | | 0 |

Table 11. Parameters received from different layers of rumor text classification model using ensemble probabilistic features.

| Layer (type) | Output shape | Param # |
|---|---|---|
| dense_1 (Dense) | (None, 64) | 256 |
| dropout_1 (Dropout) | (None, 64) | 0 |
| dense_2 (Dense) | (None, 64) | 4,160 |
| dropout_2 (Dropout) | (None, 64) | 0 |
| dense_8 (Dense) | (None, 1) | 65 |
| Total params | | 4,481 |
| Trainable params | | 4,481 |
| Non-trainable params | | 0 |

### References

1. Shu, K, Sliva, A, Wang, S, Tang, J, and Liu, H (2017). Fake news detection on social media: a data mining perspective. ACM SIGKDD Explorations Newsletter. 19, 22-36. https://doi.org/10.1145/3137597.3137600
2. Chen, T, Li, X, Yin, H, and Zhang, J (2018). Call attention to rumors: deep attention based recurrent neural networks for early rumor detection. Trends and Applications in Knowledge Discovery and Data Mining. Cham, Switzerland: Springer, pp. 40-52. https://doi.org/10.1007/978-3-030-04503-6_4
3. Rath, B, Gao, W, Ma, J, and Srivastava, J (2017). From retweet to believability: utilizing trust to identify rumor spreaders on Twitter. Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Sydney, Australia, pp. 179-186. https://doi.org/10.1145/3110025.3110121
4. Ma, J, Gao, W, and Wong, KF (2017). Detect rumors in microblog posts using propagation structure via kernel learning. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, pp. 708-717.
5. Putra, YA, and Khodra, ML (2016). Deep learning and distributional semantic model for Indonesian tweet categorization. Proceedings of 2016 International Conference on Data and Software Engineering (ICoDSE), Denpasar, Indonesia, pp. 1-6. https://doi.org/10.1109/ICODSE.2016.7936108
6. Ma, J, Gao, W, Mitra, P, Kwon, S, Jansen, BJ, Wong, KF, and Cha, M (2016). Detecting rumors from microblogs with recurrent neural networks. Proceedings of the 25th International Joint Conference on Artificial Intelligence, New York, NY, pp. 3818-3824.
7. Guo, H, Cao, J, Zhang, Y, Guo, J, and Li, J (2018). Rumor detection with hierarchical social attention network. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, pp. 943-951. https://doi.org/10.1145/3269206.3271709
8. Wu, K, Yang, S, and Zhu, KQ (2015). False rumors detection on Sina Weibo by propagation structures. Proceedings of 2015 IEEE 31st International Conference on Data Engineering, Seoul, Korea, pp. 651-662. https://doi.org/10.1109/ICDE.2015.7113322
9. Wu, L, Li, J, Hu, X, and Liu, H (2017). Gleaning wisdom from the past: early detection of emerging rumors in social media. Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, TX, pp. 99-107. https://doi.org/10.1137/1.9781611974973.12
10. Margolin, DB, Hannak, A, and Weber, I (2018). Political fact-checking on Twitter: when do corrections have an effect?. Political Communication. 35, 196-219. https://doi.org/10.1080/10584609.2017.1334018
11. Nguyen, TN, Li, C, and Niederee, C (2017). On early-stage debunking rumors on twitter: leveraging the wisdom of weak learners. Social Informatics. Cham, Switzerland: Springer, pp. 141-158. https://doi.org/10.1007/978-3-319-67256-4_13
12. Zubiaga, A, Liakata, M, and Procter, R (2016). Learning reporting dynamics during breaking news for rumour detection in social media. Available: https://arxiv.org/abs/1610.07363
13. Castillo, C, Mendoza, M, and Poblete, B (2013). Predicting information credibility in time-sensitive social media. Internet Research. 23, 560-588. https://doi.org/10.1108/IntR-05-2012-0095
14. Fard, AE, Mohammadi, M, Chen, Y, and Van de Walle, B (2019). Computational rumor detection without non-rumor: a one-class classification approach. IEEE Transactions on Computational Social Systems. 6, 830-846. https://doi.org/10.1109/TCSS.2019.2931186
15. Alsaeedi, A, and Al-Sarem, M (2020). Detecting rumors on social media based on a CNN deep learning technique. Arabian Journal for Science and Engineering. 45, 10813-10844. https://doi.org/10.1007/s13369-020-04839-2
16. Alkhodair, SA, Ding, SH, Fung, BC, and Liu, J (2020). Detecting breaking news rumors of emerging topics in social media. Information Processing & Management. 57, article no. 102018.
17. Zhao, Z, Resnick, P, and Mei, Q (2015). Enquiring minds: early detection of rumors in social media from enquiry posts. Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, pp. 1395-1405. https://doi.org/10.1145/2736277.2741637
18. Asghar, MZ, Habib, A, Habib, A, Khan, A, Ali, R, and Khattak, A (2021). Exploring deep neural networks for rumor detection. Journal of Ambient Intelligence and Humanized Computing. 12, 4315-4333. https://doi.org/10.1007/s12652-019-01527-4