International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 401-413
Published online December 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.4.401
© The Korean Institute of Intelligent Systems
Alif Tri Handoyo1*, Hidayaturrahman1, Criscentia Jessica Setiadi2, Derwin Suhartono1
1Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia, 11480
2English Department, Faculty of Humanities, Bina Nusantara University, Jakarta, Indonesia, 11480
Correspondence to:
Alif Tri Handoyo (alif.handoyo@binus.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sarcasm is the use of words to ridicule someone or for humorous effect. Several studies on sarcasm detection have utilized different learning algorithms. However, most of these models focus only on the content of an expression, leaving its contextual information in isolation; as a result, they fail to capture the context of sarcastic expressions. Moreover, the datasets used in several studies are unbalanced, which affects model results. In this paper, we propose a contextual model for sarcasm identification on Twitter that uses several pre-trained models and augments the dataset by applying Global Vector representation (GloVe) for word embedding construction and context learning to generate more sarcastic data; we also perform additional experiments with a data duplication method. The impact of data augmentation and duplication is tested on various datasets and augmentation sizes. In particular, we achieved the best performance when using data augmentation to increase the sarcastic-labeled data by 20%, improving the F1 Score from 38.34% to 40.44% (a gain of 2.1 percentage points) on the iSarcasm dataset.
Keywords: Twitter sarcasm detection, RoBERTa, Word embedding, Data augmentation
Twitter is one of the largest social networking sites and a major source of big data: millions of tweets are posted every day. This data is used for commercial, industrial, and social purposes, depending on the requirements and data processing involved. Its volume grows every second, a property known as velocity, and this constant growth makes the data difficult to analyze. Many companies and organizations are interested in this data to study people's opinions toward political events [1], popular products [2], or movies [3].
A challenge specific to sarcasm detection is the difficulty of acquiring ground-truth annotations. Human-annotated datasets usually contain only a few thousand texts, resulting in many small datasets. In comparison, automatic data collection using distant supervision signals such as hashtags yields substantially larger datasets [4]. Nevertheless, the automatic approach also leads to label noise; for example, nearly half of the tweets with sarcasm hashtags in one dataset were found to be non-sarcastic. There is also a noise-free dataset in which the sarcastic tweets are labeled by their own authors, but its classes are heavily unbalanced, resulting in very low performance [5].
Some people are more sarcastic than others, but sarcasm is very common in general, even though it is difficult to recognize [6]. Sarcasm is a form of irony that occurs when there is a discrepancy between the literal and intended meanings of an utterance. This discrepancy is often used to express contempt or to distance the speaker from a previous statement [5]. People use sarcasm not only for jokes and humor in daily life but also when criticizing or speaking out about ideas and events. It is therefore widely used on social networks, especially microblogging sites such as Twitter.
In recent years, various studies have been conducted to build models that predict sarcastic tweets on Twitter. However, only a few have focused on data augmentation techniques for sarcasm detection. Some of this research is reviewed in the following paragraphs.
In 2020, Hankyol Lee et al. proposed CRA (Contextual Response Augmentation), which utilizes conversational context to generate meaningful context for training samples. They experimented with data augmentation on the labeled Twitter dataset from the FigLang2020 sarcasm challenge, in which each training sample consists of a contextual utterance, a response, and a label (“SARCASM” or “NOT_SARCASM”). To augment the non-sarcastic data, they used the context sequence as a new data point labeled NOT_SARCASM; for the sarcastic texts, they used a simple back-translation method. The labeled data were trained using an ensemble of BERT [7], BiLSTM [8], and NeXtVLAD [9]. The data augmentation technique gave a performance boost of only 0.66%, with an 86.13% F1 Score compared to 85.46% without augmentation on the Twitter dataset [10].
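Back-translation produces paraphrases by translating a text into a pivot language and back. The following is a minimal sketch of that idea, not the authors' exact pipeline; the Hugging Face MarianMT checkpoints and the example tweet are our own choices for illustration.

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Load the translation model and run generation on a small batch.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    outputs = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in outputs]

tweet = ["oh great, another monday morning meeting"]
pivot = translate(tweet, "Helsinki-NLP/opus-mt-en-fr")  # English -> French
back = translate(pivot, "Helsinki-NLP/opus-mt-fr-en")   # French -> English
print(back)  # a paraphrased variant usable as an extra sarcastic sample
```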
In 2021, Abeer Abuzayed et al. proposed seven fine-tuned BERT-based models with data augmentation to solve the imbalanced-data problem in the ArSarcasm-v2 [11] dataset. ArSarcasm-v2 was built for both sarcasm detection and sentiment analysis. The sarcasm detection portion contains Arabic tweets with imbalanced classes: 10,380 non-sarcastic tweets and only 2,168 sarcastic tweets. The sentiment analysis portion contains neutral, negative, and positive sentiments. To address the imbalance in the sarcasm detection task, they took the negative-sentiment tweets from the sentiment analysis portion and used them as additional sarcastic-labeled data, because their analysis showed that 89% of tweets labeled as sarcastic also carried negative sentiment; they therefore theorized that negative tweets can be sarcastic too. Of the seven BERT-based models used to detect sarcastic tweets, MARBERT gave the best scores, reaching an 80% F1 Score for the sarcastic label after performing data augmentation [12].
In 2022, Shubham K. and Mosab S. experimented with two data augmentation techniques, word embeddings and instance repetition, on an internal dataset provided by the organizers of a SemEval-2022 task. They augmented the original dataset using a word-embedding technique based on GloVe [13] and an original-repetition technique, then built a model with each augmentation technique to predict sarcastic tweets. The model was trained using BERT [7] and achieved 79% test accuracy before augmentation, 81% with the original-repetition technique, and 83% with the word-embedding technique [14].
Another 2022 study from the SemEval-2022 task focused on analyzing sarcasm detection with data augmentation. Amirhossein A. et al. proposed a mutation-based data augmentation technique that performs three processes in sequential order: shuffling, deleting, and replacing. The augmentation first shuffles the sentence, then deletes some words, and finally replaces words with synonyms taken from the Thesaurus website (a rough sketch of this pipeline is shown below). The best results were obtained with a RoBERTa [15] model, reaching a 41.4% F1 Score on the iSarcasm [5] dataset [16].
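A toy sketch of such a mutation pipeline, under our own assumptions: we shuffle at the word level and use a small hard-coded synonym table as a stand-in for the Thesaurus lookups.

```python
import random

# Hypothetical synonym table standing in for the Thesaurus website lookups.
SYNONYMS = {"great": ["wonderful", "fantastic"], "love": ["adore", "enjoy"]}

def mutate(sentence, p_delete=0.1):
    words = sentence.split()
    random.shuffle(words)                                      # 1) shuffle
    words = [w for w in words if random.random() > p_delete]   # 2) delete
    words = [random.choice(SYNONYMS[w]) if w in SYNONYMS else w
             for w in words]                                   # 3) replace
    return " ".join(words)

print(mutate("i just love a great monday morning"))
```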
Based on previous work [10, 12, 14, 16], it can be concluded that labeled sarcastic data resources for sarcasm detection are still scarce. Some research has therefore tackled the imbalanced-data problem caused by this lack of sarcastic-labeled data, and various studies have explored different data augmentation techniques. However, many performed the augmentation on their own private datasets, and only a few used publicly available ones. Their analysis was also limited, mainly because they focused only on the F1 Score and lacked a confusion-matrix analysis before and after performing the data augmentation.
To address these issues, this research makes the following contributions:
(1) Applying three different pre-trained models, BERT [7], RoBERTa [15], and DistilBERT [17], to four publicly available Twitter sarcasm detection datasets, iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20], to analyze their performance in detecting sarcastic tweets.
(2) Performing data augmentation and data duplication techniques on these datasets (iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20]) and analyzing the performance impact in terms of F1 Score.
(3) Analyzing the performance impact of data augmentation and duplication through the confusion matrix, using MCC and a detailed comparison of the numbers of true positives, true negatives, false positives, and false negatives when predicting sarcastic text across the datasets [5, 18, 19, 20].
In this paper, we explore data augmentation for sarcasm detection models based on two questions: (1) How can the imbalanced-dataset problem in some sarcasm tweet datasets be tackled? (2) What is the performance impact of data augmentation when predicting sarcastic sentences?
We analyze the performance impact of data augmentation on four datasets: iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20]. Data augmentation is applied to the sarcastic-labeled data to increase its size. GloVe [13] was used for word embeddings to generate more data by replacing words in a sentence with words of similar meaning, such as replacing “good” with “happy”. We use three pre-trained neural representation models, BERT [7], RoBERTa [15], and DistilBERT [17], to predict sarcastic tweets. The results are compared before and after performing data augmentation that increases the sarcastic-labeled data by 10%, 20%, and 30%, and are also compared against a simple data duplication technique.
The remainder of this paper is structured as follows. Section II describes some of the related work. Section III describes in detail our experiment method. Section IV illustrates our experiments, results, and discussion. Section V concludes the result of this work and mentions some scope for future research.
Acquiring large and reliable datasets has been quite a challenge for sarcasm detection. Due to the cost of annotation, manually labeled datasets such as iSarcasm [5] and SemEval-18 [20] typically contain only a few thousand examples and are often unbalanced. Automatically crawled datasets such as Ghosh [18] and Ptacek [19] provide much more data, but the result is considerably noisy. As a case study, after examining some automatically crawled datasets, it was found that nearly half of the tweets with sarcastic labels are non-sarcastic [5].
Sarcasm identification has been studied using a variety of methods, such as lexicon-based, traditional machine learning, deep learning, and hybrid approaches. Sarcasm detection work has also been reviewed through a Systematic Literature Review (SLR) of sarcasm identification in textual data [12]. That review covers the datasets, pre-processing, feature engineering, classification algorithms, and performance measurements in the sarcasm identification literature. It found that content-based features are the most common features used in sarcasm classification, and that standard metrics such as precision, recall, accuracy, F-measure, and Area Under the Curve (AUC) are the most commonly used metrics for evaluating classifier performance. It also found that AUC is the best choice when the dataset is imbalanced, due to its robustness to class imbalance.
Many researchers have analyzed the task of sarcasm identification. The study “Expect the Unexpected: Harnessing Sentence Completion for Sarcasm Detection” reported two methods, “incongruous words-only” and “all-words”, employing sentence completion for sarcasm analysis [13]. Two datasets were used for evaluation: Twitter data collected by Riloff et al. [14], consisting of 2,278 tweets (506 sarcastic and 1,772 non-sarcastic), and a manually labeled, balanced dataset collected by [15] (752 sarcastic and 752 non-sarcastic tweets). WordNet and word2vec were used to measure similarity. Under 2-fold cross-validation, the all-words method with word2vec similarity gave an overall F1 Score of 54%, while the incongruous words-only method with WordNet similarity performed considerably better, with F1 Scores of 80.24% and 80.28% reported.
The Robustly Optimized BERT Approach (RoBERTa) used 10 times more data than BERT (160 GB compared with the original 16 GB) and was trained for more steps (500K vs. 100K) with 8-times larger batch sizes, using a byte-level BPE vocabulary instead of the original character-level vocabulary. Another improvement was dynamic masking, used in RoBERTa instead of the single static mask used in BERT. Moreover, RoBERTa removes the next sentence prediction (NSP) objective used in BERT, following the recommendation of studies that questioned the NSP loss term [11]. RoBERTa [15] has been used in various sarcasm identification tasks and achieved satisfactory results. A study using a contextual embedding method with RoBERTa [15] obtained an 82.44% F1 Score [21]. A study using RoBERTa [15] with the context separators method achieved a 77.2% F1 Score [22]. A study of Transformers for sarcasm detection with context obtained its highest performance using RoBERTa [15], with a 77.22% F1 Score [23]. Another study on Transformer-based context-aware sarcasm detection in conversation threads from social media obtained a 78.3% F1 Score on Twitter and a 74.4% F1 Score on Reddit [24].
Data augmentation is a well-known technique in computer vision for introducing different distributions of data to extend a dataset and improve a model's performance across multiple tasks. It is generally believed that the more data a neural network is trained on, the more effective it will be. Augmentations are applied through simple image transformations such as rotation, cropping, flipping, translation, and the addition of Gaussian noise. In one study, data augmentation was used to increase the training data for a deep neural network on the ImageNet [25] dataset; increasing the number of training samples reduced overfitting and improved model performance [26]. These techniques allow the model to learn additional patterns and new positional aspects of the objects in an image.
Similarly, data augmentation methods have been explored in text processing to improve model efficacy. Studies replaced random words in sentences with their synonyms, generated augmented data, and trained a recurrent network for sentence similarity tasks [27]. In another study, word embeddings were used to generate augmented data and increase the dataset size for training a multi-class classifier on tweet data: cosine similarity was used to find the nearest neighbor of a word vector, which then replaced the original word, with the words to replace selected stochastically [28]. A sketch of this nearest-neighbor substitution appears below.
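As an illustration of embedding-based substitution (our own sketch, assuming a local GloVe file converted to word2vec text format, e.g., with gensim's glove2word2vec utility):

```python
import random
from gensim.models import KeyedVectors

# Assumes glove.6B.100d has been converted to word2vec text format.
vectors = KeyedVectors.load_word2vec_format("glove.6B.100d.word2vec.txt")

def embed_substitute(sentence, p=0.2):
    words = sentence.split()
    for i, w in enumerate(words):
        # Stochastic selection: each in-vocabulary word is replaced with
        # probability p by its nearest neighbor under cosine similarity.
        if w in vectors and random.random() < p:
            words[i] = vectors.most_similar(w, topn=1)[0][0]
    return " ".join(words)

print(embed_substitute("good morning please go and vote"))
```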
For information extraction, one study [29] applied data augmentation to a legal dataset [30]. Class-specific probability classifiers were trained to identify a particular contract element type for each token in a sentence, categorizing each token based on its surrounding window. The word embeddings were obtained by pre-training a word2vec model [31] on unlabeled contract data.
That study investigated three data augmentation methods, namely interpolation, extrapolation, and random noise, which manipulate word embeddings to obtain new sentence representations. It found that interpolation performed better than extrapolation and random noise [29].
Another study examined various augmentation methods that introduce Gaussian or Bernoulli noise into the word embeddings for text classification tasks such as sentence classification, sentiment classification, and relation classification [32].
Given a set of tweets, we aim to classify each of them as sarcastic or not. We use data augmentation to increase the sarcastic-labeled data by 10%, 20%, and 30%, respectively. From each tweet, we then extract a set of features with reference to the training set and use three different models, BERT [7], RoBERTa [15], and DistilBERT [17], to perform the classification. The features are extracted so as to cover the different types of sarcasm we identified. We also perform an additional experiment with a data duplication technique, duplicating 20% of the sarcastic data, to further analyze the impact of data augmentation on sarcasm detection. The workflow of our proposed methodology is shown in Figure 1.
We conduct sarcasm detection experiments using two automatically collected datasets and two manually annotated datasets. The automatically collected datasets, Ptacek [19] and Ghosh [18], treat tweets with particular hashtags such as #sarcastic, #sarcasm, and #not as sarcastic, and all others as non-sarcastic.
The iSarcasm [5] dataset contains tweets written by participants of an online survey and is thus an example of intended sarcasm detection, while SemEval-18 [20] consists of both sarcastic and ironic tweets labeled by third-party annotators and is thus used for perceived sarcasm detection. The Ghosh [18] and Ptacek [19] datasets are balanced. For all datasets, we use the predefined train, validation, and test sets from [4]: we keep their train split as our training data and concatenate their validation and test splits into our test data. Thus, our proposed method uses a two-way split of train and test data, as sketched below.
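A minimal sketch of this split assembly; the CSV file names are placeholders for the per-dataset splits released by [4].

```python
import pandas as pd

# Placeholder file names for the published iSarcasm splits from [4].
train = pd.read_csv("isarcasm_train.csv")
valid = pd.read_csv("isarcasm_valid.csv")
test = pd.read_csv("isarcasm_test.csv")

# Keep the train split; merge validation + test into a single test set.
test = pd.concat([valid, test], ignore_index=True)
print(len(train), len(test))  # e.g., 3,463 train and 887 test for iSarcasm
```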
Details of every dataset used in this research (iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20]) are given in Table 1.
The proportion of sarcastic and non-sarcastic data in each dataset is shown in Table 2.
Based on the distribution in Table 2, the most unbalanced dataset is iSarcasm [5]; the remaining datasets are fairly balanced.
Preprocessing is an integral part of any NLP study; it ensures that irrelevant parts of the text do not add weight or bias to the experiments. The preprocessing steps conducted in this study are as follows.
First, the datasets were preprocessed using a lexical normalization tool for tweets [33]. We then cleaned the four datasets by dropping all duplicate tweets within and across datasets and trimming the texts to a maximum length of 100.
Next, we cleaned the tweet texts by removing URL links, hashtags, foreign-language (non-ASCII) characters, stop words, and emoji; an illustrative cleaning pass is sketched below. Lastly, we perform data augmentation.
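A rough sketch of such a cleaning pass; the lexical normalization tool [33] is not reproduced here, and the regexes and toy stop-word list are our own simplifications.

```python
import re

STOPWORDS = {"a", "an", "the", "and", "or"}  # toy stop-word list

def clean_tweet(text, max_len=100):
    text = re.sub(r"http\S+", "", text)        # drop URL links
    text = re.sub(r"#\w+", "", text)           # drop hashtags
    text = re.sub(r"[^\x00-\x7F]+", "", text)  # drop non-ASCII chars and emoji
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(words)[:max_len]           # trim to maximum length 100

print(clean_tweet("Good morning! Go and vote https://t.co/xyz #brexit 😀"))
```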
We use data augmentation to increase the amount of sarcastic-labeled text and make the datasets more balanced, and then observe the impact on performance.
To make a conclusive report, we apply data augmentation to the sarcastic-labeled text in all four datasets, regardless of whether they are balanced, so that the impacts can be compared. GloVe [13] was used for word embeddings to generate more data by replacing words in sentences with words of similar meaning. We perform data augmentation at three different sizes, increasing the sarcastic-labeled data by 10%, 20%, and 30%. In addition, we perform data duplication at a 20% size (sketched below). The amount of sarcastic data in every dataset after augmentation is shown in Table 3.
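The duplication step is straightforward; a sketch under our own assumptions (a pandas frame with a binary `label` column, 1 = sarcastic):

```python
import pandas as pd

def duplicate_sarcastic(df, frac=0.2, seed=128):
    # Copy a random 20% of the sarcastic rows back into the training frame.
    sarcastic = df[df["label"] == 1]
    extra = sarcastic.sample(frac=frac, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)
```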
To generate the augmented text, we use the nlpaug [34] WordEmbsAug augmenter with the substitute action and GloVe [13] embeddings. That is, we use the pre-trained GloVe [13] word-embedding model glove.6B.100d to perform a similarity search for selected words in a sentence and replace them with the most similar word; the data augmentation in this proposed method therefore operates at the word level. To further illustrate how the augmentation works, a visualization of the process is given in Figure 2.
As Figure 2 shows, the augmentation process starts by feeding the original text to GloVe [13] to obtain the vector representation of each word. The nlpaug [34] library then randomly picks 20% of the words in the text (an augmentation proportion of 0.2) and uses GloVe [13] to perform synonym replacement, selecting the word whose embedding is most similar by cosine similarity in the pre-trained glove.6B.100d model. The picked words in the original text are then replaced with these synonyms. A minimal sketch of the augmentation call follows.
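A minimal sketch of the call, assuming a local copy of the glove.6B.100d vectors; the file path is a placeholder.

```python
import nlpaug.augmenter.word as naw

aug = naw.WordEmbsAug(
    model_type="glove",
    model_path="glove.6B.100d.txt",  # placeholder path to the GloVe vectors
    action="substitute",             # replace words rather than insert new ones
    aug_p=0.2,                       # augment 20% of the words in each text
)

original = "good morning, please go and vote!"
print(aug.augment(original))         # the word-substituted variant(s)
```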
An example of an original text taken from the iSarcasm [5] dataset and its augmented result can be seen in Table 4.
The results in Table 4 show that the augmented text substitutes only a small portion of the words in the original text, because our proposed model augments only 20% of the words. We find that with such a small percentage of word replacement, the context of the sentence, and more importantly its sarcastic nature, is preserved.
In this paper, we experiment with three pre-trained deep learning NLP (natural language processing) models: BERT [7], RoBERTa [15], and DistilBERT [17]. We use the training dataset to tune the hyper-parameters of these models. For the bigger, automatically collected datasets, Ptacek [19] and Ghosh [18], we use a batch size of 32 and 8 epochs, while for the two smaller, manually annotated datasets, iSarcasm [5] and SemEval-18 [20], we use a batch size of 16 and 13 epochs. Details of the hyper-parameter settings are given in Table 5.
Most of these hyper-parameter values are the defaults of the pre-trained models. In addition, we conducted an experiment to obtain the optimal values for num_train_epochs and train_batch_size using a grid-search method.
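The argument names in Table 5 match those of the simpletransformers library, so the fine-tuning call can be sketched with that interface; this mapping is our assumption, and `train_df` stands for a frame with `text` and `labels` columns.

```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "max_seq_length": 40,
    "learning_rate": 1e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.2,
    "max_grad_norm": 1.0,
    "num_train_epochs": 13,   # 8 for the larger Ghosh and Ptacek datasets
    "train_batch_size": 16,   # 32 for the larger datasets
    "fp16": True,
    "manual_seed": 128,
}

# Binary classifier: label 1 = sarcastic, 0 = non-sarcastic.
model = ClassificationModel("roberta", "roberta-base", args=model_args)
model.train_model(train_df)   # train_df: a pandas frame ["text", "labels"]
```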
This section presents the results of all experiments. We start by analyzing the performance of the three models, BERT [7], RoBERTa [15], and DistilBERT [17], for detecting sarcastic tweets on the four datasets: iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20]. Next, we analyze the performance of every model after data augmentation and data duplication and compare the results. The section begins by discussing how each size configuration of data augmentation and duplication affects the performance of the sarcasm detection models, followed by a discussion of the confusion matrix using true positive, true negative, false positive, and false negative analysis to further examine the impact of augmentation and duplication across the datasets.
The performance results are reported in terms of F1 Score and MCC (Matthews correlation coefficient). The F1 Score is a suitable metric here because of the imbalance of the iSarcasm [5] dataset. MCC is also used to better analyze the confusion matrix: it depends not only on the positive results but also on the negative ones, and thus provides more information about the model's ability to predict both sarcastic and non-sarcastic sentences, whereas the F1 Score focuses only on the model's ability to predict sarcastic sentences. Both metrics can be computed as sketched below.
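A small sketch of the two metrics via scikit-learn, on toy labels:

```python
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = sarcastic, 0 = non-sarcastic
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))           # scores the sarcastic class only
print(matthews_corrcoef(y_true, y_pred))  # uses all four confusion-matrix cells
```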
Table 6 shows the performance of the three models, BERT [7], RoBERTa [15], and DistilBERT [17], before and after performing data augmentation to increase the sarcastic data by 10%, 20%, and 30%, and after performing data duplication to increase the sarcastic data by 20%.
The first thing to notice in the performance results in Table 6 is that the best-performing model for detecting sarcastic sentences is RoBERTa [15]: it gives the highest F1 Score and MCC in every experimental scenario, both before and after augmentation, on every dataset. This is likely because RoBERTa [15] uses a byte-level Byte-Pair Encoding (BPE) scheme in its tokenization, in contrast to models like BERT [7] and DistilBERT [17].
BPE ensures that the most common words are represented in the vocabulary as single tokens, while rare words are broken into two or more subword tokens; it thus represents the corpus with the fewest tokens, which is the main goal of the BPE algorithm, namely data compression. Representing the corpus with fewer tokens helps the model solve the classification task, in this case classifying whether a sentence is sarcastic or not. Another important factor that makes RoBERTa [15] better than the other models is that it was trained on larger datasets, over 160 GB of uncompressed text from various sources. The subword behavior of byte-level BPE can be seen directly from the tokenizer, as sketched below.
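A quick illustration (the model name assumes the Hugging Face hub):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")  # byte-level BPE tokenizer
print(tok.tokenize("good"))           # a common word stays a single token
print(tok.tokenize("sarcastically"))  # a rarer word splits into subword tokens
```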
The RoBERTa [15] results in Table 6 show that different conclusions can be drawn about the impact of data augmentation and duplication on different datasets. iSarcasm [5], a very imbalanced dataset, gives the best result when 20% sarcastic data is added through augmentation, with an F1 Score of 40.44% compared to 38.34% before augmentation. Data augmentation can therefore give a slight performance boost on an imbalanced dataset like iSarcasm [5]. However, augmenting too much data can cause a performance drop due to the introduced bias, while augmenting too little makes no noticeable difference. The same holds for MCC: as the model becomes better at predicting sarcastic sentences, MCC also gains a slight boost, at 30.84% compared to 28.42% before augmentation.
RoBERTa's [15] performance on the Ghosh [18] and Ptacek [19] datasets, which are large and moderately balanced, shows that the augmentation technique did not increase F1 or MCC. However, with the data duplication technique, both datasets show a slight boost: an 80.42% F1 Score compared to 79.04% before augmentation on Ghosh [18], and an 87.41% F1 Score compared to 85.96% on Ptacek [19]. Data augmentation in this case reduces performance, likely because generating data by percentage on a bigger dataset produces a large amount of new data and thus introduces more bias. MCC likewise gains a slight boost on Ghosh [18] and Ptacek [19], at 65.41% and 74.74%, compared to 62.99% and 74.54%, respectively.
Based on Table 6, RoBERTa's [15] performance on the SemEval-18 [20] dataset, which is small and very balanced, shows a slight increase, with a 67.46% F1 Score after augmenting 30% sarcastic data, compared to 66.37% before augmentation. MCC shows the same pattern, increasing slightly to 43.82% from 41.09%. Given the small amount of data in this dataset, data augmentation did not have a significant impact on performance.
From the overall F1 Score and MCC comparison in Table 6, it can be observed that duplicating data provides a greater performance boost than data augmentation on big datasets like Ghosh [18] and Ptacek [19]. This happens because duplication makes the model more sensitive to the relevant keywords in the duplicated data, whereas augmentation can generate new data that turns the sarcasm keywords into completely new words, introducing noise and making the model perform poorly. However, on small datasets like iSarcasm [5] and SemEval-18 [20], augmentation with the right size gives better results than duplication. Furthermore, on small and imbalanced data like iSarcasm [5], data augmentation clearly gives a better performance boost than duplication.
A comparison and discussion of our proposed model against other work in detecting sarcastic sentences is given in Table 7.
Based on the F1 Score and MCC results in Table 6, the best model for detecting sarcastic tweets across the various datasets (iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20]) is RoBERTa [15]. Our confusion-matrix analysis therefore focuses on this model. Figures 3 to 6 show the percentage change in true positives, true negatives, false positives, and false negatives after performing data augmentation across the datasets.
A true positive (TP), shown in Figure 3, means that a tweet predicted as sarcastic is indeed sarcastic according to the experimental evaluation. A true negative (TN), shown in Figure 4, means that a tweet predicted as non-sarcastic is validated as non-sarcastic. A false positive (FP), shown in Figure 5, occurs when a tweet is predicted as sarcastic while the original tweet is actually non-sarcastic. A false negative (FN), shown in Figure 6, occurs when a tweet is predicted as non-sarcastic while the original tweet is actually sarcastic, meaning the model fails to detect the sarcasm. These four cells can be read off the confusion matrix as sketched below.
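A sketch of extracting the four cells with scikit-learn, again on toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = sarcastic, 0 = non-sarcastic
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```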
The true positive results in Figure 3 show that in the iSarcasm [5] dataset, the number of true positives increased the most, by 5.88%, at 20% data augmentation, meaning that the number of tweets correctly predicted as sarcastic increased. However, for the SemEval-18 [20] dataset, the true positives decreased by 1.51% at 20% augmentation even though the F1 Score increased according to Table 6. The Ghosh [18] and Ptacek [19] datasets show increases in true positives of 1.22% and 1.24%, respectively, with the data duplication technique, in line with the increased F1 Scores in Table 6.
The true negative results in Figure 4 show that in the SemEval-18 [20] dataset, true negatives increased by 7.14% at 20% data augmentation, meaning the number of tweets correctly predicted as non-sarcastic increased. The other datasets, iSarcasm [5], Ghosh [18], and Ptacek [19], occasionally show similar trends. This suggests that augmenting the data helps the model learn the non-sarcastic data.
The false positive results in Figure 5 show that, overall, data augmentation reduced the number of false positives on all datasets. This means the model becomes less likely to wrongly predict a non-sarcastic text as sarcastic.
The false negative results in Figure 6 show that every balanced dataset, Ghosh [18], Ptacek [19], and SemEval-18 [20], exhibits a trend of increased false negatives. This means that, despite the increased sarcastic data, the model also becomes more likely to mispredict sarcastic data. This is not the case for an imbalanced dataset like iSarcasm [5], where the number of false negatives decreases by 2.86% at 20% data augmentation.
From the confusion-matrix analysis in Figures 3 to 6, it can be concluded that data augmentation mainly improves the model at predicting non-sarcastic texts, since every dataset shows a significant increase in true negatives. Moreover, the true positive results mostly show only a slight increase, or even a decrease, indicating little improvement in predicting sarcastic texts. The exception is an imbalanced dataset like iSarcasm [5], where the positive results improve noticeably with the right amount of data augmentation.
In this research, we worked with four datasets, augmented the data under three different scenarios, and performed an additional data duplication experiment. From the augmentation results, it can be concluded that, in general, data augmentation does not significantly improve the model's ability to predict sarcastic text, but it does increase the model's ability to predict non-sarcastic text. However, performing data augmentation with the correct size and word-embedding model on an unbalanced dataset like iSarcasm [5] can increase the F1 Score by 2.1 percentage points. To investigate further, we performed data duplication; the results show that duplication gives a greater F1 Score boost than augmentation on bigger datasets like Ghosh [18] and Ptacek [19]. Duplicating data makes the model more sensitive to the relevant keywords in the duplicated data, which boosts performance, whereas augmentation can generate new data that changes the keywords and introduces noise into the dataset, causing a performance drop. This was not the case on smaller datasets like iSarcasm [5] and SemEval-18 [20]: there, percentage-based augmentation generates only small amounts of new data, making augmentation a better technique for increasing performance than duplication.
In the future, other data augmentation techniques such as insert, swap, or delete operations can be investigated, along with other pre-trained word-embedding models like word2vec or fastText for generating data. Moreover, contextual word embeddings from various pre-trained models, or combinations of several augmentation techniques, could be used to generate new sentences.
No potential conflict of interest relevant to this article was reported.
Figure 3. True positive quantity change in percentage after performing data augmentation across multiple datasets.
Figure 4. True negative quantity change in percentage after performing data augmentation across multiple datasets.
Figure 5. False positive quantity change in percentage after performing data augmentation across multiple datasets.
Figure 6. False negative quantity change in percentage after performing data augmentation across multiple datasets.
Table 1. Distribution of Training and Test data in various datasets.
Dataset | Train | Test | Total |
---|---|---|---|
iSarcasm | 3,463 | 887 | 4,350 |
Ghosh | 37,082 | 4,121 | 41,203 |
Ptacek | 56,677 | 6,298 | 62,975 |
SemEval-18 | 3,776 | 780 | 4,556 |
Table 2. The proportion of sarcastic and non-sarcastic data in various datasets.
Dataset | Non-sarcastic | Sarcastic | % Sarcasm |
---|---|---|---|
iSarcasm | 3,584 | 766 | 17.62% |
Ghosh | 22,725 | 18,478 | 44.84% |
Ptacek | 31,810 | 31,165 | 49.50% |
SemEval-18 | 2,379 | 2,177 | 49.12% |
Table 3. The amount of sarcastic data after performing data augmentation in each dataset.
Dataset | Sarcastic data with (10% augmentation) | Sarcastic data with (20% augmentation) | Sarcastic data with (30% augmentation) |
---|---|---|---|
iSarcasm | 820 | 875 | 929 |
Ghosh | 20,147 | 21,816 | 23,485 |
Ptacek | 33,970 | 36,776 | 39,581 |
SemEval-18 | 2,363 | 2,550 | 2,736 |
Table 4. Augmented text results.
Original text | Augmented text |
---|---|
good morning, please go and vote ! it only takes minutes and a low turnout will hand victory to the brexit party e uelections 2019 | good morning, please go and vote! it only takes left and very low turnout will right victory this the brexit party u uelections 2019 |
Table 5. Hyper-parameter settings.
Hyper-Parameter | Value |
---|---|
max_seq_length | 40 |
learning_rate | 0.00001 |
weight_decay | 0.01 |
warmup_ratio | 0.2 |
max_grad_norm | 1.0 |
num_train_epochs | 8,13 |
train_batch_size | 16,32 |
fp16 | true |
manual_seed | 128 |
Table 6. Performance Comparison in terms of F1 Score and MCC for three different models on four different datasets, before and after augmentation with various sizes and with data duplication technique.
Model | Dataset | Before augmentation | Augmentation 10% | Augmentation 20% | Augmentation 30% | Data duplication 20% | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
F1 | MCC | F1 | MCC | F1 | MCC | F1 | MCC | F1 | MCC | ||
BERT | iSarcasm | 0.2421 | 0.3450 | 0.2499 | 0.3217 | 0.2567 | 0.2922 | 0.2410 | 0.3406 | ||
Ghosh | 0.7846 | 0.6184 | 0.7760 | 0.6042 | 0.7693 | 0.5963 | 0.7732 | 0.6048 | |||
Ptacek | 0.8596 | 0.7184 | 0.8602 | 0.7192 | 0.8598 | 0.7190 | 0.8587 | 0.7165 | |||
SemEval-18 | 0.6442 | 0.3642 | 0.6503 | 0.3825 | 0.6257 | 0.3505 | 0.6259 | 0.3578 | |||
RoBERTa | iSarcasm | 0.3834 | 0.2842 | 0.3809 | 0.2964 | 0.3828 | 0.2939 | 0.3925 | 0.2914 | ||
Ghosh | 0.7904 | 0.6299 | 0.7830 | 0.6284 | 0.7758 | 0.6193 | 0.7835 | 0.6294 | |||
Ptacek | 0.8735 | 0.7454 | 0.8738 | 0.7491 | 0.8727 | 0.7469 | 0.8717 | 0.7442 | |||
SemEval-18 | 0.6637 | 0.4109 | 0.6666 | 0.4286 | 0.6707 | 0.4362 | 0.6627 | 0.4134 | |||
DistilBERT | iSarcasm | 0.3083 | 0.2080 | 0.2924 | 0.1890 | 0.2784 | 0.1926 | 0.2991 | 0.2224 | ||
Ghosh | 0.7831 | 0.6148 | 0.7651 | 0.5854 | 0.7620 | 0.5799 | 0.7571 | 0.5700 | |||
Ptacek | 0.8542 | 0.7101 | 0.8538 | 0.7094 | 0.8508 | 0.7055 | 0.8569 | 0.7163 | |||
SemEval-18 | 0.6066 | 0.3180 | 0.6240 | 0.3534 | 0.6366 | 0.3822 | 0.6130 | 0.3261 |
Table 7. Comparison table of similar work.
Authors | Techniques used | Discussions |
---|---|---|
Xu Guo et al. [4] | Model: BERT with Latent-Optimization Method | F1 Score: |
Amirhossein et al. [16] | Model: BERT-based models with mutation-based data augmentation | F1 Score: 41.4% (RoBERTa, iSarcasm [5]) |
Our proposed method | Model: BERT, RoBERTa, DistilBERT with GloVe-based data augmentation | F1 Score: 40.44% (RoBERTa, iSarcasm [5], 20% augmentation) |
International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 401-413
Published online December 25, 2022 https://doi.org/10.5391/IJFIS.2022.22.4.401
Copyright © The Korean Institute of Intelligent Systems.
Alif Tri Handoyo1*, Hidayaturrahman1, Criscentia Jessica Setiadi2, Derwin Suhartono1
1Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia, 11480
2English Department, Faculty of Humanities, Bina Nusantara University, Jakarta, Indonesia, 11480
Correspondence to:Alif Tri Handoyo (alif.handoyo@binus.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sarcasm is the use of words commonly used to ridicule someone or for humorous purposes. Several studies on sarcasm detection have utilized different learning algorithms. However, most of these learning models have always focused on the contents of expression only, thus leaving the contextual information in isolation. As a result, they failed to capture the contextual information in the sarcastic expression. Moreover, some datasets used in several studies have an unbalanced dataset, thus impacting the model result. In this paper, we propose a contextual model for sarcasm identification in Twitter using various pre-trained models and augmenting the dataset by applying Global Vector representation (GloVe) for the construction of word embedding and context learning to generate more sarcastic data, and also perform additional experiments by using the data duplication method. Data augmentation and duplication impact is tested in various datasets and augmentation sizes. In particular, we achieved the best performance after using the data augmentation method to increase 20% of the data labeled as sarcastic and improve the performance by 2.1% with an F1 Score of 40.44% compared to 38.34% before using data augmentation in the iSarcasm dataset.
Keywords: Twitter sarcasm detection, RoBERTa, Word embedding, Data augmentation
Twitter Data is one of the big data in terms of size because it has millions of tweets every day. It is also one of the largest social networking sites. We use this Twitter data for commercial and industrial or social purposes in accordance with our requirements and data processing. It is a large amount of data in size that increases every second, known as velocity. Due to the large amount of data that increases every day, we cannot easily analyze this data. Many companies and organizations have been interested in these data to study the opinion of people towards political events [1], popular products [2], or movies [3].
A challenge specific to sarcasm detection is the difficulty in acquiring ground-truth annotations. Human-annotated datasets usually contain only a few thousand texts, resulting in many small datasets. In comparison, automatic data collection using distant supervision signals like hashtags yielded substantially larger datasets [4]. Nevertheless, the automatic approach also led to label noise. For example, it is found nearly half of the tweets with sarcasm hashtags in one dataset are non-sarcastic. There is also a noise-free example of a dataset where the sarcasm tweets are labeled by their authors but have huge, unbalanced data resulting in a very low performance [5].
Some people are more sarcastic than others. However, sarcasm is very common in general, although it is difficult to recognize [6]. Sarcasm is a form of irony that occurs when there is a discrepancy between the literal and intended meanings of an utterance. This discrepancy is often used in the form of contempt to represent the distance from the previous statement [5]. But in general, people not only sarcastically make jokes and humor in their daily lives but also when criticize, speak out about ideas, and events. Therefore, it is widely used on social networks, especially microblogging sites such as Twitter.
A particular challenge when detecting sarcasm is the difficulty of getting ground truth annotations. Human-annotated datasets typically contain only thousands of texts, resulting in small datasets. In comparison, automated data acquisition using distant supervision signals like hashtags produced much larger datasets [4]. However, the automatic approach also caused label noise. For example, it is found that almost half of tweets that have a sarcastic hashtag in their dataset are actually not sarcastic. There are also examples of noise-free datasets where the sarcastic tweets are labeled by their authors, but they contain a huge amount of imbalanced data, resulting in a very low performance [5].
In recent years, various research has been conducted to build a model that can predict sarcastic tweets on Twitter. However, only a few research was focusing on data augmentation techniques in sarcasm detection problems. Some of the research was reviewed in the next couple of paragraphs.
In 2020, Hankyol Lee et al. proposed, CRA (Contextual Response Augmentation), which utilizes conversational context to generate a meaningful context for the training sample. In their research, they experiment to make data augmentation with labeled data in Twitter datasets from the FingLang2020 sarcasm challenge. Each training sample in the labeled dataset consists of a contextual utterance, a response, and its label (?SARCASM? or ?NOT_SARCASM?). To augment the not sarcasm data, they use context sequence as a new data point and label it as NOT_SARCASM. As for the sarcasm text, they use a simple back-translation method to augment the data. The labeled data were trained using an ensemble model mixed with BERT [7], BiLSTM [8], and NeXtVLAD [9]. The data augmentation technique gives a performance boost of only 0,66% with an 86.13% F1 Score compared to 85.46% F1 Score without using data augmentation in the Twitter dataset [10].
In 2021 Abeer Abuzayed et al. proposed seven fine-tuned BERT-based models with data augmentation to solve imbalanced data problems in ArSarcasm-v2 [11] dataset. The ArSarcasmv2 dataset was made to solve both sarcasm detection and sentiment analysis tasks. The sarcasm detection dataset contains Arabic tweets with an imbalance class which is 10380 data for non-sarcastic tweets and only 2168 data for sarcastic tweets. For the sentiment analysis task, the dataset contains neutral, negative, and positive sentiments. To solve the imbalanced data problem in the sarcasm detection task, they take the negative sentiment in the sentiment analysis task dataset and use it as their sarcastic labeled data. This is because, from their analysis, they found that 89% of data labeled as sarcastic were having negative sentiments at the same time. Thus, they theorize that negative tweets can be sarcastic too. Seven BERT-based models were used to detect the sarcastic tweets and MARBERT give the best score with an F1 Score for sarcastic label data before augmentation and give an 80% F1 score for sarcastic label data after performing data augmentation [12].
Recently in 2022, another research by Shubham K and Mosab S experimented on two different data augmentation techniques by using word embeddings and repeating instances technique on the original dataset which is an internal dataset provided by the organizer in SemEval-2022 task. In their experiment, they use original dataset and performing the dataset augmentation using word embedding technique with GloVe [13] and original-repetition technique to augment the data. The experiment was carried out by building a model using each of those augmentation techniques to predict sarcasm tweets. The model was trained using BERT [7] model and provided 79% of test accuracy before performing data augmentation, 81% for the original-repetition data augmentation technique, and 83% test accuracy when using the word-embedding augmentation technique [14].
Another research in 2022 that has been made at SemEval-2022 task was focusing on the analysis of sarcasm detection when using data augmentation. That research was proposed by Amirhossein A, et al. by proposing a mutation-based data augmentation technique. Mutation-based augmentation technique work by performing three processes in sequential order, starting with shuffling, deleting, and replacing. Thus, the augmentation work by first shuffling the sentences, deleting some words, and replacing the word with their synonym which has been taken from the Thesaurus website. The best results were given when building the prediction model using RoBERTa [15] with a 41.4 % of F1 Score in iSarcasm [5] dataset [16].
Based on the previous work [10,12,14,16] it can be concluded that there are still few labeled sarcastic data resource for sarcasm detection. Thus, some research was performed to tackle the imbalanced data problem due to the lack of sarcastic labeled data. Various work has been made to explore many different data augmentation techniques. However, many of them performed the augmentation in their own private dataset, and only a few of them performed it in publicly available datasets. Their analysis was also still deficient, mainly because they only focused on the F1 Score metric and lack of confusion matrix analysis before performing the augmentation and after performing the data augmentation.
Therefore, to tackle those issues, in this research we make the following contributions:
(1) Application of three different pre-trained model, BERT [7], RoBERTa [15] and DistilBERT [17] on four different publicly available twitter sarcasm detection dataset which is iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20] to analyze the performance of those models in detecting sarcastic tweets
(2) Performing data augmentation and data duplication technique in various dataset like iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20] and analyze the performance impact in terms of F1 Score.
(3) Analyze the performance impact of data augmentation and data duplication techniques by analyzing the confusion matrix results using MCC performance and performing a deep analysis of confusion matrix results by comparing the amount of true positive, true negative, false positive, and false negative after predicting sarcastic text across multiple datasets [5, 18, 19, 20].
In this paper, we explore data augmentation in the sarcasm detection model based on two questions: (1) How to tackle the imbalanced dataset problem in some of the sarcasm tweet datasets? (2) What is the performance impact of data augmentation when predicting sarcasm sentences?
We analyze the performance impact when using data augmentation in four different datasets, iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20]. Data augmentation is applied in the sarcastic labeled data to increase the size of the data. GloVe [13] was used for word embeddings to generate more data by replacing similar words in a sentence like “good” with “happy” that express similar meanings. We use three pre-trained neural representation model which is BERT [7], RoBERTa [15] and DistilBERT [17] to predict sarcastic tweets. The results will be compared based on before and after performing data augmentation to increase sarcastic labeled data by 10%, 20%, and 30% respectively, and also comparing it with a simple data duplication technique.
The remainder of this paper is structured as follows. Section II describes some of the related work. Section III describes in detail our experiment method. Section IV illustrates our experiments, results, and discussion. Section V concludes the result of this work and mentions some scope for future research.
Acquiring large and reliable datasets has been quite a challenge for sarcasm detection. Due to the cost of annotation, manually labeled datasets such as iSarcasm [5] and SemEval-18 [9] typically contain only a few thousand data and have an unbalanced dataset. Automatic crawling datasets such as Ghosh [7] and Ptacek [8] generate much more data but the result is considerably noisy. As a case study, after examining some automatic crawling datasets, it is found that nearly half of the tweets with sarcastic labels are non-sarcastic [5].
Sarcasm identification tasks have been studied using a variety of methods, such as lexicon-based, traditional machine learning, deep learning, and hybrid approach. Several sarcasm detections had also been reviewed through Systematic Literature Review (SLR) to identify sarcasm in textual data [12]. This study consists of reviewing the datasets, pre-processing, feature engineering, classification algorithms, and performance measurements in the sarcastic identification literature. This study also found that content-based are the most common features used in sarcasm classification. In this study, they found that standard metrics such as precision, recall, accuracy, f-measure, and Area Under the Curve (AUC) are the most commonly used evaluation metric for evaluating classifiers performance. In addition, it is also found that the AUC performance measure is the best choice to measure the performance when the dataset is imbalanced, due to its robustness in resisting the imbalanced dataset.
Many researchers have analyzed the task of sarcasm identification. A study reported two methods, one for “Incongruent words-only” and the other is “all-words: method” in “Expect the Unexpected: Harnessing Sentence Completion for Sarcasm Detection” research by employing “Sentence completion” for sarcastic analysis [13]. Two datasets were used for evaluating purposes, including Twitter data collected by Riloff, et al. [14] which consist of 2278 tweets (506 sarcastic, and 1772 nonsarcastic). Another dataset was collected by [15], which contain balanced tweets that were manually labeled (752 sarcastic and 752 non-sarcastic). Meanwhile, WordNet and word2vec were used to measure the performance similarity. For evaluation, cross-validation with 2-fold was used and provided the overall F1 Score of 54% when Word2Vec similarity for the all-words method is applied. Furthermore, when using WordNet incongruous words-only method it gives a better accuracy with an F1 Score of 80.24%. In comparison, an 80.28% F1 Score is obtained when using the WordNet similarity and incongruous words-only method using 2-fold cross-validation.
The optimized model known as the Robustly Optimized BERT Approach (RoBERTa), used 10 times more data (160GB as compared with the 16GB originally exploited) and is trained with more epochs than the original BERT model (500K vs 100K), and 8-times larger batch sizes, and a byte-level BPE vocabulary instead of character-level vocabulary that was originally utilized. Another improvement was the dynamic masking technique used in RoBERTa instead of the single static mask used in BERT. Moreover, RoBERTa model removes the next sentence prediction objective used in BERT, following the recommendation by other studies that question the NSP loss term [11]. RoBERTa [15] have been used in various sarcasm identification tasks and achieved satisfactory result. A study performed a contextual embedding method with RoBERTa [15] and obtained an 82.44% F1 Score [21]. A study using RoBERTa [15] in sarcasm identification tasks with the Context Separators method achieves a 77.2% F1 Score [22]. A study in Transformers on sarcasm detection with context, obtained the highest performance results using RoBERTa [15] with a 77.22% F1 Score [23]. Another study in Transformer-based Context-aware Sarcasm Detection in Conversation Threads from Social -Media obtained a great result with a 78.3% F1 Score for Twitter and a 74.4% F1 Score for Reddit datasets [24].
A study in Transformers on sarcasm detection with context, obtained the highest performance results using RoBERTa [15] with a 77.22% F1 Score [23]. Another study in Transformer-based Context-aware Sarcasm Detection in Conversation Threads from Social-Media obtained a great result with a 78.3% F1 Score for Twitter and a 74.4% F1 Score for Reddit datasets [24].
Data augmentation is a well-known technique in computer vision for introducing different distributions of data to extend a dataset and improve a model's performance on multiple tasks. It is generally believed that the more data a neural network is trained on, the more effective it will be. Augmentations are applied through simple image transformations such as rotation, cropping, flipping, translation, and the addition of Gaussian noise. In one study, a data augmentation technique was used to increase the training data when training a deep neural network on the ImageNet [25] dataset; increasing the number of training samples reduced overfitting and increased model performance [26]. These techniques allow the model to learn additional patterns in the image and identify new positional aspects of the objects in it.
Similarly, data augmentation methods have been explored in text processing to improve model efficacy. One study replaced random words in sentences with their synonyms to generate augmented data and trained a recurrent network for sentence-similarity tasks [27]. Another study used word embeddings to generate augmented data, increasing the data size, and trained a multi-class classifier on tweet data; cosine similarity was used to find the nearest neighbor of a word vector, which replaced the original word, with the word selection done stochastically [28].
For information extraction, one study [29] applied data augmentation to a legal dataset [30]. Class-specific probability classifiers were trained to identify a particular contract-element type for each token in a sentence, categorizing each token based on the window surrounding it and using word embeddings obtained by pre-training a word2vec model [31] on unlabeled contract data.
That study investigated three data augmentation methods: interpolation, extrapolation, and random noise. These methods manipulate word embeddings to obtain new sets of sentence representations, and the study found that interpolation performed better than extrapolation and random noise [29].
Another study examined augmentation methods that inject Gaussian or Bernoulli noise into word embeddings for text classification tasks such as sentence classification, sentiment classification, and relation classification [32].
Given a set of tweets, we aim to classify each of them as sarcastic or not. We use data augmentation to increase the sarcastic-labeled data by 10%, 20%, and 30%, respectively. From each tweet we then extract a set of features with reference to the training set and use three different models, BERT [7], RoBERTa [15], and DistilBERT [17], to perform the classification. The features are extracted so as to cover the different types of sarcasm we identified. We also run an additional experiment with a data duplication technique, duplicating 20% of the sarcastic data, to further analyze the impact of data augmentation on sarcasm detection. The workflow of our proposed methodology is shown in Figure 1.
We conduct sarcasm detection experiments on two automatically collected datasets and two manually annotated datasets. The two automatically collected datasets, Ptacek [19] and Ghosh [18], treat tweets containing particular hashtags (such as #sarcastic, #sarcasm, and #not) as sarcastic, and other tweets as non-sarcastic.
The iSarcasm [5] dataset contains tweets written by participants of an online survey and is thus an example of intended sarcasm detection, while SemEval-18 [20] consists of both sarcastic and ironic tweets labeled by third-party annotators and is thus used for perceived sarcasm detection. The Ghosh [18] and Ptacek [19] datasets are balanced. For all datasets, we take the predefined train, validation, and test splits from [4], use their training set as our training set, and concatenate their validation and test sets to form our test set. Our proposed method therefore splits each dataset in two: train and test data.
Details of every dataset used in this research, namely iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20], are given in Table 1.
The proportion of sarcastic and non-sarcastic data in each dataset is shown in Table 2.
Based on the distribution in Table 2, the most unbalanced dataset is iSarcasm [5]; the rest are fairly balanced.
Preprocessing is an integral part of any NLP study; it is done so that the removed parts of the text do not add weight or bias to the experiments. The preprocessing steps in this study are as follows.
First, the datasets were preprocessed using a lexical normalization tool for tweets [33]. We then cleaned the four datasets by dropping all duplicate tweets within and across datasets and trimming the texts to a maximum length of 100.
Next, we cleaned the tweet texts by removing URL links, hashtags, foreign-language characters, stop words, non-ASCII characters, and emoji. Lastly, we performed data augmentation.
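A minimal sketch of these cleaning steps is given below; the regular expressions, the illustrative stop-word subset, the reading of the length limit as 100 characters, and the toy data frame are all assumptions for illustration, not the exact configuration used in our experiments.

```python
import re
import pandas as pd

STOP_WORDS = {"the", "a", "an", "is", "and", "or"}  # illustrative subset only

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", "", text)     # URL links
    text = re.sub(r"#\w+", "", text)                 # hashtags
    text = text.encode("ascii", "ignore").decode()   # emoji / non-ASCII / foreign scripts
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)[:100]                     # trim to a maximum length of 100

# Toy frame standing in for the concatenated four datasets.
df = pd.DataFrame({"tweet": [
    "good morning, please go and vote! http://t.co/x #EUelections2019",
    "good morning, please go and vote! http://t.co/x #EUelections2019",
]})
df = df.drop_duplicates(subset="tweet")              # duplicates within/across datasets
df["tweet"] = df["tweet"].map(clean_tweet)
```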
We use data augmentation to increase the amount of sarcastic-labeled text and make the datasets more balanced, and we then observe the impact on performance.
To make the report conclusive, we apply data augmentation to the sarcastic-labeled text of all four datasets, balanced or not, so that the impact can be compared. GloVe [13] word embeddings were used to generate more data by replacing words in sentences with words of similar meaning. We perform data augmentation at three sizes, increasing the sarcastic-labeled text by 10%, 20%, and 30%; in addition, we perform data duplication at a 20% size. The amount of sarcastic data in every dataset after augmentation is given in Table 3.
To generate the augmented text, we use the nlpaug [34] WordEmbsAug augmenter with the substitute action and GloVe [13] embeddings. That is, we use the pre-trained GloVe [13] word embedding model glove.6B.100d to search for words similar to selected words in a sentence and replace them with the most similar word; the augmentation in this proposed work is therefore word-level. To further illustrate how the augmentation works in the proposed method, the process is visualized in Figure 2.
As Figure 2 shows, the augmentation process starts by feeding the original text to GloVe [13] to obtain a vector representation of each word. The nlpaug [34] library then randomly picks a proportion of 0.2 (20%) of the words in the text and uses GloVe [13] to perform synonym replacement, selecting the word whose embedding is most similar by cosine similarity from the pre-trained glove.6B.100d model. The selected words in the original text are then replaced with these synonyms.
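A minimal sketch of this step with nlpaug follows; the local path to the GloVe vectors, the data-frame layout, and the sampling seed are assumptions, and aug_p=0.2 reflects our reading of the 0.2 word-replacement proportion above.

```python
import pandas as pd
import nlpaug.augmenter.word as naw

# Substitute ~20% of the words in each tweet with their nearest neighbours
# in the glove.6B.100d embedding space (cosine similarity).
aug = naw.WordEmbsAug(
    model_type="glove",
    model_path="glove.6B.100d.txt",  # assumed local path to the vectors
    action="substitute",
    aug_p=0.2,
)

sarcastic = train_df[train_df["labels"] == 1]            # assumed column layout
extra = sarcastic.sample(frac=0.2, random_state=0).copy()  # the +20% scenario
# Recent nlpaug versions return a list; take element [0] if needed.
extra["text"] = extra["text"].map(aug.augment)
train_aug = pd.concat([train_df, extra], ignore_index=True)
```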
An example of an original text, taken from the iSarcasm [5] dataset, and the corresponding augmented text is shown in Table 4.
The results in Table 4 show that the augmentation substitutes only a small portion of the words in the original text, because our proposed model augments only a 0.2 proportion of the words. We find that with such a small percentage of word replacement we can keep the context of the sentence and, more importantly, the sarcastic nature of the original sentence.
In this paper, we experiment with three pre-trained deep learning NLP (natural language processing) models: BERT [7], RoBERTa [15], and DistilBERT [17]. We use the training dataset to tune the hyper-parameters of these models. For the bigger, automatically collected datasets, Ptacek [19] and Ghosh [18], we use a batch size of 32 and 8 epochs, while for the two smaller, manually annotated datasets, iSarcasm [5] and SemEval-18 [20], we use a batch size of 16 and 13 epochs. The hyper-parameter settings for these models are detailed in Table 5.
Most of the hyper-parameter values were taken from the defaults of the pre-trained models. In addition, we conducted a grid-search experiment to obtain the optimal values of num_train_epochs and train_batch_size.
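The parameter names in Table 5 match the ClassificationArgs of the simpletransformers library, so a plausible fine-tuning sketch, assuming that toolkit, a "roberta-base" checkpoint, and data frames with text/labels columns, might look like the following; none of these specifics are confirmed by the paper.

```python
from simpletransformers.classification import ClassificationModel, ClassificationArgs

args = ClassificationArgs(
    max_seq_length=40,
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_ratio=0.2,
    max_grad_norm=1.0,
    num_train_epochs=13,   # 8 for the larger Ghosh and Ptacek datasets
    train_batch_size=16,   # 32 for the larger Ghosh and Ptacek datasets
    fp16=True,
    manual_seed=128,
)

# "roberta-base" is an assumed checkpoint; swap in "bert-base-uncased" or
# "distilbert-base-uncased" for the other two models.
model = ClassificationModel("roberta", "roberta-base", args=args)
model.train_model(train_df)                       # columns: ["text", "labels"]
preds, _ = model.predict(test_df["text"].tolist())
```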
This section presents the results of all the experiments. The experiments start by analyzing the performance of the three models, BERT [7], RoBERTa [15], and DistilBERT [17], for detecting sarcastic tweets in the four datasets: iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20]. Next, we analyze the performance of every model after data augmentation and data duplication and compare the results. We first discuss how each size configuration of the data augmentation and data duplication techniques affects the performance of the sarcasm detection models, and then discuss the confusion matrix, using true positive, true negative, false positive, and false negative analysis to further examine the impact of data augmentation and data duplication across datasets.
Performance is reported in terms of the F1 Score and the Matthews correlation coefficient (MCC). The F1 Score is an appropriate metric here because of the imbalance of the iSarcasm [5] dataset. MCC is used to better analyze the confusion matrix: it depends on the negative as well as the positive results, and thus tells us more about a model's ability to predict both sarcastic and non-sarcastic sentences, whereas the F1 Score focuses on the model's ability to predict sarcastic sentences.
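For reference, both metrics follow their standard definitions over the confusion-matrix counts (TP, TN, FP, FN):

```latex
\mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},
\qquad
\mathrm{MCC} = \frac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}
{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}
```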
The performance of the three models, BERT [7], RoBERTa [15], and DistilBERT [17], before and after data augmentation increasing the sarcastic data by 10%, 20%, and 30%, and after data duplication increasing the sarcastic data by 20%, is shown in Table 6.
The first thing to notice in Table 6 is that the best-performing model for detecting sarcastic sentences is RoBERTa [15]: it gives the highest F1 Score and MCC in every experimental scenario, both before and after augmentation, on every dataset. This is likely because RoBERTa [15] uses a byte-level Byte-Pair Encoding (BPE) scheme in its tokenization, in contrast to models such as BERT [7] and DistilBERT [17].
BPE ensures that the most common words are represented in the vocabulary as single tokens, while rare words are broken into two or more subword tokens; it thereby represents the corpus with the fewest tokens, which is the main goal of the BPE algorithm, namely data compression. Fewer tokens in the corpus help the model solve the classification task, in this case deciding whether a sentence is sarcastic or not. Another important factor that makes RoBERTa [15] a better model is that it was trained on a larger dataset of over 160 GB of uncompressed text from various sources.
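To illustrate the tokenization difference, here is a small sketch using the Hugging Face transformers tokenizers; the subword splits in the comments are indicative only, since the exact splits depend on each model's learned vocabulary.

```python
from transformers import AutoTokenizer

roberta_tok = AutoTokenizer.from_pretrained("roberta-base")    # byte-level BPE
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

print(roberta_tok.tokenize("a totally unsarcastic remark"))
# common words stay whole; a rare word is split into subwords,
# e.g. something like ['uns', 'arc', 'astic']
print(bert_tok.tokenize("a totally unsarcastic remark"))
# WordPiece marks continuation pieces with '##',
# e.g. something like ['un', '##sar', '##cas', '##tic']
```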
The RoBERTa [15] results in Table 6 show that different conclusions can be drawn about the impact of data augmentation and data duplication on different datasets. iSarcasm [5], a very imbalanced dataset, gives its best result when 20% sarcastic data is added with the data augmentation technique, with an F1 Score of 40.44% compared to 38.34% before augmentation. This means data augmentation can give a slight performance boost on an imbalanced dataset like iSarcasm [5]. However, if too much data is augmented, performance can drop because of the introduced bias, and if too little is augmented there is no visible difference. The same holds for MCC: as the model becomes better at predicting sarcastic sentences, MCC also gains a slight boost, reaching 30.84% compared to 28.42% before augmentation.
For RoBERTa [15] on the Ghosh [18] and Ptacek [19] datasets, which are large and moderately balanced, the augmentation technique did not increase F1 or MCC. However, the data duplication technique gave both datasets a slight boost: an 80.42% F1 Score compared to 79.04% before for Ghosh [18], and an 87.41% F1 Score compared to 85.96% before for Ptacek [19]. Data augmentation reduces performance in this case, likely because generating data as a percentage of a large dataset produces a large amount of new data and thus introduces more bias. MCC likewise improved slightly on Ghosh [18] and Ptacek [19], reaching 65.41% and 74.74% compared to 62.99% and 74.54%, respectively.
Based on Table 6, RoBERTa [15] performance on the SemEval-18 [20] dataset, which is small and very balanced, shows a slight increase: a 67.46% F1 Score after augmenting with 30% additional sarcastic data, compared to 66.37% before. MCC shows the same pattern, increasing slightly to 43.82% from 41.09%. Given how little data this dataset contains, augmentation has no significant impact on performance here.
Overall, the F1 Score and MCC comparison in Table 6 shows that duplicating data provides a greater performance boost than data augmentation on big datasets like Ghosh [18] and Ptacek [19]. Duplication makes the model more sensitive to the relevant keywords in the duplicated data, whereas augmentation can turn the sarcasm keywords into entirely new words, introducing noise and degrading performance. However, on small datasets like iSarcasm [5] and SemEval-18 [20], data augmentation at the right size gives better results than the data duplication technique; and on small, imbalanced data like iSarcasm [5], data augmentation clearly outperforms data duplication.
A comparison of our proposed model with other work on detecting sarcastic sentences is given in Table 7.
Based on the F1 Score and MCC results in Table 6, the best model for detecting sarcastic tweets across iSarcasm [5], Ghosh [18], Ptacek [19], and SemEval-18 [20] is RoBERTa [15]; our confusion matrix analysis therefore focuses on this model. Figures 3 to 6 present the changes in true positive, true negative, false positive, and false negative counts, respectively.
A true positive (TP) in Figure 3 means that a tweet predicted as sarcastic is indeed sarcastic according to the experimental evaluation. A true negative (TN) in Figure 4 means that a tweet predicted as non-sarcastic is validated as non-sarcastic. A false positive (FP) in Figure 5 is a tweet predicted as sarcastic whose original label is non-sarcastic. A false negative (FN) in Figure 6 is a tweet predicted as non-sarcastic whose original label is sarcastic; that is, the model fails to detect the sarcastic tweet.
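These four quantities can be read directly off the confusion matrix; a short sketch follows, where the toy y_true and y_pred arrays stand in for the gold and predicted labels of a test set.

```python
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0]   # toy gold labels (1 = sarcastic)
y_pred = [1, 0, 0, 1, 0, 1]   # toy model predictions

# For binary labels, ravel() yields the four quadrants in this fixed order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

f1 = f1_score(y_true, y_pred)             # positive class = sarcastic
mcc = matthews_corrcoef(y_true, y_pred)

def pct_change(after: int, before: int) -> float:
    """Relative change of a quadrant count, as plotted in Figures 3 to 6."""
    return 100.0 * (after - before) / before
```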
The true positive results in Figure 3 show that in the iSarcasm [5] dataset the number of true positives increased most, by 5.88%, at 20% data augmentation, meaning that more tweets were correctly predicted as sarcastic. For the SemEval-18 [20] dataset, however, true positives decreased by 1.51% at 20% augmentation even though the F1 Score increased according to Table 6. The Ghosh [18] and Ptacek [19] datasets show true positive increases of 1.22% and 1.24%, respectively, with the data duplication technique, in line with the F1 Score increases in Table 6.
The true negative results in Figure 4 show that in the SemEval-18 [20] dataset, true negatives increased by 7.14% at 20% data augmentation, meaning more tweets were correctly predicted as non-sarcastic. The other datasets, iSarcasm [5], Ghosh [18], and Ptacek [19], occasionally show similar trends, indicating that augmenting the data helps the model learn non-sarcastic data.
The false positive results in Figure 5 show that, overall, data augmentation reduced the number of false positives in all datasets: the model less often misclassifies non-sarcastic text as sarcastic.
The false negative results in Figure 6 show that every balanced dataset, Ghosh [18], Ptacek [19], and SemEval-18 [20], exhibits a trend of increasing false negatives, meaning that with more sarcastic data the model is also more likely to mispredict sarcastic data. This does not hold for an imbalanced dataset like iSarcasm [5], where false negatives decreased by 2.86% at 20% data augmentation.
From the confusion matrix analysis in Figures 3 to 6, it can be concluded that data augmentation mainly improves the model's prediction of non-sarcastic texts, since every dataset shows a significant increase in true negatives. The true positive results mostly show only a slight increase, or even a decrease, indicating no improvement in predicting sarcastic texts. The exception is an imbalanced dataset like iSarcasm [5], where the right amount of data augmentation gives a clear improvement in positive results.
In this research, we worked with four datasets, augmented the data under three data augmentation scenarios, and performed an additional data duplication experiment. The augmentation results show that, in general, data augmentation does not significantly affect a model's ability to predict sarcastic text, but it does increase its ability to predict non-sarcastic text. However, data augmentation with the correct size and word embedding model on an unbalanced dataset like iSarcasm [5] can give a significant F1 Score increase of 2.1%. To investigate further, we performed data duplication, and the results show that duplication gives a larger F1 Score boost than augmentation on big datasets like Ghosh [18] and Ptacek [19]: duplication makes the model more sensitive to the relevant keywords in the duplicated data, whereas augmentation can generate data that changes those keywords and introduces noise, causing a performance drop. This does not hold for smaller datasets like iSarcasm [5] and SemEval-18 [20], where percentage-based augmentation generates only small amounts of data, making augmentation the better technique for increasing performance there.
In the future, further investigation can explore other data augmentation techniques such as the insert, swap, or delete methods, and other pre-trained word embedding models such as word2vec or fastText for generating data. Moreover, contextual word embeddings from various pre-trained models, or combinations of several augmentation techniques, could be used to generate new sentences.
Figure 1. Workflow of the proposed methodology.
Figure 2. Data augmentation process.
Figure 3. True positive quantity change in percentage after performing data augmentation across multiple datasets.
Figure 4. True negative quantity change in percentage after performing data augmentation across multiple datasets.
Figure 5. False positive quantity change in percentage after performing data augmentation across multiple datasets.
Figure 6. False negative quantity change in percentage after performing data augmentation across multiple datasets.
Table 1. Distribution of training and test data in various datasets.
Dataset | Train | Test | Total |
---|---|---|---|
iSarcasm | 3,463 | 887 | 4,350 |
Ghosh | 37,082 | 4,121 | 41,203 |
Ptacek | 56,677 | 6,298 | 62,975 |
SemEval-18 | 3,776 | 780 | 4,556 |
Table 2. Proportion of sarcastic and non-sarcastic data in various datasets.
Dataset | Non-sarcastic | Sarcastic | % Sarcasm |
---|---|---|---|
iSarcasm | 3,584 | 766 | 17.62% |
Ghosh | 22,725 | 18,478 | 44.84% |
Ptacek | 31,810 | 31,165 | 49.50% |
SemEval-18 | 2,379 | 2,177 | 49.12% |
Table 3. Amount of sarcastic data after performing data augmentation in each dataset.
Dataset | Sarcastic data with (10% augmentation) | Sarcastic data with (20% augmentation) | Sarcastic data with (30% augmentation) |
---|---|---|---|
iSarcasm | 820 | 875 | 929 |
Ghosh | 20,147 | 21,816 | 23,485 |
Ptacek | 33,970 | 36,776 | 39,581 |
SemEval-18 | 2,363 | 2,550 | 2,736 |
Table 4. Augmented text results.
Original text | Augmented text |
---|---|
good morning, please go and vote ! it only takes minutes and a low turnout will hand victory to the brexit party e uelections 2019 | good morning, please go and vote! it only takes left and very low turnout will right victory this the brexit party u uelections 2019 |
Table 5. Hyper-parameter settings.
Hyper-Parameter | Value |
---|---|
max_seq_length | 40 |
learning_rate | 0.00001 |
weight_decay | 0.01 |
warmup_ratio | 0.2 |
max_grad_norm | 1.0 |
num_train_epochs | 8,13 |
train_batch_size | 16,32 |
fp16 | true |
manual_seed | 128 |
Table 6. Performance comparison in terms of F1 Score and MCC for three models on four datasets, before and after augmentation at various sizes and with the data duplication technique.

Model | Dataset | Before aug. F1 | Before aug. MCC | Aug. 10% F1 | Aug. 10% MCC | Aug. 20% F1 | Aug. 20% MCC | Aug. 30% F1 | Aug. 30% MCC | Dup. 20% F1 | Dup. 20% MCC
---|---|---|---|---|---|---|---|---|---|---|---
BERT | iSarcasm | 0.2421 | 0.3450 | 0.2499 | 0.3217 | 0.2567 | 0.2922 | 0.2410 | 0.3406 | |
BERT | Ghosh | 0.7846 | 0.6184 | 0.7760 | 0.6042 | 0.7693 | 0.5963 | 0.7732 | 0.6048 | |
BERT | Ptacek | 0.8596 | 0.7184 | 0.8602 | 0.7192 | 0.8598 | 0.7190 | 0.8587 | 0.7165 | |
BERT | SemEval-18 | 0.6442 | 0.3642 | 0.6503 | 0.3825 | 0.6257 | 0.3505 | 0.6259 | 0.3578 | |
RoBERTa | iSarcasm | 0.3834 | 0.2842 | 0.3809 | 0.2964 | 0.3828 | 0.2939 | 0.3925 | 0.2914 | |
RoBERTa | Ghosh | 0.7904 | 0.6299 | 0.7830 | 0.6284 | 0.7758 | 0.6193 | 0.7835 | 0.6294 | |
RoBERTa | Ptacek | 0.8735 | 0.7454 | 0.8738 | 0.7491 | 0.8727 | 0.7469 | 0.8717 | 0.7442 | |
RoBERTa | SemEval-18 | 0.6637 | 0.4109 | 0.6666 | 0.4286 | 0.6707 | 0.4362 | 0.6627 | 0.4134 | |
DistilBERT | iSarcasm | 0.3083 | 0.2080 | 0.2924 | 0.1890 | 0.2784 | 0.1926 | 0.2991 | 0.2224 | |
DistilBERT | Ghosh | 0.7831 | 0.6148 | 0.7651 | 0.5854 | 0.7620 | 0.5799 | 0.7571 | 0.5700 | |
DistilBERT | Ptacek | 0.8542 | 0.7101 | 0.8538 | 0.7094 | 0.8508 | 0.7055 | 0.8569 | 0.7163 | |
DistilBERT | SemEval-18 | 0.6066 | 0.3180 | 0.6240 | 0.3534 | 0.6366 | 0.3822 | 0.6130 | 0.3261 | |
Table 7. Comparison with similar work.
Authors | Techniques used | Discussions |
---|---|---|
Xu Guo et al. [4] | Model: BERT with Latent-Optimization Method | F1 Score: |
Amirhossein et al. [16] | Model: BERT-based Data | F1 Score: |
Our proposed method | Model: BERT, RoBERTa, DistilBERT | F1 Score: |