International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(2): 83-92
Published online June 25, 2024
https://doi.org/10.5391/IJFIS.2024.24.2.83
© The Korean Institute of Intelligent Systems
Ho-Seung Kim1 and Jee-Hyong Lee2
1Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Korea
2Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Korea
Correspondence to: Jee-Hyong Lee (john@skku.edu)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Data augmentation generates additional samples to expand a dataset. The modification method is a commonly used augmentation technique because of its simplicity: it modifies the words in sentences using simple rules. It has the advantages of low complexity and cost, because it only needs to scan a dataset sequentially without complex computation. Despite its simplicity, it has a drawback: it uses only the corpus of the training dataset, leading to repeated learning of the same words and limited diversity. In this study, we propose STOP-SYM, which is simpler and more effective than previous methods while addressing this drawback. Previous simple data-augmentation methods used various operations, such as deletion, insertion, replacement, and swapping, to inject diverse noise. The proposed method, STOP-SYM, generates sentences by simply inserting words. STOP-SYM uses the intersection of out-of-vocabulary (OOV) words and synonyms of stopwords. OOV words draw on a corpus beyond the training dataset, and synonyms of stopwords act as white noise that minimally affects training. By inserting these words into sentences, augmented samples that increase the diversity of the dataset can be easily obtained. Compared with recent simple data-augmentation methods, our approach demonstrates superior performance. We also conducted comparative experiments on various text-classification datasets and against a GPT-based model to demonstrate its superiority.
Keywords: Data augmentation, Natural language processing, Text classification
Data augmentation is essential in all fields of artificial intelligence (AI) that deal with data, including natural language processing (NLP) and computer vision (CV) [1, 2]. Deep-learning models learn from a limited training dataset to solve given tasks [3]. When data are scarce, a model may not obtain sufficient information from the dataset. Data augmentation generates new samples from the training dataset and helps solve tasks with insufficient data. In NLP, data-augmentation methods typically generate new sentences, and sentence generation can be categorized into two approaches. The first is a generation method that uses large language models (LLMs) such as T5 and GPT to generate new sentences [4, 5]. The second is a modification method that uses simple rules to generate sentences [6].
Generation methods, such as those based on GPT, utilize human-crafted prompts for augmentation [7]. These methods can generate varied sentences; however, they also have drawbacks. One is that human experts must create new prompts, particularly when dealing with data from different domains or languages [8]. Moreover, as LLMs grow, they require more computational resources and memory owing to their increasing number of parameters [9]; this makes them difficult to apply to large amounts of data. The modification method modifies parts of the original sentence based on simple rules, using operations such as deletion, swapping, replacement, and insertion [10]. This approach is simple and cost-effective and can easily be applied to large amounts of data. Its drawback is limited diversity, because it modifies the original sentence with simple operations; if the diversity of the generated sentences is limited, the effect of augmentation is also limited [11]. Previous modification methods attempted to increase diversity by adding noise to the original sentences through various operations, such as deletion, swapping, replacement, and insertion [12, 13]. However, the effectiveness of these operations has not been proven. We experimentally confirmed that using various operations for data augmentation is ineffective; the results are presented in Section 2.
We propose a simpler and more effective augmentation method called STOP-SYM. It uses only one operation, insertion, together with out-of-vocabulary (OOV) tokens. Because OOV tokens do not appear in the training dataset, inserting them into the original sentences may increase the diversity of the augmented sentences [14]. However, inserting a new word may change the meaning of a sentence. We therefore selected OOV tokens with representations similar to stopwords and inserted them into the original sentences. Because stopwords only slightly change the meaning of sentences [15], inserting OOV tokens similar to stopwords generates new sentences with higher diversity whose meanings remain close to the originals. Using the augmented samples generated by the proposed method, we observed a significant improvement in performance: a 46.4% improvement in limited-data situations and a 4.9% improvement overall. We demonstrated the superiority of our method over previous methods, including GPT-based models, on five classification datasets covering diverse tasks.
There are many generation methods. Before the widespread use of LLMs, methods such as back-translation, which generates sentences with meanings similar to the original sentence, were commonly employed [16]; most other methods relied on seq2seq models [17]. Recently, with the widespread adoption of LLMs, they have been extensively used for data augmentation. LAMBADA [18], SC-GPT [19], and GPT3Mix [20] use GPT models. GPT3Mix is a recent data-augmentation method that utilizes GPT-3, an LLM. GPT3Mix operates in three steps: first, it samples data from the dataset to create mixed data by selecting two sentences with different labels. Next, it constructs a GPT3Mix prompt from these mixed samples, which serves as a prompt for the LLM to generate new sentences. Finally, it leverages the soft labels predicted by the language model to extract knowledge while simultaneously generating text transformations. This method achieves superior performance compared with simple data-augmentation techniques; however, it is more complex and relies on the performance of GPT-based models.
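For illustration, the sketch below shows the kind of mixed-sample prompt described above; the exact template and label format used by GPT3Mix [20] differ, so the wording here is an assumption.

```python
# Hypothetical GPT3Mix-style prompt: two examples with different labels are
# embedded, and the LLM is asked to continue with a new labeled example.
# The template wording is an assumption, not the original GPT3Mix template.
def build_mix_prompt(example_a, example_b, task="movie review"):
    return (
        f"Each item below is a {task} and its sentiment label.\n"
        f"Review: {example_a['text']} (Sentiment: {example_a['label']})\n"
        f"Review: {example_b['text']} (Sentiment: {example_b['label']})\n"
        "Review:"
    )


prompt = build_mix_prompt(
    {"text": "A moving, well-acted drama.", "label": "positive"},
    {"text": "Too clumsy in key moments to make a big splash.", "label": "negative"},
)
```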
Unlike generation models with numerous trainable parameters, modification methods employ simple operations that modify parts of a sentence to introduce noise. Previous methods, such as hierarchical data augmentation (HDA) [21], mention replacement (MR) [22], and shuffle within segments (SiS) [23], adopted various operations, such as deleting unimportant parts, replacing words with others of the same entity type, and shuffling segments. AEDA (an easier data augmentation) [24] is a data-augmentation method that relies solely on insertion: it inserts punctuation marks chosen randomly from six options, {‘.’, ‘;’, ‘?’, ‘:’, ‘!’, ‘,’}, to create a new sentence. Although intuitive and effective, its diversity is relatively low and its performance is limited. Easy data augmentation (EDA) [25] is a representative modification method that generates data samples by applying four operations: synonym replacement (SR), random insertion (RI), random swap (RS), and random deletion (RD). Although EDA offers a straightforward way to perform data augmentation, using various operations does not guarantee the diversity of the augmented sentences. We verified how each operation affects the performance of EDA; the results are listed in Table 1. Notably, there was no significant difference between using all four operations and using a single one. In particular, even in the full-data scenario, insertion alone matched EDA. It appears that the four operations do not effectively inject diverse noise.
In this section, we propose a new method that uses one operation, insertion, and OOV tokens. We aimed to simplify the noise injection using only insertion. We utilize OOV tokens to introduce noise, but prevent semantic degradation by selecting synonyms of stopwords from the OOV tokens. This approach is more effective and yields better performance than previous methods.
In this section, we explain the selection of OOV words and synonyms of stopwords. We expect to increase the diversity of sentences by inserting words not present in the training dataset. Inserting any word can easily increase diversity; however, we must avoid critical words that significantly influence class determination. Critical words occur frequently in specific classes; therefore, we used OOV words to avoid them. If OOV words are inserted uniformly into all sentences regardless of class, the probability of them becoming critical words is low. Thus, using OOV words prevents the insertion of critical words.
However, using OOV words alone cannot prevent the semantic degradation they may cause. To select words that cause little semantic degradation, we propose using meaningless words, namely synonyms of stopwords. Stopwords are common words that occur frequently but contribute little to the meaning of a sentence. Because stopwords are likely to be present in most datasets, finding stopwords that are OOV is almost impossible. Therefore, we used synonyms of stopwords, which have meanings similar to stopwords. We selected words that are both OOV and synonyms of stopwords as candidates for insertion. With these words, we can easily increase the diversity of the augmented sentences while preventing semantic degradation.
The stopwords were obtained from the NLTK package, and their synonyms were obtained from WordNet synsets. There were 179 stopwords, and these stopwords had 815 synonyms. From these synonyms, the OOV words not present in the training dataset were selected as candidates. Table 2 lists the number of candidates for each dataset.
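A minimal sketch of this selection step, assuming an NLTK environment with the ‘stopwords’ and ‘wordnet’ corpora downloaded; `train_vocab`, the set of words appearing in the training corpus, is a placeholder the caller must supply.

```python
from nltk.corpus import stopwords, wordnet


def select_candidates(train_vocab):
    """Intersect WordNet synonyms of stopwords with OOV words."""
    stop_words = set(stopwords.words("english"))  # 179 stopwords per the paper

    # Gather synonyms of every stopword from WordNet synsets.
    synonyms = set()
    for word in stop_words:
        for synset in wordnet.synsets(word):
            for lemma in synset.lemmas():
                synonyms.add(lemma.name().lower().replace("_", " "))

    # Keep only synonyms that never appear in the training corpus (OOV).
    return {w for w in synonyms if w not in train_vocab}


# train_vocab would be built by scanning the training sentences once, e.g.:
candidates = select_candidates({"too", "clumsy", "in", "key", "moments"})
```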
In this section, we explain how we conducted data augmentation using the selected stopword synonyms. We inserted randomly chosen synonyms of stopwords at random positions in a sentence, repeating this random insertion several times per sentence to generate each augmented sample.
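A sketch of the insertion step; the number of insertions per sentence (`n_insert`) and samples per sentence (`n_samples`) are assumed hyperparameters, not values prescribed by the paper.

```python
import random


def stop_sym_augment(sentence, candidates, n_insert=2, n_samples=4):
    """Create augmented sentences by inserting random candidate words."""
    words = sentence.split()
    pool = list(candidates)
    samples = []
    for _ in range(n_samples):
        augmented = words[:]
        for _ in range(n_insert):
            position = random.randint(0, len(augmented))  # any gap, incl. ends
            augmented.insert(position, random.choice(pool))
        samples.append(" ".join(augmented))
    return samples


# Example: four augmented variants of one SST-2 sentence.
print(stop_sym_augment("too clumsy in key moments to make a big splash",
                       {"maine", "tween", "coiffe", "sulphur"}))
```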
Table 3 lists examples of augmented samples. Examining the augmented sentences, we observe that many infrequently used or less meaningful words have been inserted. The proposed method introduces additional words with OOV and stopword-like characteristics while preserving the original input tokens. The augmented sentences can be considered more diverse than those generated by previous methods. Figure 1 visualizes sentences augmented with STOP-SYM and with EDA; the sentences were vectorized and projected with t-SNE. As shown, the data augmented with STOP-SYM are more scattered, indicating greater diversity than the data augmented with EDA.
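A sketch of this diversity visualization; the paper does not specify the sentence vectorizer, so TF-IDF vectors are an assumption here, and the t-SNE settings follow scikit-learn defaults.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def plot_diversity(groups):
    """groups: dict mapping a method name to its list of sentences."""
    names, sentences = [], []
    for name, sents in groups.items():
        names += [name] * len(sents)
        sentences += sents
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    # perplexity must be smaller than the number of sentences
    points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
    for name in groups:
        idx = [i for i, n in enumerate(names) if n == name]
        plt.scatter(points[idx, 0], points[idx, 1], label=name)
    plt.legend()
    plt.show()
```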
For a fair comparison with the EDA and AEDA methods, we used the same code base. We fine-tuned the pretrained language model BERT (bidirectional encoder representations from transformers) on the augmented samples together with the original training samples, and measured accuracy on the test samples as the primary metric. The pretrained model was the BERT-base model provided by Hugging Face. The code was executed on a GeForce RTX 2080 GPU with 12 GB of memory.
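A minimal fine-tuning sketch with the Hugging Face ‘transformers’ Trainer; the toy data and hyperparameters are illustrative assumptions, not the paper's settings.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Toy stand-ins: in the experiments, the augmented sentences are appended
# to the original training sentences before fine-tuning.
train_texts = ["too clumsy in key moments", "a moving and insightful film"]
train_labels = [0, 1]


class TextDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=TextDataset(train_texts, train_labels)).train()
```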
The performance of the proposed method was evaluated in various environments. Our method is expected to be applicable to all text-classification tasks. To validate this, we used five text-classification datasets with different domains and sizes. SST-2 [26] is a movie-review sentiment analysis dataset with 6,920 training samples. SUBJ [27] is a subjectivity classification dataset of movie reviews with 8,000 training samples. PC [28] is a pros-and-cons classification dataset with 39,428 training samples. CR [29, 30] is a customer-review sentiment analysis dataset with 1,863 training samples. TREC [31] is a question classification dataset with 5,452 training samples.
We also evaluated performance while controlling the amount of training data. Simple modification methods are heavily influenced by data quantity: they are effective for small datasets but less effective for large ones because of their limited diversity. To demonstrate effectiveness regardless of data quantity, we varied the dataset ratio over 1%, 5%, 10%, 20%, 40%, 60%, 80%, and 100%. In addition, we compared the overall performance averaged over all data ratios.
Previous methods used sequential models; however, we used a pretrained language model. The EDA study [25] suggests that data augmentation may not be necessary for BERT because it is already pretrained on a large corpus with a vast vocabulary, and AEDA [24] primarily compared overall performance with RNN- and LSTM-based models. However, pretrained language models have since become essential. We therefore evaluated the effectiveness of our data-augmentation method on a pretrained language model, BERT [32]. We used the ‘bert-base-uncased’ model provided by Hugging Face as the base pretrained model and used the [CLS] token representation for classification.
To compare our augmentation method, we employed four modification methods: 1) EDA, a representative simple data-augmentation method [25]; 2) AEDA, a method similar to ours that relies solely on the insertion of punctuation marks [24]; 3) TF-Low, which selects unimportant words by their TF-IDF (term frequency-inverse document frequency) scores and randomly inserts them into the sentence [33]; and 4) STOP, which randomly selects stopwords and inserts them into the sentences [34]. A sketch of the TF-Low selection step follows.
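This sketch reflects our reading of the TF-Low baseline [33]: words with the lowest corpus-level TF-IDF scores are treated as unimportant insertion targets; the exact scoring granularity in [33] is an assumption here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def tf_low_candidates(corpus, k=100):
    """Return the k words with the lowest summed TF-IDF scores."""
    vectorizer = TfidfVectorizer()
    scores = vectorizer.fit_transform(corpus).sum(axis=0).A1  # score per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda pair: pair[1])
    return [term for term, _ in ranked[:k]]
```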
Table 4 lists the experimental results on the five datasets. Only Original uses the original data alone for fine-tuning. For each of the other methods, the original training data were augmented with that method, and both the augmented and original data were used for fine-tuning. Each experiment was repeated with five different seeds, and the results were averaged.
Because all datasets showed similar results, we first examine the SST-2 [26] dataset. STOP-SYM achieved outstanding overall performance, followed by AEDA, STOP, TF-Low, EDA, and Only Original. Consistent with the overall results, STOP-SYM performed well at most data ratios, showing that it is a highly effective augmentation method. Compared with Only Original, STOP-SYM achieved a 46.5% performance improvement on the 1% dataset and a 0.4% improvement on the full dataset; the overall performance improved by 5%. This demonstrates the effectiveness of the proposed method.
We demonstrated the effectiveness of our approach by comparing it with various previous methods. First, STOP-SYM was more effective than using the four operations together. To determine whether insertion alone is effective, we compared STOP-SYM with EDA: STOP-SYM achieved a 3.2% improvement in overall performance and performed well at every data ratio. Second, the words selected by the proposed method are well suited to insertion-based augmentation. Except for EDA, the baseline methods use only an insertion operation, and because some of them performed worse than EDA while others performed better, the effectiveness of insertion-based augmentation clearly depends on the insertion target. STOP-SYM achieved improvements of 0.7%, 2.1%, and 1.5% over AEDA, TF-Low, and STOP, respectively. This implies that better insertion targets yield better augmentation performance.
We also show that STOP-SYM performs robustly across classification tasks and domains and is not limited to sentiment classification. We applied STOP-SYM to various classification tasks, including subjectivity, pros-and-cons, and question classification. Compared with Only Original, STOP-SYM improved performance by 1.3% on SUBJ, 3% on PC, and 13.6% on TREC. Because STOP-SYM utilizes synonyms of stopwords that are not present in the training dataset, it performs robustly in various classification tasks regardless of their specific characteristics. In sentiment classification, STOP-SYM was effective in two domains, movie and customer reviews, with a 6.2% performance improvement on CR. Although STOP-SYM has not been tested in many domains, these results suggest that it has no domain-specific limitations and can achieve superior performance across domains.
In this section, additional experimental results are presented to verify the effectiveness of the proposed approach. We performed ablation studies on OOV words, stopword synonyms, and the number of augmented samples. We also present a comparison with a generation method, GPT3Mix.
In this section, we study the effectiveness of OOV words. We used the SST-2 dataset for evaluation, and the other experimental settings were the same as in the main experiment. We compared four methods: Only Original, SS-IV, SS-ALL, and SS-OOV. Only Original uses only the original training data; the others insert words selected from different vocabularies. Synonyms of stopwords-in vocabulary (SS-IV) uses only the synonyms of stopwords present in the training corpus. Synonyms of stopwords-all (SS-ALL) uses every synonym of stopwords. Synonyms of stopwords-out of vocabulary (SS-OOV) is the proposed method. Figure 2 shows the results. From the 40% ratio onward, all methods showed a similar pattern; therefore, we report results only up to 20%, where the differences between the methods are significant and easy to compare. The sketch below illustrates how the three candidate sets relate.
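The three variants can be expressed as set operations; a small sketch under the same `synonyms`/`train_vocab` assumptions as the selection sketch above:

```python
def split_variants(synonyms, train_vocab):
    """Partition stopword synonyms into the three ablation variants."""
    ss_all = set(synonyms)                           # SS-ALL: every synonym
    ss_iv = {w for w in ss_all if w in train_vocab}  # SS-IV: in-vocabulary
    ss_oov = ss_all - ss_iv                          # SS-OOV: proposed set
    return ss_iv, ss_all, ss_oov
```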
The results show that both SS-IV and SS-ALL perform significantly worse than SS-OOV, demonstrating the effectiveness of using OOV words as insertion targets. Interestingly, SS-IV and SS-ALL exhibited similar performance, which means that in-vocabulary synonyms of stopwords provide no substantial benefit; such augmentation can even have negative effects.
In this section, we examine whether stopword synonyms are effective. To assess this, we used words from the Brown Corpus [35] provided by NLTK instead of stopword synonyms. Because the SST-2 dataset has 321 stopword-synonym candidates, we randomly sampled 321 Brown Corpus words to ensure fairness. The method labeled Brown uses these 321 words as insertion targets; they are OOV but not synonyms of stopwords. Figure 3 compares the experimental results.
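A sketch of this Brown control condition; the filtering details (lowercasing, alphabetic tokens, deterministic seed) are assumptions made to keep the sample comparable.

```python
import random

from nltk.corpus import brown


def brown_oov_sample(train_vocab, stopword_synonyms, k=321, seed=0):
    """Sample k Brown Corpus words that are OOV and not stopword synonyms."""
    pool = sorted({w.lower() for w in brown.words()
                   if w.isalpha()
                   and w.lower() not in train_vocab
                   and w.lower() not in stopword_synonyms})
    random.seed(seed)
    return random.sample(pool, k)
```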
We used the SST-2 dataset for this experiment, maintaining the same setup as in Section 6.1. Our proposed SS-OOV outperformed both Only Original and Brown, which again suggests the effectiveness of OOV words for insertion. Moreover, the lower performance of Brown compared with SS-OOV shows that using OOV words that are synonyms of stopwords is more effective than randomly sampling OOV words.
We conducted an ablation study on the effect of the number of augmented samples on STOP-SYM performance. The number of samples per sentence was varied over {2, 4, 6, 8, 10, 16}, with sparser spacing at larger values for a broader analysis. We used the SST-2 dataset, and the experimental settings were the same as before. The results are summarized in Table 5.
With only 1% of the data, performance improved as the number of samples increased, suggesting that in limited-data scenarios more augmented data is beneficial. However, with sufficient data (the full dataset), performance tended to decrease beyond a certain number of samples. This may indicate overfitting, where an excessive number of augmented samples biases the model. From this experiment, we conclude that selecting an appropriate number of samples is important.
The advantages of modification methods are their low complexity and cost, which raises the question of how much performance is sacrificed in exchange, particularly compared with recent generation methods. To address this, we compared our results with GPT3Mix, a data-augmentation method based on LLMs. We cited the performance results from GPT3Mix [20] and adjusted the other conditions to align with our approach. Each method generated 10× as many samples as the original data. The original data subset ratios were set to {0.1%, 0.3%, 1%} to simulate few-shot scenarios. The BERT model was used for training, and the other parameters were kept consistent for a fair evaluation. The experimental results are listed in Table 6.
Comparative experiments were conducted on three datasets: SST-2, TREC, and SUBJ. Across the ratios, there was no significant difference in average performance between STOP-SYM and GPT3Mix. With 1% of the dataset, STOP-SYM outperformed GPT3Mix by approximately 1%, whereas with 0.1%, GPT3Mix outperformed STOP-SYM by approximately 1%. The greatest advantage of LLMs lies in few-shot scenarios, yet the proposed STOP-SYM remained competitive and even superior at the 1% ratio, while GPT3Mix relies on GPT-3 and requires far more computational resources and memory. This demonstrates the value of the proposed approach.
We proposed a new data-augmentation method that relies solely on insertion. OOV words expand the corpus beyond the training dataset, whereas synonyms of stopwords act as white noise that minimizes the impact on learning. STOP-SYM uses words that are both OOV and synonyms of stopwords for insertion-based data augmentation. We demonstrated its effectiveness across various text-classification tasks, showing superior performance compared with previous methods. Furthermore, our experiments highlighted the significance of both the OOV and stopword-synonym characteristics.
No potential conflict of interest relevant to this article was reported.
Table 1. Performance of each individual operation in EDA.

Method | 1% | 10% | 100% | Average
---|---|---|---|---
EDA | 67.1 | | | 81.8
Deletion | | 86.7 | 91.0 | 81.7
Swap | 67.4 | 86.1 | 90.8 | 81.4
Replacement | 67.2 | 86.3 | 90.9 | 81.4
Insertion | 67.4 | 87.0 | |

The bold font indicates the best performance in each test. We used the SST-2 dataset and generated four samples for each sentence. We fine-tuned BERT with the augmented dataset for classification. We performed the experiments with various fractions of the dataset: 1%, 10%, and 100%. In the 1% case, we used only 1% of the dataset for augmentation and model training.
Table 2. Examples of synonyms of stopwords.

Data | # of candidates | Example synonyms of stopwords
---|---|---
SST-2 | 321 | ‘maine’, ‘tween’, ‘coiffe’, ‘sulphur’, ‘nobelium’, ‘inward’, ‘afterward’, ‘backside’, ‘beingness’
SUBJ | 306 | ‘buttocks’, ‘potty’, ‘soh’, ‘iodin’, ‘lav’, ‘nether’, ‘slay’, ‘posterior’, ‘ampere’, ‘kod’
PC | 363 | ‘coif’, ‘crapper’, ‘momma’, ‘oxygen’, ‘lonesome’, ‘commode’, ‘mama’, ‘suffice’, ‘sour’, ‘embody’
CR | 400 | ‘keister’, ‘dress’, ‘rump’, ‘bum’, ‘apiece’, ‘fanny’, ‘oregon’, ‘unity’, ‘polish’, ‘saami’, ‘metre’
TREC | 352 | ‘buns’, ‘bequeath’, ‘practise’, ‘seaport’, ‘astir’, ‘ar’, ‘assume’, ‘boost’, ‘milliampere’
Table 3. Example of augmentation samples.
Data | Sentence |
---|---|
Original | Too clumsy in key moments to make a big splash |
Sample 1 | Too clumsy |
Sample 2 | Too clumsy in key moments to make |
Sample 3 | Too |
Sample 4 |
Table 4. Overall performance on each dataset.

Data | Method | 1% | 5% | 10% | 20% | 40% | 60% | 80% | 100% | Overall
---|---|---|---|---|---|---|---|---|---|---
SST-2 | Only original | 55.7 | 84.5 | 86.7 | 88.1 | 89.3 | 90.2 | 90.8 | 91.2 | 84.6
 | EDA | 67.1 | 84.2 | 87.1 | 88.3 | 89.2 | 90.3 | 90.5 | 91.3 | 86.0
 | AEDA | 77.8 | 86.5 | 88.6 | 89.8 | 90.8 | 91.1 | 91.2 | | 88.1
 | TF-Low | 70.7 | 85.1 | 87.6 | 89.0 | 89.8 | 90.6 | 91.0 | 91.4 | 86.9
 | STOP | 77.5 | 85.6 | 87.9 | 88.2 | 89.1 | 90.3 | 90.3 | 91.1 | 87.5
 | STOP-SYM | | | | | | | | | 88.7
SUBJ | Only original | 87.2 | 92.4 | 94.4 | 95.0 | 95.6 | 96.1 | 96.1 | 96.3 | 94.1
 | EDA | 91.7 | 94.1 | 95.1 | 95.0 | 95.8 | 95.6 | 96.5 | 96.0 | 95.0
 | AEDA | 91.5 | 93.3 | 94.7 | 95.5 | 95.5 | 95.8 | 95.6 | 96.5 | 94.8
 | TF-Low | 87.3 | 93.5 | 94.2 | 95.5 | 95.9 | 95.9 | 96.0 | 96.6 | 94.4
 | STOP | 91.0 | 94.7 | 94.9 | 94.2 | 94.8 | 95.9 | 96.0 | | 94.5
PC | Only original | 72.3 | 91.7 | 92.9 | 93.5 | 94.2 | 94.5 | 94.9 | 95.2 | 91.2
 | EDA | 88.1 | 92.3 | 93.0 | 93.8 | 94.3 | 94.3 | 94.7 | 95.0 | 93.2
 | AEDA | 93.3 | 93.7 | 94.8 | 94.8 | 94.9 | 95.4 | | | 93.8
 | TF-Low | 90.6 | 93.2 | 94.0 | 94.6 | 94.7 | 94.8 | 95.2 | | 93.7
 | STOP | 89.2 | 92.1 | 92.9 | 93.4 | 94.1 | 94.3 | 94.7 | 95.2 | 93.2
 | STOP-SYM | 90.4 | 92.4 | | | | | | |
CR | Only original | 63.2 | 66.5 | 68.4 | 71.7 | 84.0 | 84.8 | 86.2 | | 76.6
 | EDA | 54.6 | 75.1 | 79.6 | 82.5 | 86.6 | 85.7 | | | 79.7
 | AEDA | 63.1 | 74.7 | 79.9 | 86.1 | 84.8 | 87.1 | | | 80.8
 | TF-Low | 63.4 | 68.8 | 80.2 | 80.9 | 85.7 | 85.3 | 86.5 | 86.2 | 79.6
 | STOP | 61.7 | 76.1 | 80.8 | 81.6 | 85.9 | 85.9 | 86.2 | 86.2 | 80.6
 | STOP-SYM | 83.3 | 86.6 | 87.1 | | | | | |
TREC | Only original | 37.2 | 60.6 | 79.4 | 81.2 | 94.3 | 95.0 | 95.4 | 95.6 | 79.8
 | EDA | 62.6 | 80.8 | 92.3 | 93.4 | 95.2 | 96.4 | 95.4 | 96.0 | 89.0
 | AEDA | 62.4 | 84.8 | 92.2 | 94.0 | 95.4 | 95.6 | 95.5 | | 89.6
 | TF-Low | 60.6 | 86.0 | 92.0 | 93.7 | 95.7 | 96.4 | 96.0 | 96.1 | 89.6
 | STOP | 62.3 | 85.0 | 91.4 | 94.4 | 95.5 | 96.2 | 96.2 | 96.4 | 89.7

The bold font indicates the best performance in each test.
Table 5. Performance and the number of augmented samples.

# of samples | 1% | 10% | 100% | Average
---|---|---|---|---
2 | 80.5 | 87.0 | 91.5 | 86.3
4 | 81.6 | 88.7 | 91.6 | 87.3
6 | 81.5 | | |
8 | 82.0 | 88.5 | 91.4 | 87.3
10 | 82.2 | 87.6 | 91.3 | 87.0
16 | | 88.0 | 91.0 | 87.1

The bold font indicates the best performance in each test.
Table 6. Comparison with GPT3Mix.

Data | GPT3Mix (0.1%) | STOP-SYM (0.1%) | GPT3Mix (0.3%) | STOP-SYM (0.3%) | GPT3Mix (1%) | STOP-SYM (1%)
---|---|---|---|---|---|---
SST-2 | 78.0 | 78.2 | 84.9 | 82.9 | 87.7 | 82.9
TREC | 47.7 | 44.0 | 57.8 | 54.6 | 60.5 | 68.2
SUBJ | 85.4 | 85.9 | 87.5 | 87.1 | 90.6 | 90.3
Average | 70.4 | 69.4 | 76.7 | 74.9 | 79.6 | 80.5
Figure 1. Visualization of diversity comparison between EDA and STOP-SYM through t-SNE.
Figure 2. Comparison experiment results of the OOV effect.
Figure 3. Comparison experiment results of the synonyms effect.