Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(2): 83-92

Published online June 25, 2024

https://doi.org/10.5391/IJFIS.2024.24.2.83

© The Korean Institute of Intelligent Systems

Need Text Data Augmentation? Just One Insertion Is Enough

Ho-Seung Kim1 and Jee-Hyong Lee2

1Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Korea
2Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Korea

Correspondence to: Jee-Hyong Lee (john@skku.edu)

Received: February 20, 2024; Accepted: May 30, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data augmentation generates additional samples for data expansion. The modification method is an augmentation technique that is commonly used because of its simplicity. This method modifies the words in sentences using simple rules. It has the advantages of low complexity and cost, because it simply needs to sequentially scan a dataset without requiring complex computations. Despite its simplicity, it has a drawback: it uses only the training dataset corpus, which leads to the repeated learning of the same words and limited diversity. In this study, we propose STOP-SYM, which is simpler and more effective than previous methods, while addressing their drawbacks. In previous simple data-augmentation methods, various operations, such as delete, insert, replace, and swap, were used to inject diverse noise. The proposed method, STOP-SYM, generates sentences by simply inserting words. STOP-SYM uses the intersection of out-of-vocabulary (OOV) words and stopword synonyms. OOV words enable the use of a corpus beyond the training dataset, and synonyms of stopwords act as white noise, minimizing their impact on training. By inserting these words into sentences, augmented samples that increase the diversity of the dataset can be easily obtained. Compared with recent simple data-augmentation methods, our approach demonstrates superior performance. We also conducted comparative experiments on various text-classification datasets and against a GPT-based model to demonstrate its superiority.

Keywords: Data augmentation, Natural language processing, Text classification

Data augmentation is essential in all fields of artificial intelligence (AI) that deal with data, including natural language processing (NLP) and computer vision (CV) [1, 2]. Deep-learning models learn from a limited training dataset to solve given tasks [3]. When there is a shortage of data, the models may not obtain sufficient information from the dataset. Data augmentation generates new samples from the training dataset, and helps solve tasks with insufficient data. In NLP, the data-augmentation method typically generates new sentences. New sentence generation can be categorized into two approaches. The first is a generation method that uses large language models (LLMs) such as T5 and GPT to generate new sentences [4, 5]. The second type is a modification method that uses simple rules to generate sentences [6].

As in GPT, the generation method utilizes prompts crafted by humans for augmentation [7]. This method can generate various sentences; however, it also has certain drawbacks. One drawback is that human experts must create new prompts, particularly when dealing with data from different domains or languages [8]. As LLMs become larger, they typically require more computational resources and memory owing to the increase in the number of parameters [9]. Therefore, it is difficult to apply this method to large amounts of data. The modification method modifies parts of the original sentence based on the rules used to generate the sentences. It uses simple operations such as deletion, swapping, replacement, and insertion to modify [10]. This method is simple and cost effective. It can be easily applied to large amounts of data. One drawback is limited diversity, as it modifies the original sentence with simple operations. If the diversity of the generated sentences is limited, the effect of augmentation can also be limited [11]. Previous modification methods increased diversity by using various operations such as deletion, swapping, replacement, and insertion [12, 13]. They attempted to add noise to original sentences through various operations. However, the effectiveness of these operations has not yet been proven. We experimentally confirmed that the use of various operations for data augmentation was ineffective. The experimental results are presented in Section 2.

We propose a simpler and more effective augmentation method called STOP-SYM. We used only one operation, insertion, and out-of-vocabulary (OOV) tokens. Because OOV tokens do not appear in the training dataset, inserting them into the original sentence may help increase the diversity of augmented sentences [14]. However, the insertion of a new word may change the meaning of the sentence. We therefore selected OOV tokens with representations similar to stopwords and inserted them into the original sentences. Because stopwords change the meaning of sentences only slightly [15], inserting OOV tokens similar to stopwords lets us generate new sentences that have a similar meaning to the original sentences but higher diversity. Using the augmented samples generated by the proposed method, we observed a significant improvement in performance. In limited data situations, we achieved a performance improvement of 46.4% and an overall performance improvement of 4.9%. We demonstrated the superiority of our method over previous methods, including GPT-based models, with five different classification datasets for diverse tasks.

2.1 Generation Methods

There are many generation methods. Before the widespread use of LLMs, methods such as back-translation, which generated sentences with meanings similar to the original sentence, were commonly employed [16]. In addition to back-translation, most methods rely on seq2seq models [17]. Recently, with the widespread adoption of LLMs, they have been extensively used in data augmentation. LAMBADA [18], SC-GPT [19], and GPT3Mix [20] use GPT models. GPT3Mix is a recent data-augmentation method that utilizes GPT-3, which is an LLM. GPT3Mix operates in three steps: first, it samples data from a dataset to create mixed data by selecting two sentences with different labels. Next, it constructs a GPT3Mix prompt using these mixed samples, which serves as a prompt for the LLMs to generate new sentences. GPT3Mix leverages soft labels predicted by the language model to effectively extract knowledge while simultaneously generating text transformations. This method has been shown to achieve superior performance compared to simple data-augmentation techniques. However, it is more complex and relies on the performance of GPT-based models.

2.2 Modification Methods

Unlike generation models with numerous trainable parameters, modification methods employ simple operations that modify parts of a sentence to introduce noise. Previous methods, such as hierarchical data augmentation (HDA) [21], mention replacement (MR) [22], and shuffle within segments (SiS) [23], have adopted various operations, such as deleting unimportant parts, replacing mentions with words of the same entity type, and shuffling segments. An easier data augmentation (AEDA) [24] is a data-augmentation method that relies solely on insertion. It inserts punctuation marks to create a new sentence, with each mark chosen randomly from six options: {‘.’, ‘;’, ‘?’, ‘:’, ‘!’, ‘,’}. Although it is intuitive and effective, its diversity is relatively low and its performance is limited. Easy data augmentation (EDA) [25] is a representative modification method that generates data samples by applying four operations: synonym replacement (SR), random insertion (RI), random swap (RS), and random deletion (RD). While EDA offers a straightforward way to perform data augmentation, using various operations does not guarantee the diversity of the augmented sentences. We examined how each operation affects the performance of EDA. The experimental results are listed in Table 1. One notable observation was that there was no significant difference between using four operations and using a single operation. In particular, even in the full-data scenario, insertion alone matched EDA. It appears that the four operations do not effectively inject diverse noise.

In this section, we propose a new method that uses one operation, insertion, and OOV tokens. We aimed to simplify the noise injection using only insertion. We utilize OOV tokens to introduce noise, but prevent semantic degradation by selecting synonyms of stopwords from the OOV tokens. This approach is more effective and yields better performance than previous methods.

3.1 Synonym of Stopwords

In this section, we explain the selection of OOV words and synonyms of stopwords. We expect to increase the diversity of sentences by inserting words not present in the training dataset. Inserting any word can easily increase the diversity; however, we need to avoid critical words that significantly influence class determination. Critical words are used frequently in specific classes; therefore, we used OOV words to avoid them. If we uniformly insert OOV words into all sentences regardless of class, the probability of them becoming critical words is low. In this way, using OOV words prevents the insertion of critical words.

However, we cannot prevent semantic degradation caused by using OOV words.

To select words that do not significantly degrade the semantics of sentences, we propose the use of words that carry little meaning, namely stopword synonyms. Stopwords are common words that occur frequently but contribute little to the meaning of a sentence. They are likely to be present in most datasets, so finding stopwords among OOV words is almost impossible. Therefore, we used synonyms of stopwords, which have meanings similar to stopwords. We selected words that were both OOV and synonyms of stopwords as candidates for insertion. Using the selected words, we can easily increase the diversity of the augmented sentences while preventing semantic degradation.

The stopwords were obtained from the NLTK package. Synonyms were obtained from WordNet using the ‘synsets’ function. There were 179 stopwords, and these stopwords had 815 synonyms. From these synonyms, the OOV words that were not present in the training dataset were selected. Table 2 lists the number of candidates, denoted by N, for each dataset along with examples. For most of the datasets, N typically amounted to approximately 50% of the 813 candidates. Many stopword synonyms already appear in most datasets, so the OOV constraint reduces the candidate count. The examples shown are infrequently used words. These words are unrelated to their respective domains; therefore, they are not shortcuts for determining class.
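The candidate selection described above can be sketched as a set difference. This is a minimal illustration rather than the authors' code: the synonym table below is a hand-written toy stand-in for the NLTK stopword list and WordNet lookups, the function names are hypothetical, and whitespace tokenization and the exclusion of the stopwords themselves are assumptions.

```python
from typing import Dict, Iterable, List, Set


def build_vocabulary(sentences: Iterable[str]) -> Set[str]:
    """Collect the lowercase, whitespace-separated tokens of the training set."""
    vocab: Set[str] = set()
    for sentence in sentences:
        vocab.update(sentence.lower().split())
    return vocab


def select_candidates(stopword_synonyms: Dict[str, List[str]],
                      train_vocab: Set[str]) -> List[str]:
    """Keep stopword synonyms that are OOV with respect to the training set.

    Assumption: the stopwords themselves are also excluded from the candidates.
    """
    stopwords = set(stopword_synonyms)
    synonyms = {s.lower() for syns in stopword_synonyms.values() for s in syns}
    return sorted(synonyms - stopwords - train_vocab)


# Toy synonym table (hypothetical); the paper derives it from NLTK and WordNet.
toy_synonyms = {
    "in": ["inwards", "inward", "indium"],
    "no": ["nobelium", "zero"],
    "a": ["ampere", "angstrom"],
}
train_sentences = ["Too clumsy in key moments", "zero chance of a big splash"]

candidates = select_candidates(toy_synonyms, build_vocabulary(train_sentences))
print(candidates)
```

Here "zero" is dropped because it already occurs in the toy training set, which mirrors how the OOV constraint shrinks the candidate list for each dataset.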

3.2 Data Augmentation by Insertion

In this section, we explain how we conducted data augmentation using stopword synonyms. We inserted a randomly chosen synonym for stopwords at random positions in a sentence. We repeated this random insertion k times for each sentence. The range of k was 1 to one-third of the sentence length. In other words, the maximum number of inserted words was one-third of the sentence length. The data generated in this manner were included in the training data and used as input samples. Our approach not only preserves the original input tokens, but also significantly broadens the variation within sentences.
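The insertion procedure above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names, whitespace tokenization, the seed handling, and the floor of one insertion for very short sentences are all assumptions.

```python
import random
from typing import List, Optional


def stop_sym_insert(sentence: str, candidates: List[str],
                    rng: Optional[random.Random] = None) -> str:
    """Insert k random candidate words at random positions, 1 <= k <= len/3."""
    rng = rng or random.Random()
    tokens = sentence.split()
    if not tokens or not candidates:
        return sentence
    # Assumption: sentences shorter than three tokens still get one insertion.
    max_k = max(1, len(tokens) // 3)
    k = rng.randint(1, max_k)
    for _ in range(k):
        # randint is inclusive, so position == len(tokens) appends at the end.
        position = rng.randint(0, len(tokens))
        tokens.insert(position, rng.choice(candidates))
    return " ".join(tokens)


def augment(sentence: str, candidates: List[str],
            n_samples: int = 4, seed: int = 0) -> List[str]:
    """Generate n_samples augmented variants of one sentence."""
    rng = random.Random(seed)
    return [stop_sym_insert(sentence, candidates, rng) for _ in range(n_samples)]


original = "Too clumsy in key moments to make a big splash"
toy_candidates = ["maine", "nobelium", "coiffe", "inwards"]
for sample in augment(original, toy_candidates):
    print(sample)
```

Because only insertions are applied, every original token survives in its original order; for the ten-token example sentence, at most three words are added per sample.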

Table 3 lists the augmented samples. When examining the augmented sentences, we observe that many infrequently used or less meaningful words have been inserted. The proposed method introduces additional words with OOV and stopword-like characteristics while preserving the original input tokens. The augmented sentences can be considered more diverse than those generated using the previous method. Figure 1 visualizes sentences augmented using STOP-SYM and EDA: the sentences were vectorized and then projected using t-SNE. As observed, the data augmented with STOP-SYM exhibited greater diversity than the data augmented with EDA, resulting in a more scattered pattern.

To compare with the EDA and AEDA methods, we employed the same code base. We fine-tuned the pretrained language model BERT (bidirectional encoder representations from transformers) using the generated augmented samples alongside the original training samples. We then measured accuracy, our primary metric, from the predictions of this model on the test samples. The pretrained language model was the BERT-base model provided by Hugging Face. The code was executed using a GeForce RTX 2080 GPU with 12 GB of memory.

4.1 Dataset

The performance of the proposed method was evaluated under various environments. Our method is expected to be applicable to all text-classification tasks. To validate this, we used five text-classification datasets with different domains and sizes. SST-2 [26] is a sentiment analysis dataset of movie reviews containing 6,920 training samples. SUBJ [27] is a subjectivity analysis dataset of movie reviews with 8,000 training samples. PC [28] is a pros and cons classification dataset with 39,428 training samples. CR [29, 30] is a sentiment analysis dataset of customer reviews and contains 1,863 training samples. TREC [31] is a question classification dataset with 5,452 training samples.

In an additional setting, we evaluated performance while controlling the data ratio. Simple modification methods are heavily influenced by the quantity of data. They are effective when data are scarce, but less effective when data are plentiful because of their limited diversity. We aimed to demonstrate the effectiveness of our method regardless of the data quantity and therefore applied various dataset ratios: 1%, 5%, 10%, 20%, 40%, 60%, 80%, and 100%. In addition, we compared the overall performance across all data ratios.
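The data-ratio setting can be illustrated with a small subsampling helper. This is a sketch under stated assumptions: the text does not say how the subsets were drawn, so uniform random sampling without stratification, the rounding rule, and the function name are all hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Ratios listed in the text.
RATIOS = [0.01, 0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00]


def subsample(data: Sequence[T], ratio: float, seed: int = 0) -> List[T]:
    """Draw a random subset containing roughly `ratio` of the data."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Assumption: round to the nearest integer and keep at least one sample.
    n = max(1, round(len(data) * ratio))
    return random.Random(seed).sample(list(data), n)


# Dummy stand-in for the 6,920 SST-2 training samples.
dummy_train = list(range(6920))
sizes = [len(subsample(dummy_train, r)) for r in RATIOS]
print(sizes)
```

Under these assumptions, the 1% setting of SST-2 leaves only 69 training sentences, which gives a sense of how little data the low-ratio experiments use.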

4.2 Model

Previous methods have used a sequential model; however, we used a pre-trained language model. The EDA [25] method suggests that data augmentation may not be necessary for BERT because it is already pre-trained on a large corpus with a vast vocabulary. Similarly, AEDA [24] primarily focused on overall performance comparisons with RNN- and LSTM-based models. However, language models have recently become essential. We evaluated the effectiveness of our data-augmentation method, specifically within a pretrained language model using BERT. In our study, we utilized BERT [32] provided by Hugging Face, and we employed ‘bert-base-uncased’ to evaluate the effects of our data augmentation. BERT was used as the base pretrained model, and we specifically utilized the [CLS] token representation for classification purposes.

To evaluate our augmentation method, we compared it with four modification methods: 1) EDA is a representative simple data-augmentation method [25]. 2) AEDA is a method similar to ours that relies solely on the insertion of punctuation marks [24]. 3) TF-Low uses TF-IDF (term frequency-inverse document frequency) scores. It selects unimportant words by TF-IDF score and randomly inserts them into the sentence [33]. 4) STOP uses stopwords. Random stopwords are selected and inserted into the sentences [34].

Table 4 lists the experimental results on the five datasets. Only Original used only the original data for fine-tuning. For the other methods, the original training data were augmented with the respective method, and both the augmented and original training data were used for fine-tuning. Each experiment was repeated with five different seeds and the results were averaged.
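As a quick check on how to read Table 4, the Overall column is consistent with the arithmetic mean of the eight data-ratio columns. This reading is inferred from the reported numbers rather than stated explicitly by the authors; the snippet below reproduces it for two SST-2 rows.

```python
from statistics import mean

# SST-2 accuracies from Table 4 at 1%, 5%, 10%, 20%, 40%, 60%, 80%, 100%.
sst2 = {
    "Only original": [55.7, 84.5, 86.7, 88.1, 89.3, 90.2, 90.8, 91.2],
    "STOP-SYM": [81.6, 87.3, 88.7, 89.2, 90.0, 91.0, 91.2, 91.6],
}

# Assumption: Overall = mean over the eight ratios, rounded to one decimal.
overall = {method: round(mean(scores), 1) for method, scores in sst2.items()}
print(overall)
```

Both values match the Overall entries reported for these rows (84.6 and 88.8).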

5.1 Overall Performance

Because all datasets showed similar results, we first examined the SST-2 [26] dataset. Table 4 shows that STOP-SYM achieved the best overall performance, followed by AEDA, STOP, TF-Low, EDA, and Only Original. STOP-SYM also performed well in most individual data scenarios. Comparing STOP-SYM to Only Original, we observed a 46.5% relative performance improvement with 1% of the dataset and a 0.4% improvement with the full dataset. The overall performance was enhanced by 5%. This demonstrates the effectiveness of the proposed method.
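The improvements quoted here are relative gains over the Only Original baseline, not differences in accuracy points. A short worked example with the SST-2 values from Table 4 (the helper name is hypothetical, and rounding to one decimal is an assumption):

```python
def relative_improvement(baseline: float, augmented: float) -> float:
    """Relative gain of `augmented` over `baseline`, in percent."""
    return (augmented - baseline) / baseline * 100.0


# SST-2, Only original vs. STOP-SYM (Table 4).
gain_1pct = round(relative_improvement(55.7, 81.6), 1)      # 1% of the data
gain_full = round(relative_improvement(91.2, 91.6), 1)      # full dataset
gain_overall = round(relative_improvement(84.6, 88.8), 1)   # Overall column

print(gain_1pct, gain_full, gain_overall)  # 46.5 0.4 5.0
```

In absolute terms the 1% setting improves by 25.9 accuracy points (55.7 to 81.6), which corresponds to the 46.5% relative gain.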

We demonstrated the effectiveness of our approach by comparing it with various previous methods. First, using insertion alone was more effective than combining four operations. To determine whether a method based solely on insertion was effective, we compared STOP-SYM with EDA. STOP-SYM achieved a 3.2% improvement in overall performance and outperformed EDA at every data ratio. Second, the words selected using the proposed method were suitable for insertion-based augmentation. Except for EDA, the compared methods use only an insertion operation. Because EDA is not consistently the lowest-performing method across datasets, the effectiveness of insertion-based augmentation depends on the insertion target. STOP-SYM achieved improvements of 0.7%, 2.1%, and 1.5% over AEDA, TF-Low, and STOP, respectively. This implies that the better the insertion targets, the better the augmentation performance.

We also show that STOP-SYM performs robustly across classification tasks and domains and is not limited to sentiment classification. To confirm this, we applied STOP-SYM to subjectivity, pros and cons, and question classification. Compared to Only Original, STOP-SYM showed performance improvements of 1.3% on the SUBJ dataset, 3% on PC, and 13.6% on TREC. STOP-SYM utilizes synonyms of stopwords that are not present in the training dataset. Therefore, it exhibits robust performance in various classification tasks regardless of their specific characteristics. In sentiment classification, STOP-SYM is effective in two domains: movie reviews and customer reviews. It achieved a 6.2% improvement in performance in the CR domain. Although STOP-SYM has not been tested extensively in many domains, the results suggest that it has no domain-specific limitations and can achieve superior performance in various domains.

In this section, additional experimental results are presented to verify the effectiveness of the proposed approach. We performed an ablation study on OOV words, stopword synonyms, and the number of augmented samples. We also present the comparative result using a generation method, GPT3Mix.

6.1 Effectiveness of OOV

In this section, we study the effectiveness of OOV. We used the SST-2 dataset for evaluation, and the other experimental settings were the same as those in the main experiment. Figure 2 shows the results of this comparison. We compared four methods: Only Original, SS-IV, SS-ALL, and SS-OOV. Only Original uses only the original training data. The others insert words selected from different vocabularies. Synonyms of stopwords-in vocabulary (SS-IV) uses only the synonyms of stopwords that are present in the training corpus. Synonyms of stopwords-all (SS-ALL) uses all synonyms of stopwords. Synonyms of stopwords-out of vocabulary (SS-OOV) is the proposed method. From 40% onwards, the results showed a similar pattern across methods; therefore, we omitted them. The differences between the methods were significant at data ratios of 20% and below, which makes the methods easy to compare there.
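The three insertion vocabularies compared here are simple set operations on the stopword synonyms and the training vocabulary. The sketch below uses toy word sets and a hypothetical helper name; it only illustrates how the variants relate to each other.

```python
from typing import Dict, Set


def split_by_vocabulary(synonyms: Set[str], train_vocab: Set[str]) -> Dict[str, Set[str]]:
    """Partition stopword synonyms into the three ablation vocabularies."""
    return {
        "SS-ALL": set(synonyms),            # every stopword synonym
        "SS-IV": synonyms & train_vocab,    # synonyms seen in the training set
        "SS-OOV": synonyms - train_vocab,   # synonyms unseen in training (proposed)
    }


# Toy data (hypothetical), for illustration only.
toy_synonyms = {"inward", "nobelium", "zero", "unity", "front"}
toy_vocab = {"zero", "front", "movie", "splash"}

variants = split_by_vocabulary(toy_synonyms, toy_vocab)
for name in ("SS-ALL", "SS-IV", "SS-OOV"):
    print(name, sorted(variants[name]))
```

SS-IV and SS-OOV are disjoint and together make up SS-ALL, so the ablation isolates the effect of the OOV constraint while the synonym source stays fixed.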

These results indicate that both SS-IV and SS-ALL show significantly lower performances than SS-OOV. This demonstrates the effectiveness of using OOV words as insertion targets. Interestingly, SS-IV and SS-ALL exhibited similar performance. This means that synonyms of stopwords in vocabulary cannot provide a substantial benefit. This demonstrates the potential negative effects of data augmentation.

6.2 Effectiveness of Synonyms

In this section, we examine whether stopword synonyms are effective. To assess this, we used words from the Brown Corpus [35], provided by NLTK, instead of stopword synonyms. We used 321 synonyms of stopwords for the SST-2 dataset. To ensure fairness, 321 Brown Corpus words were sampled randomly. Figure 3 shows a comparison of the experimental results. Brown denotes the method that used these 321 words as insertion targets. These words are OOV and are not synonyms of stopwords.
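The construction of the Brown control set can be sketched as sampling a fixed number of words that are OOV but not stopword synonyms. The word lists below are toy stand-ins for the Brown Corpus, and the alphabetic-token filter, seed, and function name are assumptions not described in the text.

```python
import random
from typing import Iterable, List, Set


def sample_control_words(corpus_words: Iterable[str], train_vocab: Set[str],
                         stopword_synonyms: Set[str], n: int, seed: int = 0) -> List[str]:
    """Sample n words that are OOV and are not synonyms of stopwords."""
    # Assumption: keep alphabetic tokens only and compare in lowercase.
    pool = {w.lower() for w in corpus_words if w.isalpha()}
    pool = sorted(pool - train_vocab - stopword_synonyms)
    if len(pool) < n:
        raise ValueError("not enough eligible words to sample from")
    return random.Random(seed).sample(pool, n)


# Toy stand-ins (hypothetical) for Brown Corpus words and the SST-2 vocabulary.
toy_corpus = ["Fulton", "County", "jury", "said", "election", "nobelium", "movie", "1961"]
toy_vocab = {"movie", "said"}
toy_synonyms = {"nobelium", "inward"}

control = sample_control_words(toy_corpus, toy_vocab, toy_synonyms, n=3)
print(control)
```

Matching the size of the control set to the number of STOP-SYM candidates (321 for SST-2) keeps the comparison about the kind of inserted word rather than the number of distinct words.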

We used the SST-2 dataset for this experiment, maintaining the same experimental setup as that described in Section 6.1. Our proposed SS-OOV demonstrated superior performance compared to Only Original and Brown. This suggests the effectiveness of OOV words for insertion. However, the lower performance of Brown compared to SS-OOV shows that restricting the OOV words to synonyms of stopwords is more effective than randomly sampling OOV words.

6.3 Augmented Sample Size

We conducted an ablation study to investigate the effect of the number of augmented samples on performance using STOP-SYM. The number of samples was varied over the set {2, 4, 6, 8, 10, 16}. Sparse subsets were used for a more detailed analysis. We used the SST-2 dataset, and the experimental environment was the same as before. The results are summarized in Table 5.

When using only 1% of the data, we observed performance improvement as the number of samples increased. This suggests that in limited dataset scenarios, such as 1%, having more data is beneficial for performance. However, in situations with sufficient data (full dataset), we observed a tendency for the performance to decrease when a certain number of samples were generated. This may be indicative of overfitting, where an excessive number of samples leads to a model with bias. From this experiment, we conclude that it is important to select an appropriate number of samples.

6.4 Comparison with Generation Method

The advantages of this modification method are its low complexity and cost. Therefore, there is a question as to whether performance will suffer as complexity and cost decrease. It is necessary to verify how much the performance degrades compared to recent generation methods. To address this, we compared our results with those of a data-augmentation method using LLMs called GPT3Mix. We cited the performance results from GPT3Mix [20], and conducted experiments with other conditions adjusted to align with our approach. Data augmentation generated 10x samples compared to the original data. The original data subset ratios were adjusted to {0.1%, 0.3%, and 1%} to simulate few-shot scenarios. The BERT model was used for training, and the other parameters were adjusted to be consistent for a fair evaluation. The experimental results are listed in Table 6.

Comparative experiments were conducted using the following three datasets: SST-2, TREC, and SUBJ. Across the different ratios, there was no significant difference in the average performance between STOP-SYM and GPT3Mix. When 1% of the dataset was used, STOP-SYM outperformed GPT3Mix by approximately 1%, whereas when 0.1% of the dataset was used, GPT3Mix outperformed STOP-SYM by approximately 1%. The greatest advantage of LLMs lies in their performance in few-shot scenarios, and the proposed STOP-SYM demonstrated competitive performance even in such situations. Moreover, GPT3Mix uses GPT-3 and requires more computational resources and memory than the proposed method. These results demonstrate the value of the proposed approach.

We proposed a new data-augmentation method that relies solely on insertion. OOV words extend the corpus beyond the training dataset, whereas synonyms of stopwords act as white noise, minimizing their impact on learning. STOP-SYM selects words that are both OOV and synonyms of stopwords, and uses them for insertion-based data augmentation. We demonstrated its effectiveness across various text-classification tasks, showing superior performance compared with previous methods. Furthermore, through experiments, we highlighted the significance of the characteristics of OOV words and synonyms of stopwords.

This work was supported in part by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019-0-00421, Artificial Intelligence Graduate School Program (Sungkyunkwan University)) and in part by the MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience Program (IITP-2023-2020-0-01821) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Fig. 1. Visualization of diversity comparison between EDA and STOP-SYM through t-SNE.


Fig. 2. Comparison experiment results of OOV effect.


Fig. 3. Comparison experiment results of synonyms effect.


Table 1. Performance of each individual operation in EDA.

| Method | 1% | 10% | 100% | Average |
|---|---|---|---|---|
| EDA | 67.1 | 87.1 | 91.3 | 81.8 |
| Deletion | 67.5 | 86.7 | 91.0 | 81.7 |
| Swap | 67.4 | 86.1 | 90.8 | 81.4 |
| Replacement | 67.2 | 86.3 | 90.9 | 81.4 |
| Insertion | 67.4 | 87.0 | 91.3 | 81.9 |

The bold font indicates the best performance in each test. We used the SST-2 dataset and generated four samples for each sentence. We fine-tuned BERT with the augmented dataset for classification. We performed the experiments with various fractions of the dataset: 1%, 10%, and 100%. In the case of 1%, we used only 1% of the dataset for augmentation and model training.


Table 2. Synonyms of stopwords examples.

| Data | N | Synonyms of stopwords |
|---|---|---|
| SST-2 | 321 | ‘maine’, ‘tween’, ‘coiffe’, ‘sulphur’, ‘nobelium’, ‘inward’, ‘afterward’, ‘backside’, ‘beingness’ |
| SUBJ | 306 | ‘buttocks’, ‘potty’, ‘soh’, ‘iodin’, ‘lav’, ‘nether’, ‘slay’, ‘posterior’, ‘ampere’, ‘kod’ |
| PC | 363 | ‘coif’, ‘crapper’, ‘momma’, ‘oxygen’, ‘lonesome’, ‘commode’, ‘mama’, ‘suffice’, ‘sour’, ‘embody’ |
| CR | 400 | ‘keister’, ‘dress’, ‘rump’, ‘bum’, ‘apiece’, ‘fanny’, ‘oregon’, ‘unity’, ‘polish’, ‘saami’, ‘metre’ |
| TREC | 352 | ‘buns’, ‘bequeath’, ‘practise’, ‘seaport’, ‘astir’, ‘ar’, ‘assume’, ‘boost’, ‘milliampere’ |

Table 3. Example of augmentation samples.

| Data | Sentence |
|---|---|
| Original | Too clumsy in key moments to make a big splash |
| Sample 1 | Too clumsy unity in key moments to oasis make a big splash front |
| Sample 2 | Too clumsy in key moments to make hundred a big splash |
| Sample 3 | Too maine clumsy inwards in key coiffe moments to make a big splash |
| Sample 4 | Hera too clumsy in key moments to make a big splash nobelium |

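Table 4. Overall performance on each dataset.

| Data | Method | 1% | 5% | 10% | 20% | 40% | 60% | 80% | 100% | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| SST-2 | Only original | 55.7 | 84.5 | 86.7 | 88.1 | 89.3 | 90.2 | 90.8 | 91.2 | 84.6 |
| SST-2 | EDA | 67.1 | 84.2 | 87.1 | 88.3 | 89.2 | 90.3 | 90.5 | 91.3 | 86.0 |
| SST-2 | AEDA | 77.8 | 86.5 | 89.0 | 88.6 | 89.8 | 90.8 | 91.1 | 91.2 | 88.1 |
| SST-2 | TF-Low | 70.7 | 85.1 | 87.6 | 89.0 | 89.8 | 90.6 | 91.0 | 91.4 | 86.9 |
| SST-2 | STOP | 77.5 | 85.6 | 87.9 | 88.2 | 89.1 | 90.3 | 90.3 | 91.1 | 87.5 |
| SST-2 | STOP-SYM | 81.6 | 87.3 | 88.7 | 89.2 | 90.0 | 91.0 | 91.2 | 91.6 | 88.8 |
| SUBJ | Only original | 87.2 | 92.4 | 94.4 | 95.0 | 95.6 | 96.1 | 96.1 | 96.3 | 94.1 |
| SUBJ | EDA | 91.7 | 94.1 | 95.1 | 95.0 | 95.8 | 95.6 | 96.5 | 96.0 | 95.0 |
| SUBJ | AEDA | 91.5 | 93.3 | 94.7 | 95.5 | 95.5 | 95.8 | 95.6 | 96.5 | 94.8 |
| SUBJ | TF-Low | 87.3 | 93.5 | 94.2 | 95.5 | 95.9 | 95.9 | 96.0 | 96.6 | 94.4 |
| SUBJ | STOP | 91.0 | 94.3 | 94.7 | 94.9 | 94.2 | 94.8 | 95.9 | 96.0 | 94.5 |
| SUBJ | STOP-SYM | 91.9 | 94.3 | 95.3 | 95.6 | 95.9 | 96.2 | 96.1 | 96.7 | 95.3 |
| PC | Only original | 72.3 | 91.7 | 92.9 | 93.5 | 94.2 | 94.5 | 94.9 | 95.2 | 91.2 |
| PC | EDA | 88.1 | 92.3 | 93 | 93.8 | 94.3 | 94.3 | 94.7 | 95.0 | 93.2 |
| PC | AEDA | 90.7 | 92.6 | 93.3 | 93.7 | 94.8 | 94.8 | 94.9 | 95.4 | 93.8 |
| PC | TF-Low | 90.6 | 92.6 | 93.2 | 94 | 94.6 | 94.7 | 94.8 | 95.2 | 93.7 |
| PC | STOP | 89.2 | 92.1 | 92.9 | 93.4 | 94.1 | 94.3 | 94.7 | 95.2 | 93.2 |
| PC | STOP-SYM | 90.4 | 92.4 | 93.7 | 94.3 | 95.1 | 95.0 | 95.2 | 95.5 | 94.0 |
| CR | Only original | 63.2 | 66.5 | 68.4 | 71.7 | 84.0 | 84.8 | 86.2 | 87.8 | 76.6 |
| CR | EDA | 54.6 | 75.1 | 79.6 | 82.5 | 86.6 | 86.0 | 87.3 | 85.7 | 79.7 |
| CR | AEDA | 63.1 | 74.7 | 79.9 | 84.5 | 86.1 | 86.0 | 84.8 | 87.1 | 80.8 |
| CR | TF-Low | 63.4 | 68.8 | 80.2 | 80.9 | 85.7 | 85.3 | 86.5 | 86.2 | 79.6 |
| CR | STOP | 61.7 | 76.1 | 80.8 | 81.6 | 85.9 | 85.9 | 86.2 | 86.2 | 80.6 |
| CR | STOP-SYM | 63.2 | 76.6 | 81.0 | 83.3 | 87.5 | 86.0 | 86.6 | 87.1 | 81.4 |
| TREC | Only original | 37.2 | 60.6 | 79.4 | 81.2 | 94.3 | 95 | 95.4 | 95.6 | 79.8 |
| TREC | EDA | 62.6 | 80.8 | 92.3 | 93.4 | 95.2 | 96.4 | 95.4 | 96.0 | 89.0 |
| TREC | AEDA | 62.4 | 84.8 | 92.2 | 94.0 | 95.4 | 95.6 | 95.5 | 96.5 | 89.6 |
| TREC | TF-Low | 60.6 | 86.0 | 92.0 | 93.7 | 95.7 | 96.4 | 96.0 | 96.1 | 89.6 |
| TREC | STOP | 62.3 | 85.0 | 91.4 | 94.4 | 95.5 | 96.2 | 96.2 | 96.4 | 89.7 |
| TREC | STOP-SYM | 63.9 | 87.1 | 93.6 | 95.2 | 96.1 | 96.5 | 96.4 | 96.5 | 90.7 |

The bold font indicates the best performance in each test.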


Table 5. Performance and the number of augmented samples.

| # of samples | 1% | 10% | 100% | Average |
|---|---|---|---|---|
| 2 | 80.5 | 87.0 | 91.5 | 86.3 |
| 4 | 81.6 | 88.7 | 91.6 | 87.3 |
| 6 | 81.5 | 89.0 | 91.9 | 87.4 |
| 8 | 82.0 | 88.5 | 91.4 | 87.3 |
| 10 | 82.2 | 87.6 | 91.3 | 87.0 |
| 16 | 82.4 | 88.0 | 91.0 | 87.1 |

The bold font indicates the best performance in each test.


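Table 6. Comparison with GPT3Mix.

| Data | GPT3Mix (0.1%) | STOP-SYM (0.1%) | GPT3Mix (0.3%) | STOP-SYM (0.3%) | GPT3Mix (1%) | STOP-SYM (1%) |
|---|---|---|---|---|---|---|
| SST-2 | 78.0 | 78.2 | 84.9 | 82.9 | 87.7 | 82.9 |
| TREC | 47.7 | 44.0 | 57.8 | 54.6 | 60.5 | 68.2 |
| SUBJ | 85.4 | 85.9 | 87.5 | 87.1 | 90.6 | 90.3 |
| Average | 70.4 | 69.4 | 76.7 | 74.9 | 79.6 | 80.5 |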

  1. Van Dyk, DA, and Meng, XL (2001). The art of data augmentation. Journal of Computational and Graphical Statistics. 10, 1-50. https://doi.org/10.1198/10618600152418584
    CrossRef
  2. Frid-Adar, M, Klang, E, Amitai, M, Goldberger, J, and Greenspan, H . Synthetic data augmentation using GAN for improved liver lesion classification., Proceedings of 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018, Washington, DC, USA, Array, pp.289-293. https://doi.org/10.1109/ISBI.2018.8363576
    CrossRef
  3. Shorten, C, Khoshgoftaar, TM, and Furht, B (2021). Text data augmentation for deep learning. Journal of Big Data. 8. article no 101. https://doi.org/10.1186/s40537-021-00492-0
    CrossRef
  4. Bird, JJ, Ekart, A, and Faria, DR (2023). Chatbot Interaction with artificial intelligence: human data augmentation with T5 and language transformer ensemble for text classification. Journal of Ambient Intelligence and Humanized Computing. 14, 3129-3144. https://doi.org/10.1007/s12652-021-03439-8
    CrossRef
  5. Chang, Y, Zhang, R, and Pu, J (2023). I-WAS: a data augmentation method with GPT-2 for simile detection. Document Analysis and Recognition – ICDAR 2023. Cham, Switzerland: Springer, pp. 265-279 https://doi.org/10.1007/978-3-031-41682-8_17
    CrossRef
  6. Wu, Z, Wang, S, Gu, J, Khabsa, M, Sun, F, and Ma, H. (2022) . CLEAR: contrastive learning for sentence representation. Available: https://arxiv.org/abs/2012.15466
  7. Dai, H, Liu, Z, Liao, W, Huang, X, Cao, Y, and Wu, Z. (2023) . AugGPT: Leveraging ChatGPT for text data augmentation. Available: https://arxiv.org/abs/2302.13007
  8. Gao, T, Fisch, A, and Chen, D. (2020) . Making pre-trained language models better few-shot learners. Available: https://arxiv.org/abs/2012.15723
  9. Chen, L, Zaharia, M, and Zou, J. (2023) . FrugalGPT: how to use large language models while reducing cost and improving performance. Available: https://arxiv.org/abs/2305.05176
  10. Shou, Z, Jiang, Y, and Lin, F . AMR-DA: data augmentation by abstract meaning representation., Findings of the Association for Computational Linguistics: ACL 2022, 2002, Dublin, Ireland, Array, pp.3082-3098. https://doi.org/10.18653/v1/2022.findings-acl.244
    CrossRef
  11. Bae, J, and Lee, C . Easy data augmentation for improved malware detection: a comparative study., Proceedings of 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), 2021, Jeju Island, South Korea, Array, pp.214-218. https://doi.org/10.1109/BigComp51126.2021.00048
    CrossRef
  12. Zhong, Z, Zheng, L, Kang, G, Li, S, and Yang, Y . Random erasing data augmentation., Proceedings of the AAAI Conference On Artificial Intelligence, 2020, Array, pp.13001-13008. https://doi.org/10.1609/aaai.v34i07.7000
    CrossRef
  13. Pervaiz, A, Hussain, F, Israr, H, Tahir, MA, Raja, FR, Baloch, NK, Ishmanov, F, and Zikria, YB (2020). Incorporating noise robustness in speech command recognition by noise augmentation of training data. Sensors. 20. article no 2326. https://doi.org/10.3390/s20082326
    Pubmed KoreaMed CrossRef
  14. Liu, NF, May, J, Pust, M, and Knight, K (2018). Augmenting statistical machine translation with subword translation of out-of-vocabulary words. Available: https://arxiv.org/abs/1808.05700
  15. Silva, C, and Ribeiro, B. The importance of stop word removal on recall values in text categorization. Proceedings of the International Joint Conference on Neural Networks, 2003, Portland, OR, USA, pp. 1661-1666. https://doi.org/10.1109/IJCNN.2003.1223656
  16. Edunov, S, Ott, M, Auli, M, and Grangier, D (2018). Understanding back-translation at scale. Available: https://arxiv.org/abs/1808.09381
  17. Hou, Y, Liu, Y, Che, W, and Liu, T (2018). Sequence-to-sequence data augmentation for dialogue language understanding. Available: https://arxiv.org/abs/1807.01554
  18. Anaby-Tavor, A, Carmeli, B, Goldbraich, E, Kantor, A, Kour, G, Shlomov, G, Tepper, N, and Zwerdling, N. Do not have enough data? Deep learning to the rescue! Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 7383-7390. https://doi.org/10.1609/aaai.v34i05.6233
  19. Peng, B, Zhu, C, Zeng, M, and Gao, J (2021). Data augmentation for spoken language understanding via pretrained models. Available: https://arxiv.org/abs/2004.13952
  20. Yoo, KM, Park, D, Kang, J, Lee, SW, and Park, W (2021). GPT3Mix: leveraging large-scale language models for text augmentation. Available: https://arxiv.org/abs/2104.08826
  21. Yu, S, Yang, J, Liu, D, Li, R, Zhang, Y, and Zhao, S (2019). Hierarchical data augmentation and the application in text classification. IEEE Access. 7, 185476-185485. https://doi.org/10.1109/ACCESS.2019.2960263
  22. Zhao, J, Wang, T, Yatskar, M, Ordonez, V, and Chang, KW (2018). Gender bias in coreference resolution: evaluation and debiasing methods. Available: https://arxiv.org/abs/1804.06876
  23. Dai, X, and Adel, H (2020). An analysis of simple data augmentation for named entity recognition. Available: https://arxiv.org/abs/2010.11683
  24. Karimi, A, Rossi, L, and Prati, A (2021). AEDA: an easier data augmentation technique for text classification. Available:
  25. Wei, J, and Zou, K (2019). EDA: easy data augmentation techniques for boosting performance on text classification tasks. Available:
  26. Socher, R, Perelygin, A, Wu, J, Chuang, J, Manning, CD, Ng, AY, and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, Seattle, WA, USA, pp. 1631-1642.
  27. Pang, B, and Lee, L (2004). A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. Available: https://arxiv.org/abs/cs/0409058
  28. Ganapathibhotla, M, and Liu, B. Mining opinions in comparative sentences. Proceedings of the 22nd International Conference on Computational Linguistics (Coling), 2008, Manchester, UK, pp. 241-248.
  29. Hu, M, and Liu, B. Mining and summarizing customer reviews. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, Seattle, WA, USA, pp. 168-177. https://doi.org/10.1145/1014052.1014073
  30. Liu, Q, Gao, Z, Liu, B, and Zhang, Y. Automated rule selection for aspect extraction in opinion mining. Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015, Buenos Aires, Argentina, pp. 1291-1297.
  31. Li, X, and Roth, D. Learning question classifiers. Proceedings of 19th International Conference on Computational Linguistics (Coling), 2002, Taipei, Taiwan, pp. 1-7.
  32. Devlin, J, Chang, MW, Lee, K, and Toutanova, K (2018). BERT: pre-training of deep bidirectional transformers for language understanding. Available: https://arxiv.org/abs/1810.04805
  33. Kim, HS, Lee, JH, and Kim, HB. TF-EDA: efficient and effective text data augmentation. Proceedings of the 24th International Symposium on Advanced Intelligent Systems (ISIS), 2023, Gwangju, South Korea.
  34. Siagh, A, Laallam, FZ, Kazar, O, Salem, H, and Benglia, ME (2023). IDA: an imbalanced data augmentation for text classification. Intelligent Systems and Pattern Recognition, 241-251. https://doi.org/10.1007/978-3-031-46335-8_19
  35. Francis, WN, and Kucera, H (1979). Brown corpus manual. Available: http://korpus.uib.no/icame/brown/bcm.html

Ho-Seung Kim received the B.S. degree in physics from the Korea Military Academy, Seoul, South Korea, in 2013, and the M.S. degree in artificial intelligence from Sungkyunkwan University, Suwon, South Korea, in 2022. He is currently pursuing the Ph.D. degree with Sungkyunkwan University, Suwon, South Korea. His research interests include machine learning, intelligent systems, natural language processing, and sentiment analysis.

Jee-Hyong Lee received the B.S., M.S., and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 1993, 1995, and 1999, respectively. From 2000 to 2002, he was an International Fellow at SRI International, USA. In 2002, he joined Sungkyunkwan University, Suwon, South Korea, as a Faculty Member. His research interests include fuzzy theory and applications, intelligent systems, and machine learning.

Keywords: Data augmentation, Natural language processing, Text classification

1. Introduction

Data augmentation is essential in all fields of artificial intelligence (AI) that deal with data, including natural language processing (NLP) and computer vision (CV) [1, 2]. Deep-learning models learn from a limited training dataset to solve given tasks [3]. When there is a shortage of data, the models may not obtain sufficient information from the dataset. Data augmentation generates new samples from the training dataset, and helps solve tasks with insufficient data. In NLP, the data-augmentation method typically generates new sentences. New sentence generation can be categorized into two approaches. The first is a generation method that uses large language models (LLMs) such as T5 and GPT to generate new sentences [4, 5]. The second type is a modification method that uses simple rules to generate sentences [6].

The generation method, as exemplified by GPT-based approaches, relies on prompts crafted by humans for augmentation [7]. This method can generate varied sentences; however, it has certain drawbacks. One drawback is that human experts must create new prompts, particularly when dealing with data from different domains or languages [8]. Another is that, as LLMs become larger, they require more computational resources and memory owing to the increase in the number of parameters [9]. Therefore, it is difficult to apply this method to large amounts of data. The modification method generates sentences by modifying parts of the original sentence according to rules. It uses simple operations such as deletion, swapping, replacement, and insertion [10]. This method is simple and cost effective, and it can be easily applied to large amounts of data. Its drawback is limited diversity, because it modifies the original sentence with simple operations. If the diversity of the generated sentences is limited, the effect of augmentation can also be limited [11]. Previous modification methods attempted to increase diversity by combining various operations, such as deletion, swapping, replacement, and insertion, to add noise to the original sentences [12, 13]. However, the effectiveness of combining these operations has not yet been proven. We experimentally confirmed that the use of various operations for data augmentation was ineffective. The experimental results are presented in Section 2.

We propose a simpler and more effective augmentation method called STOP-SYM. It uses only one operation, insertion, together with out-of-vocabulary (OOV) tokens. Because OOV tokens do not appear in the training dataset, inserting them into the original sentence may help increase the diversity of the augmented sentences [14]. However, the insertion of a new word may change the meaning of the sentence. We therefore selected OOV tokens that are similar to stopwords and inserted them into the original sentences. Because stopwords change the meaning of a sentence only slightly [15], inserting OOV tokens similar to stopwords generates new sentences that are close in meaning to the original sentences but more diverse. Using the augmented samples generated by the proposed method, we observed a significant improvement in performance. In limited data situations, we achieved a performance improvement of 46.4% and an overall performance improvement of 4.9%. We demonstrated the superiority of our method over previous methods, including GPT-based models, on five different classification datasets for diverse tasks.

2. Related Work

2.1 Generation Methods

There are many generation methods. Before the widespread use of LLMs, methods such as back-translation, which generates sentences with meanings similar to the original sentence, were commonly employed [16]. Apart from back-translation, most methods relied on seq2seq models [17]. Recently, LLMs have been widely adopted and extensively used for data augmentation. LAMBADA [18], SC-GPT [19], and GPT3Mix [20] use GPT models. GPT3Mix is a recent data-augmentation method that utilizes GPT-3. It operates in three steps. First, it samples data from a dataset to create mixed data by selecting two sentences with different labels. Next, it constructs a GPT3Mix prompt from these mixed samples, which the LLM uses to generate new sentences. Finally, it leverages soft labels predicted by the language model to extract knowledge effectively while generating the new text. This method has been shown to achieve superior performance compared to simple data-augmentation techniques. However, it is more complex and relies on the performance of GPT-based models.

2.2 Modification Methods

Unlike generation models with numerous trainable parameters, modification methods employ simple operations that modify parts of a sentence to introduce noise. Previous methods, such as hierarchical data augmentation (HDA) [21], mention replacement (MR) [22], and shuffle within segments (SiS) [23], have adopted various operations, such as deleting unimportant parts, replacing mentions with words of the same entity type, and shuffling segments. An easier data augmentation (AEDA) [24] is a data-augmentation method that relies solely on insertion. It inserts punctuation marks to create a new sentence, with each punctuation mark chosen randomly from six options: {‘.’, ‘;’, ‘?’, ‘:’, ‘!’, ‘,’}. Although it is intuitive and effective, its diversity is relatively low and its performance is limited. Easy data augmentation (EDA) [25] is a representative modification method that generates data samples by applying four operations: synonym replacement (SR), random insertion (RI), random swap (RS), and random deletion (RD). While EDA offers a straightforward way to perform data augmentation, using various operations does not guarantee the diversity of the augmented sentences. We verified how each operation affects the performance of EDA. The experimental results are listed in Table 1. One notable observation was that there was no significant difference between using four operations and using a single operation. In particular, even in the full-data scenario, insertion alone matched EDA. It appears that the four operations do not effectively inject diverse noise.

3. Proposed Method: STOP-SYM

In this section, we propose a new method that uses one operation, insertion, and OOV tokens. We aimed to simplify the noise injection using only insertion. We utilize OOV tokens to introduce noise, but prevent semantic degradation by selecting synonyms of stopwords from the OOV tokens. This approach is more effective and yields better performance than previous methods.

3.1 Synonym of Stopwords

In this section, we explain the selection of OOV words and synonyms of stopwords. We expect to increase the diversity of sentences by inserting words not present in the training dataset. Inserting any word can easily increase the diversity; however, we need to avoid critical words that significantly influence class determination. Critical words are used frequently in specific classes; therefore, we used OOV words to avoid them. If we uniformly insert OOV words into all sentences regardless of class, the probability of them becoming critical words is low. In this way, using OOV words prevents the insertion of critical words.

However, we cannot prevent semantic degradation caused by using OOV words.

To select words that cause little semantic degradation of sentences, we propose using words that carry little meaning, namely stopword synonyms. Stopwords are common words that occur frequently but contribute little to the meaning of a sentence. They are likely to be present in most datasets, so finding stopwords among OOV words is almost impossible. Therefore, we used synonyms of stopwords, which have meanings similar to stopwords. We selected words that were both OOV and synonyms of stopwords as candidates for insertion. Using the selected words, we can easily increase the diversity of the augmented sentences while preventing their semantic degradation.

The stopwords were obtained from the ‘NLTK’ package. Synonyms were obtained using the ‘synsets’ lookup of WordNet. There were 179 stopwords, and these stopwords had 815 synonyms. From these synonyms, the OOV words, that is, the words not present in the training dataset, were selected. Table 2 lists the number of candidates, denoted by N, for each dataset along with examples. For most of the datasets, N amounted to roughly 40% to 50% of the synonym candidates. Numerous synonyms of stopwords exist in most datasets, but the OOV condition reduces their count. The examples we present are infrequently used words. These words are unrelated to their respective domains; therefore, they are not shortcuts for determining the class.
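The candidate selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `build_vocabulary` and `select_candidates`, the lowercase whitespace tokenization, and the hand-written toy synonym table are all hypothetical. In practice the synonym table would be built from the NLTK stopword list and WordNet, as stated in the text.

```python
def build_vocabulary(sentences):
    """Collect the lowercase whitespace tokens seen in the training sentences."""
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence.lower().split())
    return vocab


def select_candidates(stopword_synonyms, train_sentences):
    """Return stopword synonyms that never occur in the training data (OOV)."""
    vocab = build_vocabulary(train_sentences)
    synonyms = set()
    for words in stopword_synonyms.values():
        synonyms.update(word.lower() for word in words)
    # Intersection of "synonym of a stopword" and "out of vocabulary".
    return sorted(synonyms - vocab)


# Toy inputs for illustration only (not real WordNet output).
toy_synonyms = {"me": ["maine"], "in": ["inward", "in"], "be": ["exist", "beingness"]}
toy_train = ["Too clumsy in key moments", "They exist only on paper"]
print(select_candidates(toy_synonyms, toy_train))
```

Because the filter is a plain set difference, the number of candidates N depends only on the training vocabulary, which is why N differs across datasets in Table 2.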

3.2 Data Augmentation by Insertion

In this section, we explain how we conducted data augmentation using stopword synonyms. We inserted a randomly chosen stopword synonym at a random position in a sentence and repeated this random insertion k times for each sentence. The value of k ranged from 1 to one-third of the sentence length; in other words, the maximum number of inserted words was one-third of the sentence length. The data generated in this manner were added to the training data and used as input samples. Our approach not only preserves the original input tokens, but also significantly broadens the variation within sentences.
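The insertion procedure described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the whitespace tokenization, the uniform choice of k, the lower bound of one insertion for very short sentences, and the default of four augmented samples per sentence are assumptions made here.

```python
import random


def stop_sym_augment(sentence, candidates, rng):
    """Insert k randomly chosen candidate words at random positions."""
    tokens = sentence.split()
    # k ranges from 1 to one-third of the sentence length (at least 1).
    max_k = max(1, len(tokens) // 3)
    k = rng.randint(1, max_k)
    for _ in range(k):
        position = rng.randint(0, len(tokens))  # len(tokens) appends at the end
        tokens.insert(position, rng.choice(candidates))
    return " ".join(tokens)


def augment_dataset(sentences, candidates, n_aug=4, seed=0):
    """Return the original sentences followed by n_aug augmented copies of each."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for sentence in sentences:
        for _ in range(n_aug):
            augmented.append(stop_sym_augment(sentence, candidates, rng))
    return augmented


candidates = ["maine", "unity", "nobelium"]  # example candidates taken from Table 2
for line in augment_dataset(["Too clumsy in key moments to make a big splash"], candidates):
    print(line)
```

Since the only operation is insertion, every original token survives in its original order, which is the property the text relies on to limit semantic degradation.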

Table 3 lists augmented samples. In the augmented sentences, many infrequently used or less meaningful words have been inserted. The proposed method introduces additional words that are OOV and stopword-like while preserving the original input tokens. The augmented sentences can therefore be considered more diverse than those generated using previous methods. Figure 1 visualizes sentences augmented with STOP-SYM and with EDA; the sentences were vectorized and projected using t-SNE. The data augmented with STOP-SYM form a more scattered pattern than the data augmented with EDA, indicating greater diversity.

4. Experiment Setup

To compare with the EDA and AEDA methods, we employed the same code base. We fine-tuned the pretrained language model BERT (bidirectional encoder representations from transformers) on the generated augmented samples together with the original training samples. We then measured the accuracy of this model's predictions on the test samples and used it as the primary metric. The pretrained language model was the BERT-base model provided by Hugging Face. The code was executed using a GeForce RTX 2080 GPU with 12 GB of memory.

4.1 Dataset

The performance of the proposed method was evaluated under various environments. Our method is expected to be applicable to all text-classification tasks. To validate this, we used five text-classification datasets with different domains and sizes. SST-2 [26] is a sentiment analysis dataset of movie reviews containing 6,920 training samples. SUBJ [27] is a subjectivity analysis dataset of movie reviews with 8,000 training samples. PC [28] is a pros-and-cons classification dataset with 39,428 training samples. CR [29, 30] is a sentiment analysis dataset of customer reviews and contains 1,863 training samples. TREC [31] is a question classification dataset with 5,452 training samples.

In an additional setting, we evaluated performance while controlling the data ratio. Simple modification methods are heavily influenced by the quantity of data. They are effective for small quantities of data but less effective for large quantities because of their limited diversity. We aimed to demonstrate the effectiveness of our method regardless of the data quantity and therefore applied various dataset ratios: 1%, 5%, 10%, 20%, 40%, 60%, 80%, and 100%. In addition, we compared the overall performance averaged over all data ratios.

4.2 Model

Previous methods were evaluated with sequential models; in contrast, we used a pretrained language model. The EDA paper [25] suggests that data augmentation may not be necessary for BERT because it is already pretrained on a large corpus with a vast vocabulary. Similarly, AEDA [24] primarily focused on overall performance comparisons with RNN- and LSTM-based models. However, pretrained language models have recently become essential. We therefore evaluated the effectiveness of our data-augmentation method with a pretrained language model, BERT [32]. We used the ‘bert-base-uncased’ model provided by Hugging Face as the base pretrained model and used the [CLS] token representation for classification.

5. Experiment Results

To evaluate our augmentation method, we compared it with four modification methods: 1) EDA is a representative simple data-augmentation method [25]. 2) AEDA is a method similar to ours that relies solely on the insertion of punctuation marks [24]. 3) TF-Low uses TF-IDF (term frequency-inverse document frequency) scores. It selects unimportant words by TF-IDF score and randomly inserts them into the sentence [33]. 4) STOP uses stopwords. Random stopwords are selected and inserted into the sentences [34].

Table 4 lists the experimental results on the five datasets. Only Original used only the original data for fine-tuning. For the other methods, the original training data were augmented with the respective method, and both the augmented and original training data were used for fine-tuning. Each experiment was repeated with five different random seeds, and the results were averaged.

5.1 Overall Performance

Because all datasets showed similar results, we first examined the SST-2 [26] dataset. STOP-SYM achieved the best overall performance, followed by AEDA, STOP, TF-Low, EDA, and Only Original. STOP-SYM also performed well at most data ratios. Comparing STOP-SYM with Only Original, we observed a 46.5% relative performance improvement with 1% of the dataset and a 0.4% improvement with the full dataset. The overall performance improved by 5%. This demonstrates the effectiveness of the proposed method.
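The improvement figures above appear to be relative gains over the Only Original baseline, which can be checked against the SST-2 rows of Table 4. The short sketch below is only a verification aid; the helper name `relative_improvement` is hypothetical, and the accuracies are copied from Table 4.

```python
def relative_improvement(baseline, augmented):
    """Relative gain of `augmented` over `baseline`, in percent."""
    return (augmented - baseline) / baseline * 100.0


# SST-2 accuracies from Table 4: (Only Original, STOP-SYM).
sst2 = {"1%": (55.7, 81.6), "100%": (91.2, 91.6), "Overall": (84.6, 88.8)}
for setting, (base, aug) in sst2.items():
    print(f"{setting}: {relative_improvement(base, aug):.1f}%")
```

Rounded to one decimal place, these come out at 46.5%, 0.4%, and 5.0%, in line with the figures quoted in the text.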

We demonstrated the effectiveness of our approach by comparing it with various previous methods. First, a single insertion operation was more effective than the combination of four operations. To determine whether using insertion alone was effective, we compared STOP-SYM with EDA. STOP-SYM achieved a 3.2% improvement in overall performance and outperformed EDA at every data ratio. Second, the words selected using the proposed method were suitable for insertion-based augmentation. Except for EDA, the compared methods use only the insertion operation. Because these insertion-only methods differ in performance, the effectiveness of insertion-based augmentation depends on the insertion target. STOP-SYM achieved improvements of 0.7%, 2.1%, and 1.5% over AEDA, TF-Low, and STOP, respectively. This implies that the better the insertion targets, the better the augmentation performance.

We also show that STOP-SYM performs robustly across classification tasks and domains and is not limited to sentiment classification. To confirm this, we applied STOP-SYM to various classification tasks, including subjectivity, pros-and-cons, and question classification. Compared to Only Original, STOP-SYM showed performance improvements of 1.3% on the SUBJ dataset, 3% on PC, and 13.6% on TREC. STOP-SYM inserts synonyms of stopwords that are not present in the training dataset. Therefore, it exhibits robust performance in various classification tasks regardless of their specific characteristics. In sentiment classification, STOP-SYM is effective in two domains: movie reviews and customer reviews. It achieved a 6.2% improvement in performance in the CR domain. Although STOP-SYM has not been tested in many domains, the results suggest that it has no domain-specific limitations and can achieve superior performance in various domains.

6. Ablation Study

In this section, additional experimental results are presented to verify the effectiveness of the proposed approach. We performed an ablation study on OOV words, stopword synonyms, and the number of augmented samples. We also present the comparative result using a generation method, GPT3Mix.

6.1 Effectiveness of OOV

In this section, we study the effectiveness of OOV. We used the SST-2 dataset for evaluation, and the other experimental settings were the same as those in the main experiment. Figure 2 shows the results of this comparison. We compared four methods: Only Original, SS-IV, SS-ALL, and SS-OOV. Only Original uses only the original training data. The others insert words selected from different vocabularies. Synonyms of stopwords-in vocabulary (SS-IV) uses only the synonyms of stopwords that are present in the training corpus. Synonyms of stopwords-all (SS-ALL) uses all synonyms of stopwords. Synonyms of stopwords-out of vocabulary (SS-OOV) is the proposed method. From 40% onwards, the results showed a similar pattern; therefore, we omitted the details. The differences were significant when 20% or less of the dataset was used, where the performance of each method is easy to compare.
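The three insertion pools compared in this ablation differ only in how the stopword synonyms are filtered against the training vocabulary. A minimal sketch of that split is given below; the function name `split_synonym_pools` and the toy inputs are hypothetical and serve only to make the definitions concrete.

```python
def split_synonym_pools(synonyms, train_vocab):
    """Split stopword synonyms into the SS-IV, SS-ALL, and SS-OOV pools."""
    all_synonyms = set(synonyms)
    in_vocab = all_synonyms & set(train_vocab)
    return {
        "SS-IV": sorted(in_vocab),                  # synonyms seen in training
        "SS-ALL": sorted(all_synonyms),             # every synonym
        "SS-OOV": sorted(all_synonyms - in_vocab),  # synonyms never seen (proposed)
    }


# Toy inputs for illustration only.
pools = split_synonym_pools(["maine", "inward", "exist"], {"they", "exist", "only"})
for name, words in pools.items():
    print(name, words)
```

By construction SS-ALL is the union of SS-IV and SS-OOV, so comparing the three pools isolates the contribution of the OOV condition.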

These results indicate that both SS-IV and SS-ALL show significantly lower performances than SS-OOV. This demonstrates the effectiveness of using OOV words as insertion targets. Interestingly, SS-IV and SS-ALL exhibited similar performance. This means that synonyms of stopwords in vocabulary cannot provide a substantial benefit. This demonstrates the potential negative effects of data augmentation.

6.2 Effectiveness of Synonyms

In this section, we examine whether stopword synonyms are effective. To assess this, we used words from the Brown Corpus [35] provided by NLTK instead of stopword synonyms. STOP-SYM used 321 stopword synonyms for the SST-2 dataset. To ensure fairness, 321 words were randomly sampled from the Brown Corpus. Figure 3 shows a comparison of the experimental results. Brown denotes the method that used these 321 words as insertion targets. These words are OOV and are not synonyms of stopwords.
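The size-matched baseline described above can be sketched as a filtered random sample. The sketch assumes a plain list of corpus words as input (in the paper these come from the Brown Corpus via NLTK); the function name `sample_random_oov`, the alphabetic-token filter, lowercase inputs for the two exclusion sets, and the fixed seed are assumptions introduced here.

```python
import random


def sample_random_oov(corpus_words, train_vocab, stopword_synonyms, size, seed=0):
    """Sample `size` distinct words that are OOV and not stopword synonyms."""
    excluded = set(train_vocab) | set(stopword_synonyms)
    # Keep alphabetic tokens only, lowercase them, and drop excluded words.
    pool = sorted({word.lower() for word in corpus_words if word.isalpha()} - excluded)
    if size > len(pool):
        raise ValueError("not enough eligible words to sample from")
    return random.Random(seed).sample(pool, size)


# Toy inputs for illustration only.
words = ["The", "Fulton", "County", "jury", "said", "maine", "investigation", "1961"]
print(sample_random_oov(words, {"the", "said"}, {"maine"}, size=3))
```

Matching the sample size to the number of STOP-SYM candidates (321 for SST-2) keeps the two conditions comparable, so that any difference can be attributed to the kind of word inserted rather than to the size of the pool.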

We used the SST-2 dataset for this experiment, maintaining the same experimental setup as that described in Section 6.1. Our proposed SS-OOV demonstrated superior performance compared to Only Original and Brown. This suggests the effectiveness of OOV words for insertion. However, the lower performance of Brown compared to SS-OOV shows that incorporating OOV words within the synonyms of stopwords is more effective than randomly sampling OOV words.

6.3 Augmented Sample Size

We conducted an ablation study to investigate how the number of augmented samples affects the performance of STOP-SYM. The number of samples was chosen from the set {2, 4, 6, 8, 10, 16}. Sparse subsets were used for a more detailed analysis. We used the SST-2 dataset, and the experimental environment was the same as before. The results are summarized in Table 5.

When using only 1% of the data, we observed performance improvement as the number of samples increased. This suggests that in limited dataset scenarios, such as 1%, having more data is beneficial for performance. However, in situations with sufficient data (full dataset), we observed a tendency for the performance to decrease when a certain number of samples were generated. This may be indicative of overfitting, where an excessive number of samples leads to a model with bias. From this experiment, we conclude that it is important to select an appropriate number of samples.

6.4 Comparison with Generation Method

The advantages of the modification method are its low complexity and cost. This raises the question of whether performance suffers as complexity and cost decrease. It is necessary to verify how much the performance degrades compared with recent generation methods. To address this, we compared our results with those of GPT3Mix, a data-augmentation method that uses LLMs. We cited the performance results reported for GPT3Mix [20] and conducted our experiments with the other conditions adjusted to match. Data augmentation generated ten times as many samples as the original data. The original data subset ratios were set to {0.1%, 0.3%, 1%} to simulate few-shot scenarios. The BERT model was used for training, and the other parameters were kept consistent for a fair evaluation. The experimental results are listed in Table 6.

Comparative experiments were conducted on three datasets: SST-2, TREC, and SUBJ. Across the different ratios, there was no significant difference in the average performance between STOP-SYM and GPT3Mix. When 1% of the dataset was used, STOP-SYM outperformed GPT3Mix by approximately 1%, whereas when 0.1% of the dataset was used, GPT3Mix outperformed STOP-SYM by approximately 1%. The greatest advantage of LLMs lies in their performance in few-shot scenarios, and the proposed STOP-SYM demonstrated competitive performance even in such situations. Moreover, GPT3Mix uses GPT-3 and therefore requires more computational resources and memory than the proposed method. These results demonstrate the value of the proposed approach.

7. Conclusion

We propose a new data-augmentation method that relies solely on insertion. OOV words extend the corpus beyond the training dataset, whereas synonyms of stopwords act as white noise that minimizes their impact on learning. STOP-SYM selects words that are both OOV and synonyms of stopwords and uses them for insertion-based data augmentation. We demonstrated its effectiveness across various text-classification tasks, showing superior performance compared with previous methods. Furthermore, through experiments, we highlighted the significance of the characteristics of OOV words and synonyms of stopwords.

Figure 1. Visualization of diversity comparison between EDA and STOP-SYM through t-SNE.

Figure 2. Comparison experiment results of OOV effect.

Figure 3. Comparison experiment results of synonyms effect.

Table 1. Performance of each individual operation in EDA.

| Method | 1% | 10% | 100% | Average |
|---|---|---|---|---|
| EDA | 67.1 | 87.1 | 91.3 | 81.8 |
| Deletion | 67.5 | 86.7 | 91.0 | 81.7 |
| Swap | 67.4 | 86.1 | 90.8 | 81.4 |
| Replacement | 67.2 | 86.3 | 90.9 | 81.4 |
| Insertion | 67.4 | 87.0 | 91.3 | 81.9 |

The bold font indicates the best performance in each test. We used the SST-2 dataset to generate four samples for each sentence. We fine-tuned BERT with the augmented dataset for classification. We performed the experiments with various fractions of the dataset: 1%, 10%, and 100%. In the case of 1%, we used only 1% of the dataset for augmentation and model training.


Table 2. Synonyms of stopwords examples.

| Data | N | Synonyms of stopwords |
|---|---|---|
| SST-2 | 321 | ‘maine’, ‘tween’, ‘coiffe’, ‘sulphur’, ‘nobelium’, ‘inward’, ‘afterward’, ‘backside’, ‘beingness’ |
| SUBJ | 306 | ‘buttocks’, ‘potty’, ‘soh’, ‘iodin’, ‘lav’, ‘nether’, ‘slay’, ‘posterior’, ‘ampere’, ‘kod’ |
| PC | 363 | ‘coif’, ‘crapper’, ‘momma’, ‘oxygen’, ‘lonesome’, ‘commode’, ‘mama’, ‘suffice’, ‘sour’, ‘embody’ |
| CR | 400 | ‘keister’, ‘dress’, ‘rump’, ‘bum’, ‘apiece’, ‘fanny’, ‘oregon’, ‘unity’, ‘polish’, ‘saami’, ‘metre’ |
| TREC | 352 | ‘buns’, ‘bequeath’, ‘practise’, ‘seaport’, ‘astir’, ‘ar’, ‘assume’, ‘boost’, ‘milliampere’ |

Table 3. Example of augmentation samples.

| Data | Sentence |
|---|---|
| Original | Too clumsy in key moments to make a big splash |
| Sample 1 | Too clumsy unity in key moments to oasis make a big splash front |
| Sample 2 | Too clumsy in key moments to make hundred a big splash |
| Sample 3 | Too maine clumsy inwards in key coiffe moments to make a big splash |
| Sample 4 | Hera too clumsy in key moments to make a big splash nobelium |

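Table 4. Overall performance on each dataset.

| Data | Method | 1% | 5% | 10% | 20% | 40% | 60% | 80% | 100% | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| SST-2 | Only original | 55.7 | 84.5 | 86.7 | 88.1 | 89.3 | 90.2 | 90.8 | 91.2 | 84.6 |
| SST-2 | EDA | 67.1 | 84.2 | 87.1 | 88.3 | 89.2 | 90.3 | 90.5 | 91.3 | 86.0 |
| SST-2 | AEDA | 77.8 | 86.5 | 89.0 | 88.6 | 89.8 | 90.8 | 91.1 | 91.2 | 88.1 |
| SST-2 | TF-Low | 70.7 | 85.1 | 87.6 | 89.0 | 89.8 | 90.6 | 91.0 | 91.4 | 86.9 |
| SST-2 | STOP | 77.5 | 85.6 | 87.9 | 88.2 | 89.1 | 90.3 | 90.3 | 91.1 | 87.5 |
| SST-2 | STOP-SYM | 81.6 | 87.3 | 88.7 | 89.2 | 90.0 | 91.0 | 91.2 | 91.6 | 88.8 |
| SUBJ | Only original | 87.2 | 92.4 | 94.4 | 95.0 | 95.6 | 96.1 | 96.1 | 96.3 | 94.1 |
| SUBJ | EDA | 91.7 | 94.1 | 95.1 | 95.0 | 95.8 | 95.6 | 96.5 | 96.0 | 95.0 |
| SUBJ | AEDA | 91.5 | 93.3 | 94.7 | 95.5 | 95.5 | 95.8 | 95.6 | 96.5 | 94.8 |
| SUBJ | TF-Low | 87.3 | 93.5 | 94.2 | 95.5 | 95.9 | 95.9 | 96.0 | 96.6 | 94.4 |
| SUBJ | STOP | 91.0 | 94.3 | 94.7 | 94.9 | 94.2 | 94.8 | 95.9 | 96.0 | 94.5 |
| SUBJ | STOP-SYM | 91.9 | 94.3 | 95.3 | 95.6 | 95.9 | 96.2 | 96.1 | 96.7 | 95.3 |
| PC | Only original | 72.3 | 91.7 | 92.9 | 93.5 | 94.2 | 94.5 | 94.9 | 95.2 | 91.2 |
| PC | EDA | 88.1 | 92.3 | 93.0 | 93.8 | 94.3 | 94.3 | 94.7 | 95.0 | 93.2 |
| PC | AEDA | 90.7 | 92.6 | 93.3 | 93.7 | 94.8 | 94.8 | 94.9 | 95.4 | 93.8 |
| PC | TF-Low | 90.6 | 92.6 | 93.2 | 94.0 | 94.6 | 94.7 | 94.8 | 95.2 | 93.7 |
| PC | STOP | 89.2 | 92.1 | 92.9 | 93.4 | 94.1 | 94.3 | 94.7 | 95.2 | 93.2 |
| PC | STOP-SYM | 90.4 | 92.4 | 93.7 | 94.3 | 95.1 | 95.0 | 95.2 | 95.5 | 94.0 |
| CR | Only original | 63.2 | 66.5 | 68.4 | 71.7 | 84.0 | 84.8 | 86.2 | 87.8 | 76.6 |
| CR | EDA | 54.6 | 75.1 | 79.6 | 82.5 | 86.6 | 86.0 | 87.3 | 85.7 | 79.7 |
| CR | AEDA | 63.1 | 74.7 | 79.9 | 84.5 | 86.1 | 86.0 | 84.8 | 87.1 | 80.8 |
| CR | TF-Low | 63.4 | 68.8 | 80.2 | 80.9 | 85.7 | 85.3 | 86.5 | 86.2 | 79.6 |
| CR | STOP | 61.7 | 76.1 | 80.8 | 81.6 | 85.9 | 85.9 | 86.2 | 86.2 | 80.6 |
| CR | STOP-SYM | 63.2 | 76.6 | 81.0 | 83.3 | 87.5 | 86.0 | 86.6 | 87.1 | 81.4 |
| TREC | Only original | 37.2 | 60.6 | 79.4 | 81.2 | 94.3 | 95.0 | 95.4 | 95.6 | 79.8 |
| TREC | EDA | 62.6 | 80.8 | 92.3 | 93.4 | 95.2 | 96.4 | 95.4 | 96.0 | 89.0 |
| TREC | AEDA | 62.4 | 84.8 | 92.2 | 94.0 | 95.4 | 95.6 | 95.5 | 96.5 | 89.6 |
| TREC | TF-Low | 60.6 | 86.0 | 92.0 | 93.7 | 95.7 | 96.4 | 96.0 | 96.1 | 89.6 |
| TREC | STOP | 62.3 | 85.0 | 91.4 | 94.4 | 95.5 | 96.2 | 96.2 | 96.4 | 89.7 |
| TREC | STOP-SYM | 63.9 | 87.1 | 93.6 | 95.2 | 96.1 | 96.5 | 96.4 | 96.5 | 90.7 |

The bold font indicates the best performance in each test.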
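
The table does not say how the Overall column is computed. For the SST-2 rows, the reported values are consistent with an unweighted mean over the eight training fractions, rounded to one decimal. The sketch below checks that reading; the averaging rule is an inference from the numbers, not something the table states, and `overall` is a hypothetical helper name.

```python
from statistics import mean
from typing import Dict, List

# SST-2 accuracies from Table 4, at 1%, 5%, 10%, 20%, 40%, 60%, 80%, 100%.
sst2_scores: Dict[str, List[float]] = {
    "Only original": [55.7, 84.5, 86.7, 88.1, 89.3, 90.2, 90.8, 91.2],
    "EDA": [67.1, 84.2, 87.1, 88.3, 89.2, 90.3, 90.5, 91.3],
    "AEDA": [77.8, 86.5, 89.0, 88.6, 89.8, 90.8, 91.1, 91.2],
    "TF-Low": [70.7, 85.1, 87.6, 89.0, 89.8, 90.6, 91.0, 91.4],
    "STOP": [77.5, 85.6, 87.9, 88.2, 89.1, 90.3, 90.3, 91.1],
    "STOP-SYM": [81.6, 87.3, 88.7, 89.2, 90.0, 91.0, 91.2, 91.6],
}

# Overall values as reported in Table 4 for the same rows.
sst2_reported: Dict[str, float] = {
    "Only original": 84.6,
    "EDA": 86.0,
    "AEDA": 88.1,
    "TF-Low": 86.9,
    "STOP": 87.5,
    "STOP-SYM": 88.8,
}


def overall(scores: List[float]) -> float:
    """Unweighted mean over the training fractions, rounded to one decimal."""
    return round(mean(scores), 1)


for method, scores in sst2_scores.items():
    print(f"{method}: computed {overall(scores)}, reported {sst2_reported[method]}")
```

An unweighted mean gives the 1% setting the same weight as the 100% setting, so the Overall column mostly reflects the low-resource fractions, where the methods differ the most.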


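Table 5. Performance and the number of augmented samples.

| # of samples | 1% | 10% | 100% | Average |
|---|---|---|---|---|
| 2 | 80.5 | 87.0 | 91.5 | 86.3 |
| 4 | 81.6 | 88.7 | 91.6 | 87.3 |
| 6 | 81.5 | 89.0 | 91.9 | 87.4 |
| 8 | 82.0 | 88.5 | 91.4 | 87.3 |
| 10 | 82.2 | 87.6 | 91.3 | 87.0 |
| 16 | 82.4 | 88.0 | 91.0 | 87.1 |

The bold font indicates the best performance in each test.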


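Table 6. Comparison with GPT3Mix.

| Data | 0.1% GPT3Mix | 0.1% STOP-SYM | 0.3% GPT3Mix | 0.3% STOP-SYM | 1% GPT3Mix | 1% STOP-SYM |
|---|---|---|---|---|---|---|
| SST-2 | 78.0 | 78.2 | 84.9 | 82.9 | 87.7 | 82.9 |
| TREC | 47.7 | 44.0 | 57.8 | 54.6 | 60.5 | 68.2 |
| SUBJ | 85.4 | 85.9 | 87.5 | 87.1 | 90.6 | 90.3 |
| Average | 70.4 | 69.4 | 76.7 | 74.9 | 79.6 | 80.5 |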

References

  1. Van Dyk, DA, and Meng, XL (2001). The art of data augmentation. Journal of Computational and Graphical Statistics. 10, 1-50. https://doi.org/10.1198/10618600152418584
  2. Frid-Adar, M, Klang, E, Amitai, M, Goldberger, J, and Greenspan, H (2018). Synthetic data augmentation using GAN for improved liver lesion classification. Proceedings of 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, pp. 289-293. https://doi.org/10.1109/ISBI.2018.8363576
  3. Shorten, C, Khoshgoftaar, TM, and Furht, B (2021). Text data augmentation for deep learning. Journal of Big Data. 8, article no. 101. https://doi.org/10.1186/s40537-021-00492-0
  4. Bird, JJ, Ekart, A, and Faria, DR (2023). Chatbot interaction with artificial intelligence: human data augmentation with T5 and language transformer ensemble for text classification. Journal of Ambient Intelligence and Humanized Computing. 14, 3129-3144. https://doi.org/10.1007/s12652-021-03439-8
  5. Chang, Y, Zhang, R, and Pu, J (2023). I-WAS: a data augmentation method with GPT-2 for simile detection. Document Analysis and Recognition – ICDAR 2023. Cham, Switzerland: Springer, pp. 265-279. https://doi.org/10.1007/978-3-031-41682-8_17
  6. Wu, Z, Wang, S, Gu, J, Khabsa, M, Sun, F, and Ma, H (2022). CLEAR: contrastive learning for sentence representation. Available: https://arxiv.org/abs/2012.15466
  7. Dai, H, Liu, Z, Liao, W, Huang, X, Cao, Y, and Wu, Z (2023). AugGPT: Leveraging ChatGPT for text data augmentation. Available: https://arxiv.org/abs/2302.13007
  8. Gao, T, Fisch, A, and Chen, D (2020). Making pre-trained language models better few-shot learners. Available: https://arxiv.org/abs/2012.15723
  9. Chen, L, Zaharia, M, and Zou, J (2023). FrugalGPT: how to use large language models while reducing cost and improving performance. Available: https://arxiv.org/abs/2305.05176
  10. Shou, Z, Jiang, Y, and Lin, F (2022). AMR-DA: data augmentation by abstract meaning representation. Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, pp. 3082-3098. https://doi.org/10.18653/v1/2022.findings-acl.244
  11. Bae, J, and Lee, C (2021). Easy data augmentation for improved malware detection: a comparative study. Proceedings of 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju Island, South Korea, pp. 214-218. https://doi.org/10.1109/BigComp51126.2021.00048
  12. Zhong, Z, Zheng, L, Kang, G, Li, S, and Yang, Y (2020). Random erasing data augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 13001-13008. https://doi.org/10.1609/aaai.v34i07.7000
  13. Pervaiz, A, Hussain, F, Israr, H, Tahir, MA, Raja, FR, Baloch, NK, Ishmanov, F, and Zikria, YB (2020). Incorporating noise robustness in speech command recognition by noise augmentation of training data. Sensors. 20, article no. 2326. https://doi.org/10.3390/s20082326
  14. Liu, NF, May, J, Pust, M, and Knight, K (2018). Augmenting statistical machine translation with subword translation of out-of-vocabulary words. Available: https://arxiv.org/abs/1808.05700
  15. Silva, C, and Ribeiro, B (2003). The importance of stop word removal on recall values in text categorization. Proceedings of the International Joint Conference on Neural Networks, Portland, OR, USA, pp. 1661-1666. https://doi.org/10.1109/IJCNN.2003.1223656
  16. Edunov, S, Ott, M, Auli, M, and Grangier, D (2018). Understanding back-translation at scale. Available: https://arxiv.org/abs/1808.09381
  17. Hou, Y, Liu, Y, Che, W, and Liu, T (2018). Sequence-to-sequence data augmentation for dialogue language understanding. Available: https://arxiv.org/abs/1807.01554
  18. Anaby-Tavor, A, Carmeli, B, Goldbraich, E, Kantor, A, Kour, G, Shlomov, G, Tepper, N, and Zwerdling, N (2020). Do not have enough data? Deep learning to the rescue! Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7383-7390. https://doi.org/10.1609/aaai.v34i05.6233
  19. Peng, B, Zhu, C, Zeng, M, and Gao, J (2021). Data augmentation for spoken language understanding via pretrained models. Available: https://arxiv.org/abs/2004.13952
  20. Yoo, KM, Park, D, Kang, J, Lee, SW, and Park, W (2021). GPT3Mix: leveraging large-scale language models for text augmentation. Available: https://arxiv.org/abs/2104.08826
  21. Yu, S, Yang, J, Liu, D, Li, R, Zhang, Y, and Zhao, S (2019). Hierarchical data augmentation and the application in text classification. IEEE Access. 7, 185476-185485. https://doi.org/10.1109/ACCESS.2019.2960263
  22. Zhao, J, Wang, T, Yatskar, M, Ordonez, V, and Chang, KW (2018). Gender bias in coreference resolution: evaluation and debiasing methods. Available: https://arxiv.org/abs/1804.06876
  23. Dai, X, and Adel, H (2020). An analysis of simple data augmentation for named entity recognition. Available: https://arxiv.org/abs/2010.11683
  24. Karimi, A, Rossi, L, and Prati, A (2021). AEDA: an easier data augmentation technique for text classification. Available:
  25. Wei, J, and Zou, K (2019). EDA: easy data augmentation techniques for boosting performance on text classification tasks. Available:
  26. Socher, R, Perelygin, A, Wu, J, Chuang, J, Manning, CD, Ng, AY, and Potts, C (2013). Recursive deep models for semantic compositionality over a sentiment treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, pp. 1631-1642.
  27. Pang, B, and Lee, L (2004). A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. Available: https://arxiv.org/abs/cs/0409058
  28. Ganapathibhotla, M, and Liu, B (2008). Mining opinions in comparative sentences. Proceedings of the 22nd International Conference on Computational Linguistics (Coling), Manchester, UK, pp. 241-248.
  29. Hu, M, and Liu, B (2004). Mining and summarizing customer reviews. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pp. 168-177. https://doi.org/10.1145/1014052.1014073
  30. Liu, Q, Gao, Z, Liu, B, and Zhang, Y (2015). Automated rule selection for aspect extraction in opinion mining. Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, pp. 1291-1297.
  31. Li, X, and Roth, D (2002). Learning question classifiers. Proceedings of 19th International Conference on Computational Linguistics (Coling), Taipei, Taiwan, pp. 1-7.
  32. Devlin, J, Chang, MW, Lee, K, and Toutanova, K (2018). BERT: pre-training of deep bidirectional transformers for language understanding. Available: https://arxiv.org/abs/1810.04805
  33. Kim, HS, Lee, JH, and Kim, HB (2023). TF-EDA: efficient and effective text data augmentation. Proceedings of the 24th International Symposium on Advanced Intelligent Systems (ISIS), Gwangju, South Korea.
  34. Siagh, A, Laallam, FZ, Kazar, O, Salem, H, and Benglia, ME (2023). IDA: an imbalanced data augmentation for text classification. Intelligent Systems and Pattern Recognition, pp. 241-251. https://doi.org/10.1007/978-3-031-46335-8_19
  35. Francis, WN, and Kucera, H (1979). Brown corpus manual. Available: http://korpus.uib.no/icame/brown/bcm.html
