International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(2): 83-92
Published online June 25, 2024
https://doi.org/10.5391/IJFIS.2024.24.2.83
© The Korean Institute of Intelligent Systems
Ho-Seung Kim1 and Jee-Hyong Lee2
1Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Korea
2Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Korea
Correspondence to: Jee-Hyong Lee (john@skku.edu)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Data augmentation generates additional samples to expand a dataset. The modification method is a commonly used augmentation technique because of its simplicity: it modifies the words in sentences using simple rules. It has the advantages of low complexity and cost, because it only needs to scan a dataset sequentially without requiring complex computations. Despite its simplicity, it has a drawback: it uses only the corpus of the training dataset, leading to repeated learning of the same words and limited diversity. In this study, we propose STOP-SYM, which is simpler and more effective than previous methods while addressing this drawback. Previous simple data-augmentation methods used various operations, such as deletion, insertion, replacement, and swapping, to inject diverse noise. The proposed method, STOP-SYM, generates sentences by simply inserting words. STOP-SYM uses the intersection of out-of-vocabulary (OOV) words and synonyms of stopwords. OOV words enable the use of a corpus beyond the training dataset, and synonyms of stopwords act as white noise that minimizes their impact on training. By inserting these words into sentences, augmented samples that increase the diversity of the dataset can be easily obtained. Ultimately, compared with recent simple data-augmentation methods, our approach demonstrates superior performance. We also conducted comparative experiments on various text-classification datasets and against a GPT-based model to demonstrate its superiority.
Keywords: Data augmentation, Natural language processing, Text classification
No potential conflict of interest relevant to this article was reported.
Visualization of diversity comparison between EDA and STOP-SYM through t-SNE.
Comparison experiment results of OOV effect.
Comparison experiment results of synonyms effect.
Table 1. Performance of each individual operation in EDA.

| Method | 1% | 10% | 100% | Average |
|---|---|---|---|---|
| EDA | 67.1 | | | 81.8 |
| Deletion | | 86.7 | 91.0 | 81.7 |
| Swap | 67.4 | 86.1 | 90.8 | 81.4 |
| Replacement | 67.2 | 86.3 | 90.9 | 81.4 |
| Insertion | 67.4 | 87.0 | | |
The bold font indicates the best performance in each test. We used the SST-2 dataset and generated four samples for each sentence. We fine-tuned BERT on the augmented dataset for classification. We performed the experiments with various fractions of the dataset: 1%, 10%, and 100%. In the case of 1%, only 1% of the dataset was used for augmentation and model training.
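For reference, the four operations compared in Table 1 follow the EDA recipe of random deletion, random swap, synonym replacement, and random insertion. The snippet below is a minimal Python sketch of these operations, assuming NLTK's WordNet as the synonym source; the probabilities and counts are illustrative defaults, not the exact EDA settings.

```python
# Minimal sketch of the four EDA-style operations compared in Table 1.
# Assumes nltk with the 'wordnet' data package; parameters are illustrative.
import random
from nltk.corpus import wordnet

def synonyms(word):
    """Single-word WordNet synonyms of `word`, excluding the word itself."""
    names = {lemma.lower() for syn in wordnet.synsets(word) for lemma in syn.lemma_names()}
    names.discard(word)
    return [n for n in names if "_" not in n]

def random_deletion(words, p=0.1):
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]  # never return an empty sentence

def random_swap(words, n=1):
    words = words[:]
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def synonym_replacement(words, n=1):
    words = words[:]
    candidates = [i for i, w in enumerate(words) if synonyms(w)]
    for i in random.sample(candidates, min(n, len(candidates))):
        words[i] = random.choice(synonyms(words[i]))
    return words

def random_insertion(words, n=1):
    words = words[:]
    for _ in range(n):
        syns = synonyms(random.choice(words))
        if syns:
            words.insert(random.randrange(len(words) + 1), random.choice(syns))
    return words
```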
Table 2. Examples of synonyms of stopwords.

| Data | Count | Synonyms of stopwords (examples) |
|---|---|---|
| SST-2 | 321 | ‘maine’, ‘tween’, ‘coiffe’, ‘sulphur’, ‘nobelium’, ‘inward’, ‘afterward’, ‘backside’, ‘beingness’ |
| SUBJ | 306 | ‘buttocks’, ‘potty’, ‘soh’, ‘iodin’, ‘lav’, ‘nether’, ‘slay’, ‘posterior’, ‘ampere’, ‘kod’ |
| PC | 363 | ‘coif’, ‘crapper’, ‘momma’, ‘oxygen’, ‘lonesome’, ‘commode’, ‘mama’, ‘suffice’, ‘sour’, ‘embody’ |
| CR | 400 | ‘keister’, ‘dress’, ‘rump’, ‘bum’, ‘apiece’, ‘fanny’, ‘oregon’, ‘unity’, ‘polish’, ‘saami’, ‘metre’ |
| TREC | 352 | ‘buns’, ‘bequeath’, ‘practise’, ‘seaport’, ‘astir’, ‘ar’, ‘assume’, ‘boost’, ‘milliampere’ |
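Table 2's word lists can be thought of as the output of a filtering step over stopword synonyms. The following is a minimal sketch of that step, assuming NLTK's English stopword list and WordNet as the synonym source, and treating "OOV" as "absent from the training corpus"; the exact filtering used in the paper may differ.

```python
# Sketch: candidate noise words = WordNet synonyms of stopwords that are
# out-of-vocabulary (OOV) with respect to the training corpus.
# Assumes nltk with the 'stopwords' and 'wordnet' data packages installed.
from nltk.corpus import stopwords, wordnet

def noise_word_candidates(train_sentences):
    corpus_vocab = {w.lower() for s in train_sentences for w in s.split()}
    candidates = set()
    for sw in stopwords.words("english"):
        for syn in wordnet.synsets(sw):
            for lemma in syn.lemma_names():
                word = lemma.lower()
                # keep single-token synonyms that differ from the stopword
                # and never appear in the training corpus
                if "_" not in word and word != sw and word not in corpus_vocab:
                    candidates.add(word)
    return sorted(candidates)
```

Under this view, the counts in Table 2 (e.g., 321 words for SST-2) are simply the sizes of the resulting per-dataset candidate sets.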
Table 3. Example of augmentation samples.

| Data | Sentence |
|---|---|
| Original | Too clumsy in key moments to make a big splash |
| Sample 1 | Too clumsy |
| Sample 2 | Too clumsy in key moments to make |
| Sample 3 | Too |
| Sample 4 | |
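The augmentation step itself only inserts candidate words into the original sentence. Below is a hedged sketch: each augmented sample inserts a few randomly chosen noise words at random positions. The number of samples and insertions per sentence are illustrative (Tables 1 and 3 use four samples per sentence), and `noise_word_candidates` is the hypothetical helper from the sketch after Table 2.

```python
# Sketch of STOP-SYM-style augmentation: create copies of a sentence with
# random noise words inserted at random positions. Counts are illustrative.
import random

def augment(sentence, noise_words, num_samples=4, max_insertions=3):
    words = sentence.split()
    samples = []
    for _ in range(num_samples):
        augmented = words[:]
        for _ in range(random.randint(1, max_insertions)):
            pos = random.randrange(len(augmented) + 1)
            augmented.insert(pos, random.choice(noise_words))
        samples.append(" ".join(augmented))
    return samples

# Example usage with the candidate list from the previous sketch:
# noise_words = noise_word_candidates(train_sentences)
# augment("too clumsy in key moments to make a big splash", noise_words)
```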
Table 4. Overall performance on each dataset.

| Data | Method | 1% | 5% | 10% | 20% | 40% | 60% | 80% | 100% | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| SST-2 | Only original | 55.7 | 84.5 | 86.7 | 88.1 | 89.3 | 90.2 | 90.8 | 91.2 | 84.6 |
| | EDA | 67.1 | 84.2 | 87.1 | 88.3 | 89.2 | 90.3 | 90.5 | 91.3 | 86.0 |
| | AEDA | 77.8 | 86.5 | 88.6 | | 89.8 | 90.8 | 91.1 | 91.2 | 88.1 |
| | TF-Low | 70.7 | 85.1 | 87.6 | 89.0 | 89.8 | 90.6 | 91.0 | 91.4 | 86.9 |
| | STOP | 77.5 | 85.6 | 87.9 | 88.2 | 89.1 | 90.3 | 90.3 | 91.1 | 87.5 |
| | STOP-SYM | | | | | | | | | 88.7 |
| SUBJ | Only original | 87.2 | 92.4 | 94.4 | 95.0 | 95.6 | 96.1 | 96.1 | 96.3 | 94.1 |
| | EDA | 91.7 | 94.1 | 95.1 | 95.0 | 95.8 | 95.6 | 96.5 | 96.0 | 95.0 |
| | AEDA | 91.5 | 93.3 | 94.7 | 95.5 | 95.5 | 95.8 | 95.6 | 96.5 | 94.8 |
| | TF-Low | 87.3 | 93.5 | 94.2 | 95.5 | 95.9 | 95.9 | 96.0 | 96.6 | 94.4 |
| | STOP | 91.0 | 94.7 | 94.9 | 94.2 | | 94.8 | 95.9 | 96.0 | 94.5 |
| PC | Only original | 72.3 | 91.7 | 92.9 | 93.5 | 94.2 | 94.5 | 94.9 | 95.2 | 91.2 |
| | EDA | 88.1 | 92.3 | 93.0 | 93.8 | 94.3 | 94.3 | 94.7 | 95.0 | 93.2 |
| | AEDA | | | 93.3 | 93.7 | 94.8 | 94.8 | 94.9 | 95.4 | 93.8 |
| | TF-Low | 90.6 | | 93.2 | 94.0 | 94.6 | 94.7 | 94.8 | 95.2 | 93.7 |
| | STOP | 89.2 | 92.1 | 92.9 | 93.4 | 94.1 | 94.3 | 94.7 | 95.2 | 93.2 |
| | STOP-SYM | 90.4 | 92.4 | | | | | | | |
| CR | Only original | 63.2 | 66.5 | 68.4 | 71.7 | 84.0 | 84.8 | 86.2 | | 76.6 |
| | EDA | 54.6 | 75.1 | 79.6 | 82.5 | 86.6 | 85.7 | | | 79.7 |
| | AEDA | 63.1 | 74.7 | 79.9 | 86.1 | 84.8 | 87.1 | | | 80.8 |
| | TF-Low | 63.4 | 68.8 | 80.2 | 80.9 | 85.7 | 85.3 | 86.5 | 86.2 | 79.6 |
| | STOP | 61.7 | 76.1 | 80.8 | 81.6 | 85.9 | 85.9 | 86.2 | 86.2 | 80.6 |
| | STOP-SYM | | | | 83.3 | 86.6 | 87.1 | | | |
| TREC | Only original | 37.2 | 60.6 | 79.4 | 81.2 | 94.3 | 95.0 | 95.4 | 95.6 | 79.8 |
| | EDA | 62.6 | 80.8 | 92.3 | 93.4 | 95.2 | 96.4 | 95.4 | 96.0 | 89.0 |
| | AEDA | 62.4 | 84.8 | 92.2 | 94.0 | 95.4 | 95.6 | 95.5 | | 89.6 |
| | TF-Low | 60.6 | 86.0 | 92.0 | 93.7 | 95.7 | 96.4 | 96.0 | 96.1 | 89.6 |
| | STOP | 62.3 | 85.0 | 91.4 | 94.4 | 95.5 | 96.2 | 96.2 | 96.4 | 89.7 |
The bold font indicates the best performance in each test.
Table 5. Performance and the number of augmented samples.

| # of samples | 1% | 10% | 100% | Average |
|---|---|---|---|---|
| 2 | 80.5 | 87.0 | 91.5 | 86.3 |
| 4 | 81.6 | 88.7 | 91.6 | 87.3 |
| 6 | 81.5 | | | |
| 8 | 82.0 | 88.5 | 91.4 | 87.3 |
| 10 | 82.2 | 87.6 | 91.3 | 87.0 |
| 16 | | 88.0 | 91.0 | 87.1 |
The bold font indicates the best performance in each test.
Table 6. Comparison with GPT3Mix at different dataset ratios.

| Data | GPT3Mix (0.1%) | STOP-SYM (0.1%) | GPT3Mix (0.3%) | STOP-SYM (0.3%) | GPT3Mix (1%) | STOP-SYM (1%) |
|---|---|---|---|---|---|---|
| SST-2 | 78.0 | 78.2 | 84.9 | 82.9 | 87.7 | 82.9 |
| TREC | 47.7 | 44.0 | 57.8 | 54.6 | 60.5 | 68.2 |
| SUBJ | 85.4 | 85.9 | 87.5 | 87.1 | 90.6 | 90.3 |
| Average | 70.4 | 69.4 | 76.7 | 74.9 | 79.6 | 80.5 |