Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(2): 83-92

Published online June 25, 2024

https://doi.org/10.5391/IJFIS.2024.24.2.83

© The Korean Institute of Intelligent Systems

Need Text Data Augmentation? Just One Insertion Is Enough

Ho-Seung Kim1 and Jee-Hyong Lee2

1Department of Artificial Intelligence, Sungkyunkwan University, Suwon, Korea
2Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, Korea

Correspondence to: Jee-Hyong Lee (john@skku.edu)

Received: February 20, 2024; Accepted: May 30, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data augmentation generates additional samples to expand a dataset. Modification-based augmentation is commonly used because of its simplicity: it alters the words of existing sentences using simple rules. It offers low complexity and cost because it only needs to scan the dataset sequentially, without complex computation. Despite this simplicity, it has a drawback: it draws only on the corpus of the training dataset, which leads to repeated learning of the same words and limits diversity. In this study, we propose STOP-SYM, a method that is simpler and more effective than previous methods while addressing this drawback. Previous simple data-augmentation methods injected diverse noise through various operations, such as deletion, insertion, replacement, and swapping. The proposed method, STOP-SYM, generates sentences by simply inserting words drawn from the intersection of out-of-vocabulary (OOV) words and synonyms of stopwords. OOV words allow the method to draw on a corpus beyond the training dataset, and synonyms of stopwords act as white noise with minimal impact on training. By inserting these words into sentences, we can easily obtain augmented samples that increase the diversity of the dataset. Compared with recent simple data-augmentation methods, our approach demonstrates superior performance. We also conducted comparative experiments on various text-classification datasets and with a GPT-based model to confirm its superiority.

Keywords: Data augmentation, Natural language processing, Text classification
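To make the mechanism concrete, the following is a minimal Python sketch of the idea described in the abstract, assuming NLTK's stopword list and WordNet as the stopword and synonym sources. The function names, whitespace tokenization, and insertion policy are our own illustrative simplifications, not the authors' implementation.

# A minimal sketch of STOP-SYM as described above: collect WordNet
# synonyms of stopwords, keep only those that are out-of-vocabulary
# (OOV) with respect to the training corpus, and insert them at
# random positions in each sentence.
import random

import nltk
from nltk.corpus import stopwords, wordnet

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def build_candidate_words(train_sentences):
    # Intersection of stopword synonyms and OOV words.
    vocab = {w.lower() for s in train_sentences for w in s.split()}
    candidates = set()
    for sw in stopwords.words("english"):
        for synset in wordnet.synsets(sw):
            for lemma in synset.lemmas():
                word = lemma.name().lower()
                # Keep single-word synonyms that never occur in the corpus.
                if "_" not in word and word not in vocab:
                    candidates.add(word)
    return sorted(candidates)

def augment(sentence, candidates, n_insertions=1):
    # Insert n_insertions random candidate words at random positions.
    tokens = sentence.split()
    for _ in range(n_insertions):
        pos = random.randint(0, len(tokens))
        tokens.insert(pos, random.choice(candidates))
    return " ".join(tokens)

train = ["too clumsy in key moments to make a big splash"]
words = build_candidate_words(train)
print(augment(train[0], words))

Entries in Table 2 such as 'nobelium' (which appears to be a WordNet synonym of the stopword 'no') illustrate the kind of words this intersection yields: semantically tied to stopwords, yet almost never present in a training corpus.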

This work was supported in part by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019-0-00421, Artificial Intelligence Graduate School Program (Sungkyunkwan University)) and in part by the MSIT (Ministry of Science and ICT), Korea, under the ICT Creative Consilience Program (IITP-2023-2020-0-01821) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

No potential conflict of interest relevant to this article was reported.

Ho-Seung Kim received the B.S. degree in physics from the Korea Military Academy, Seoul, South Korea, in 2013, and the M.S. degree in artificial intelligence from Sungkyunkwan University, Suwon, South Korea, in 2022. He is currently pursuing the Ph.D. degree at Sungkyunkwan University, Suwon, South Korea. His research interests include machine learning, intelligent systems, natural language processing, and sentiment analysis.

Jee-Hyong Lee received the B.S., M.S., and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 1993, 1995, and 1999, respectively. From 2000 to 2002, he was an International Fellow at SRI International, USA. In 2002, he joined Sungkyunkwan University, Suwon, South Korea, as a Faculty Member. His research interests include fuzzy theory and applications, intelligent systems, and machine learning.


Figure 1. Visualization of the diversity comparison between EDA and STOP-SYM through t-SNE.

Figure 2. Comparison of experimental results showing the effect of OOV words.

Figure 3. Comparison of experimental results showing the effect of synonyms.

Table 1. Performance of each individual operation in EDA.

Method        1%     10%    100%   Average
EDA           67.1   87.1   91.3   81.8
Deletion      67.5   86.7   91.0   81.7
Swap          67.4   86.1   90.8   81.4
Replacement   67.2   86.3   90.9   81.4
Insertion     67.4   87.0   91.3   81.9

Bold font indicates the best performance in each test. We used the SST-2 dataset to generate four samples for each sentence and fine-tuned BERT on the augmented dataset for classification. We performed the experiments with various fractions of the dataset: 1%, 10%, and 100%. In the 1% case, we used only 1% of the dataset for augmentation and model training.
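For concreteness, the protocol in this note could be reproduced along the following lines with the Hugging Face datasets and transformers libraries. This is a hedged sketch, not the authors' code: the hyperparameters (epochs, batch size, sequence length) are placeholders, and augment() and words refer to the illustrative insertion sketch given after the abstract.

# Take a fraction of SST-2, add augmented copies, fine-tune BERT.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# A 1% fraction of the training set, mirroring the 1% column above.
sst2 = load_dataset("glue", "sst2")["train"]
fraction = sst2.shuffle(seed=0).select(range(len(sst2) // 100))

def expand(batch):
    # Keep each original sentence and add four augmented copies,
    # as in the "four samples for each sentence" setting above.
    texts, labels = [], []
    for text, label in zip(batch["sentence"], batch["label"]):
        texts.append(text)
        labels.append(label)
        for _ in range(4):
            texts.append(augment(text, words))  # sketch from the abstract
            labels.append(label)
    return {"sentence": texts, "label": labels}

augmented = fraction.map(expand, batched=True,
                         remove_columns=fraction.column_names)
encoded = augmented.map(
    lambda b: tokenizer(b["sentence"], truncation=True,
                        padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded,
)
trainer.train()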


Table 2. Examples of stopword synonyms.

Data    N     Synonyms of stopwords
SST-2   321   'maine', 'tween', 'coiffe', 'sulphur', 'nobelium', 'inward', 'afterward', 'backside', 'beingness'
SUBJ    306   'buttocks', 'potty', 'soh', 'iodin', 'lav', 'nether', 'slay', 'posterior', 'ampere', 'kod'
PC      363   'coif', 'crapper', 'momma', 'oxygen', 'lonesome', 'commode', 'mama', 'suffice', 'sour', 'embody'
CR      400   'keister', 'dress', 'rump', 'bum', 'apiece', 'fanny', 'oregon', 'unity', 'polish', 'saami', 'metre'
TREC    352   'buns', 'bequeath', 'practise', 'seaport', 'astir', 'ar', 'assume', 'boost', 'milliampere'

Table 3. Examples of augmented samples.

Data       Sentence
Original   Too clumsy in key moments to make a big splash
Sample 1   Too clumsy unity in key moments to oasis make a big splash front
Sample 2   Too clumsy in key moments to make hundred a big splash
Sample 3   Too maine clumsy inwards in key coiffe moments to make a big splash
Sample 4   Hera too clumsy in key moments to make a big splash nobelium

Table 4. Overall performance on each dataset.

Data    Method          1%     5%     10%    20%    40%    60%    80%    100%   Overall
SST-2   Only original   55.7   84.5   86.7   88.1   89.3   90.2   90.8   91.2   84.6
        EDA             67.1   84.2   87.1   88.3   89.2   90.3   90.5   91.3   86.0
        AEDA            77.8   86.5   89.0   88.6   89.8   90.8   91.1   91.2   88.1
        TF-Low          70.7   85.1   87.6   89.0   89.8   90.6   91.0   91.4   86.9
        STOP            77.5   85.6   87.9   88.2   89.1   90.3   90.3   91.1   87.5
        STOP-SYM        81.6   87.3   88.7   89.2   90.0   91.0   91.2   91.6   88.8

SUBJ    Only original   87.2   92.4   94.4   95.0   95.6   96.1   96.1   96.3   94.1
        EDA             91.7   94.1   95.1   95.0   95.8   95.6   96.5   96.0   95.0
        AEDA            91.5   93.3   94.7   95.5   95.5   95.8   95.6   96.5   94.8
        TF-Low          87.3   93.5   94.2   95.5   95.9   95.9   96.0   96.6   94.4
        STOP            91.0   94.3   94.7   94.9   94.2   94.8   95.9   96.0   94.5
        STOP-SYM        91.9   94.3   95.3   95.6   95.9   96.2   96.1   96.7   95.3

PC      Only original   72.3   91.7   92.9   93.5   94.2   94.5   94.9   95.2   91.2
        EDA             88.1   92.3   93.0   93.8   94.3   94.3   94.7   95.0   93.2
        AEDA            90.7   92.6   93.3   93.7   94.8   94.8   94.9   95.4   93.8
        TF-Low          90.6   92.6   93.2   94.0   94.6   94.7   94.8   95.2   93.7
        STOP            89.2   92.1   92.9   93.4   94.1   94.3   94.7   95.2   93.2
        STOP-SYM        90.4   92.4   93.7   94.3   95.1   95.0   95.2   95.5   94.0

CR      Only original   63.2   66.5   68.4   71.7   84.0   84.8   86.2   87.8   76.6
        EDA             54.6   75.1   79.6   82.5   86.6   86.0   87.3   85.7   79.7
        AEDA            63.1   74.7   79.9   84.5   86.1   86.0   84.8   87.1   80.8
        TF-Low          63.4   68.8   80.2   80.9   85.7   85.3   86.5   86.2   79.6
        STOP            61.7   76.1   80.8   81.6   85.9   85.9   86.2   86.2   80.6
        STOP-SYM        63.2   76.6   81.0   83.3   87.5   86.0   86.6   87.1   81.4

TREC    Only original   37.2   60.6   79.4   81.2   94.3   95.0   95.4   95.6   79.8
        EDA             62.6   80.8   92.3   93.4   95.2   96.4   95.4   96.0   89.0
        AEDA            62.4   84.8   92.2   94.0   95.4   95.6   95.5   96.5   89.6
        TF-Low          60.6   86.0   92.0   93.7   95.7   96.4   96.0   96.1   89.6
        STOP            62.3   85.0   91.4   94.4   95.5   96.2   96.2   96.4   89.7
        STOP-SYM        63.9   87.1   93.6   95.2   96.1   96.5   96.4   96.5   90.7

Bold font indicates the best performance in each test.


Table 5. Performance versus the number of augmented samples.

# of samples   1%     10%    100%   Average
2              80.5   87.0   91.5   86.3
4              81.6   88.7   91.6   87.3
6              81.5   89.0   91.9   87.4
8              82.0   88.5   91.4   87.3
10             82.2   87.6   91.3   87.0
16             82.4   88.0   91.0   87.1

Bold font indicates the best performance in each test.


Table 6. Comparison with GPT3Mix.

Ratio      0.1%                 0.3%                 1%
Data       GPT3Mix   STOP-SYM   GPT3Mix   STOP-SYM   GPT3Mix   STOP-SYM
SST-2      78.0      78.2       84.9      82.9       87.7      82.9
TREC       47.7      44.0       57.8      54.6       60.5      68.2
SUBJ       85.4      85.9       87.5      87.1       90.6      90.3
Average    70.4      69.4       76.7      74.9       79.6      80.5
