International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(3): 244-258

Published online September 25, 2023

https://doi.org/10.5391/IJFIS.2023.23.3.244

© The Korean Institute of Intelligent Systems

Online Unstructured Data Analysis Models with KoBERT and Word2vec: A Study on Sentiment Analysis of Public Opinion in Korean

Changwon Baek1, Jiho Kang2, and SangSoo Choi1

1Technological Convergence Center, Korea Institute of Science and Technology (KIST), Seoul, Korea
2Institute of Engineering Research, Korea University, Seoul, Korea

Correspondence to: Changwon Baek (baekcw@kist.re.kr)

Received: February 21, 2023; Revised: June 21, 2023; Accepted: July 5, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Online news articles and comments play a vital role in shaping public opinion, and numerous studies have used them as raw data for online opinion analysis. Sentiment analysis of public opinion based on bidirectional encoder representations from transformers (BERT) has recently attracted significant attention. However, owing to BERT's limited linguistic versatility and low accuracy in domains with insufficient training data, applying it to Korean is challenging. In addition, conventional public opinion analysis focuses on term frequency, so low-frequency words are likely to be excluded because their importance is underestimated. This study aimed to address these issues and facilitate the analysis of public opinion in Korean news articles and comments. We propose a method that combines KoBERT, an improvement of BERT optimized for the Korean language, with word2vec, which overcomes the limits of word-frequency-centered analysis. Naver news articles and comments were analyzed using a sentiment classification model built on the KoBERT framework. The experiment demonstrated a sentiment classification accuracy of over 90%, yielding faster and more precise results than conventional methods, and word2vec identified words with a low frequency of occurrence but high relevance.

Keywords: KoBERT, Word2vec, Public opinion analysis, Sentiment classification

With the development of information and communication technology, online media have become an essential means of disseminating information. The digital divide has also narrowed, as anyone can easily access online news. Readers not only read news articles but also actively express their opinions by leaving comments. Such online comments enable the exchange of various viewpoints on an issue, which has a lasting effect on public opinion. Public opinion formed through this exchange of ideas influences administrative, legislative, and judicial policy decisions in various ways, including through public petitions [1]. Therefore, many studies have been conducted to collect online comments and analyze public opinion. Recently, neural network models using bidirectional encoder representations from transformers (BERT) have been applied to these studies [2]. However, BERT is an English-language-based model that is not well suited to Korean [3, 4]; applying it to Korean leads to inaccurate results owing to errors in Korean morphological analysis and word embedding. In addition, previous studies have used several algorithms, including term frequency-inverse document frequency (TF-IDF), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), to analyze comments. Although TF-IDF is simple to implement and provides a relatively accurate frequency-based analysis, it does not consider the order or context of words, making it difficult to fully capture their meaning. By contrast, LSA considers the semantic similarity between words and identifies meaning more accurately, and LDA analyzes the topics to which a comment belongs, enabling it to cover multiple topics; however, both LSA and LDA have high computational costs and are difficult to interpret. Word2vec calculates the semantic similarity between words by representing them as vectors and considering their order and context, yielding a more accurate analysis.
This method is also cost-effective and makes it less likely that important words will be excluded because of their low frequency. It further helps prevent a large volume of comments from a particular group from being mistaken for dominant public opinion when informed decisions are made on the basis of public opinion [5–7]. Various research methods have attempted to address these issues, such as using supervised and unsupervised learning to remove duplicate comments or utilizing comment cluster analysis [8].

The main contributions of this paper can be summarized as follows:

  • The objective of this study is to develop a specialized sentiment analyzer for online opinion analysis in Korean. To this end, we created an online public opinion sentiment analysis model using KoBERT, a transformer language model pretrained on a large Korean corpus that overcomes the linguistic limitations of BERT. The model analyzes issues that the researcher has labeled as positive or negative.

  • Online public opinion analysis is often limited by its reliance on word frequency, which may exclude important but infrequently used opinion words. To overcome this limitation, we used the word embedding algorithm word2vec. This algorithm captures the correlations between words by locating each word in a high-dimensional vector space and identifying the words most closely related to a target word. Using this algorithm, we aim to improve the accuracy and comprehensiveness of online opinion analysis.

  • Developing a highly accurate sentiment analyzer can significantly enhance the efficiency and speed of online public opinion analyses. Such analyses can offer valuable insights for shaping policies on political, economic, and social issues. Hence, policymakers and researchers who aim to make informed decisions should prioritize the development of an accurate sentiment analyzer.

Sections 2–5 of this paper present a literature review, data collection and analysis, experimental results, and conclusions.

2.1 News Comment and Online Opinion

Social media websites provide opportunities for active participation in democratic processes [5]. Online news is an example of a social media platform in which readers express their opinions and feelings through comments. News comments can be considered a form of multilateral dialogue [6]. Public opinion is formed by observing, hearing, and perceiving others’ opinions [7]. In other words, an online zone in which anyone can easily communicate with a diverse group of individuals and recognize, accept, and critique the viewpoints of others can be viewed as an ideal environment for forming public opinion. Therefore, news and comments are important data for public opinion analysis, and the number of studies on developing algorithms or tools to analyze them is increasing [8].

2.2 Text Mining

The extraction of useful information from textual data is known as text mining. It is an effective research method for deriving meaningful and valuable information by discovering hidden relationships or patterns in data and analyzing the morphemes contained in unstructured text [9]. It can be used to identify the main latent content of a text [10]. Its applications include document classification, clustering, and summarization, as well as information extraction and sentiment analysis. Decision trees, recurrent neural networks (RNNs), and BERT are text analysis methods that employ supervised learning; clustering, LDA, the dynamic multitopic model (DMA), and the expressed agenda model are examples of unsupervised learning methods. Recently, methodologies such as attentive long short-term memory (LSTM) and tensor fusion networks have been studied [11]. In particular, sentiment analysis is an important method for understanding online public opinion and has been used in numerous studies [12–15]. Natural language processing (NLP) models include document classification [16], syntax analysis [17], and machine translation [18, 19], which focus on words or phrases. TF-IDF, which is typically used in text mining, including online opinion analysis, is an algorithm that weighs word frequency against inverse document frequency [20, 21]. Because high-frequency words are given more weight, low-frequency words are often excluded from analysis. To overcome this limitation, several studies have attempted to improve accuracy using probabilistic methods [22].
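The TF-IDF weighting described above can be sketched in a few lines; the corpus and tokenization here are purely illustrative, not the study's data.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for pre-tokenized documents: tf is the raw term
    count in a document, and idf is log(N / df), where df is the number
    of documents containing the term."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return [{term: count * math.log(n_docs / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

# Toy corpus: a term appearing in every document gets idf = 0, so its
# weight vanishes no matter how frequent it is.
docs = [["police", "reform", "reform"],
        ["police", "investigation"],
        ["police", "authority"]]
weights = tf_idf(docs)
```

This illustrates the limitation noted above: weights depend only on counts, so word order and context are invisible to the model.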

2.3 KoBERT and word2vec

Although existing deep learning methodologies initialize weights to random values when creating a model, pretraining leverages weights learned on other problems as initial values; a problem that uses pretrained weights in this manner is called a downstream task. BERT improves downstream task performance by using language models for pretraining [23]. It has demonstrated excellent performance in sentence classification for various applications, including sentence sentiment analysis [24], major news topic classification [25], and document classification [26]. In this study, we used KoBERT, developed by the Korean company SKT to improve Google BERT's performance on Korean; a case study demonstrated a 2.6% performance improvement over BERT [27]. Word2vec, in turn, places words in a high-dimensional vector space, enabling the associations between words to be evaluated based on their proximity. It has two model architectures: continuous bag-of-words (CBOW) and skip-gram (SG). The CBOW model predicts a target word from its surrounding words, whereas the SG model predicts the surrounding words from a target word [28, 29].
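The difference between the two architectures can be illustrated by how their training examples are generated (a simplified sketch; real implementations add subsampling and negative sampling):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (target, context) pair asks the model to predict
    one neighboring word from the center word."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the whole context window jointly predicts the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "police", "reform", "bill"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
```

With a window of 1, "police" produces the SG pairs ("police", "the") and ("police", "reform"), while CBOW forms the single example (["the", "reform"] → "police").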

This study followed the procedure depicted in Figure 1 for data collection, preprocessing, and analysis of online comments.

3.1 Data Collection

The data used in this study were collected from Naver News, a major Korean web portal. We developed a Python-based web crawler program to collect and store the data. Using the BeautifulSoup library, we searched the Naver News webpages for “improvement of investigative authority” and parsed the HTML and CSS of each page to extract the title, content, and comments. As shown in Figures 2 and 3, the collected data were classified into news article and comment tables and stored in SQLite, a relational database.
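The storage step can be sketched with Python's built-in sqlite3 module. The table and column names below are hypothetical, since the paper shows its schema only as figures:

```python
import sqlite3

# Hypothetical schema for the two tables in Figures 2 and 3.
conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
conn.executescript("""
CREATE TABLE news_article (
    article_id INTEGER PRIMARY KEY,
    title      TEXT,
    content    TEXT,
    press      TEXT,
    posted_at  TEXT
);
CREATE TABLE news_comment (
    comment_id INTEGER PRIMARY KEY,
    article_id INTEGER REFERENCES news_article(article_id),
    body       TEXT,
    likes      INTEGER
);
""")

# Parameterized inserts, as a crawler would issue per parsed page.
conn.execute("INSERT INTO news_article (title, content, press, posted_at) "
             "VALUES (?, ?, ?, ?)",
             ("example title", "example body", "example press", "2019-05-01"))
conn.execute("INSERT INTO news_comment (article_id, body, likes) "
             "VALUES (?, ?, ?)",
             (1, "example comment", 3))
conn.commit()
```

Separating articles and comments into two tables keyed by article ID mirrors the one-to-many relationship between a news story and its comment thread.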

3.2 Morpheme Analysis

The unstructured text collected by the crawler must be segmented into morphemes before the comments can be analyzed. However, morphological analysis presents significant challenges for Korean comments. First, most morpheme analysis libraries were developed for English, making them difficult to adapt to Korean. Second, many exceptions exist to the spacing rules of Hangul, including those set by the National Institute of Korean Language. Third, because endings and postpositions change word forms, splitting text on spacing alone yields inaccurate results: a space-delimited unit is not necessarily a single word. To address these issues, this study analyzes morphemes using the Mecab-ko library of the “Eunjeon Hannip Project,” a widely used Korean morpheme analyzer.

3.3 word2vec Modeling

For a computer to analyze text data, the text must first be converted into a mathematically processable form, a process known as “embedding.” In this study, word2vec was used to embed the text data. word2vec, an algorithm released by Google, has the benefit of fast learning speed. It places highly related words in similar positions in the vector space based on the frequency with which they appear in similar contexts. The hyperparameters used to generate the word2vec model are listed in Table 1. Although both CBOW and SG are representative learning models, we used SG in this study because of its excellent performance in many cases [30, 31].

3.4 Preprocessing Synonyms

When attempting to select a target word and extract its most relevant neighboring words using word2vec, accurate results are difficult to obtain at the initial stage. This is because the morpheme analyzer (Mecab-ko) does not, by itself, handle stop words, synonyms, compound nouns, and neologisms appropriately. Therefore, unnecessary josa (postpositions) and other stop words must be removed, typos corrected, and synonyms merged; in addition, neologisms were added to the morpheme analyzer's user dictionary. In this study, preprocessing was applied only to the high-relevance words surrounding the target word in the word2vec model, which was trained on all comments. Extracting and refining only the words surrounding the target word reduces the preprocessing time, thereby improving efficiency.
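A minimal sketch of this refinement step, with illustrative (not the study's actual) stop-word and synonym lists:

```python
# Illustrative preprocessing of word2vec neighbor lists: drop stop words
# (e.g. josa), then collapse synonym variants onto one canonical form.
STOP_WORDS = {"은", "는", "이", "가", "을", "를"}   # example josa
SYNONYMS = {"경찰들": "경찰", "짭새": "경찰"}       # variant -> canonical form

def refine(neighbors):
    """Remove stop words and merge synonyms in a list of neighbor words."""
    refined = []
    for word in neighbors:
        if word in STOP_WORDS:
            continue
        refined.append(SYNONYMS.get(word, word))
    return refined

cleaned = refine(["경찰", "은", "경찰들", "수사"])
```

Because only the neighbors of the target word pass through this step, the dictionaries stay small and the refinement cost is bounded by the neighbor list, not the whole corpus.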

3.5 Related Word Analysis

The word2vec-based final analysis model extracts keywords related to the target word from the preprocessed morpheme list. The extracted words are represented in the vector space, and the degree of association between a target word and a neighboring word is calculated as the cosine similarity of the corresponding vectors. The cosine similarity is the dot product of two vectors divided by the product of their magnitudes, as shown in Eq. (1).

$$\mathrm{Similarity} = \cos(\theta) = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2}\,\sqrt{\sum_{i=1}^{n} (B_i)^2}}. \tag{1}$$

Here, θ is the angle between the two word vectors A and B, and A · B denotes their dot product. Instead of interpreting the meaning of the extracted keywords on their own, we analyzed readers' interest in the news articles containing those words. Naver enables readers to express their feelings about news articles with icons such as “like,” “warm,” “sad,” “angry,” and “I want a follow-up article.” We can therefore objectively determine whether a news article attracts great interest based on the number of these emotional expressions.
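Eq. (1) translates directly into code (an illustrative sketch over plain Python lists):

```python
import math

def cosine_similarity(a, b):
    """Eq. (1): dot(A, B) divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Because the norms appear in the denominator, the measure depends only on vector direction, which is why a low-frequency word can still score a high similarity to the target word.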

4.1 Time Series Analysis

We searched for the phrase “improvement of investigative authority” in Naver News from May 9, 2017, to June 6, 2022. A total of 39,697 news articles and 4,426,875 comments were collected using a crawler, including information such as “news title,” “news posting date,” “media company,” and “comments.” The number of monthly news articles and comments showed similar trends, as illustrated in Table 2 and Figure 4. The correlation coefficient between the two variables is 0.8634, indicating a strong positive correlation. This is consistent with other studies, indicating that news articles and comments have the same tendency and potential to shape online public opinion [6].
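The 0.8634 figure is the standard Pearson correlation coefficient applied to the monthly counts; a minimal sketch (the short series below is illustrative, not Table 2's full data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative monthly counts (articles, comments).
articles = [979, 349, 1028, 611]
comments = [77490, 22979, 57043, 22361]
r = pearson(articles, comments)
```

Values near +1 indicate that months with many articles also tend to have many comments, which is the pattern reported above.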

4.2 Development of a KoBERT Classification Model

A total of 233,352 labeled samples were used, of which 175,012 formed the training set and 58,340 the test set. To prevent overfitting during model training, the optimal stopping point was determined using the EarlyStopping callback. As shown in Figures 5 and 6, the optimal epoch, validation accuracy, and validation loss are 4, 0.9051, and 0.2278, respectively. The performance of the proposed model is presented in Table 3; it performed similarly to or slightly better than the KoBERT baselines on GitHub (0.9010, 0.8963) [32, 33]. The KoBERT-based model classified the sentiment (positive or negative) of all 4,426,875 comments with more than 90% accuracy and without researcher intervention, indicating that the sentiment of online public opinion can be grasped quickly and accurately.
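As a consistency check, the F1 score in Table 3 follows directly from its precision and recall:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall from Table 3; the result matches Table 3's F1
# of 0.8750 (to rounding).
f1 = f1_score(0.8981, 0.8529)
```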

4.3 News Article Content Sentiment Analysis

News articles containing the target word “Police” were extracted. Of the 39,697 articles in total, 6,465 satisfied this condition. After applying the KoBERT-based sentiment classification model to these police-related articles, 3,164 (49%) were classified as positive and 3,301 (51%) as negative. As illustrated in Figure 7, the numbers of positive and negative responses to articles on the “improvement of investigative authority” containing the word “police” are therefore comparable.

4.4 Comment Sentiment Analysis

After excluding comments removed for policy violations and deleted comments, 2,903,069 comments were analyzed. As illustrated in Figure 8, negative comments constituted an overwhelming majority, accounting for 86% of all comments analyzed, whereas positive comments constituted only 14%.

4.5 Comparing News Article and Comment Sentiment Analysis

The sentiment analysis of news articles on the “improvement of investigative authority” yielded 49% positive and 51% negative articles. News articles are currently written from both conservative and liberal perspectives, and the similar proportions of positive and negative sentiment suggest that conservative and liberal media outlets publish similar numbers of articles [34]. However, when we analyzed the comments, the majority were negative (19% positive and 81% negative). This is consistent with the argument that readers are more interested in negativity and that anonymity drives negative comments [35, 36].

4.6 Top Related Word News Article Analysis

The words most closely related to “Police” in the content of the collected articles, identified using word2vec, are presented in Tables 4 and 5 and Figure 9. Based on cosine similarity, “Police” correlates most strongly with “National Police Agency” (0.7511), “Police Commissioner” (0.7162), “Police Officers” (0.6993), “Investigation” (0.6845), and “Police Station” (0.6480). Notably, “Police Precinct” occurred only 590 times yet ranked sixth with a similarity of 0.6445. This distinguishes the present approach to understanding public opinion from the term frequency-based approach used in previous studies [37].

4.7 Analysis of Important News Articles

Table 6 lists the titles of the news articles that, among those containing the keywords in Table 4, attracted the greatest reader interest. These articles are considered to have a significant influence on shaping online public opinion. Naver readers express their interest in an article by selecting emoticons at the bottom of the news content, and in this study the number of emoticons was used as a measure of interest to identify the most popular articles. Consequently, important news articles that were previously identified subjectively by researchers can now be extracted using objective criteria.

4.8 Top Related Word Comments Analysis

With “Police” as the target word, word2vec was used to extract highly relevant keywords, and for each keyword the comments with the most sympathetic responses were identified. As illustrated in Table 7 and Figure 10, the term frequency of “Frontline” (1,664) is less than half that of “Autonomous police” (12,491). However, as listed in Table 9, the sympathetic responses to the top comment for “Autonomous police” (2,646) are fewer than those for “Frontline” (2,905). Employing word2vec thus enables more accurate identification of words that strongly influence online public opinion despite a low frequency of occurrence.

Online opinion analysis has the advantage of being faster, closer to real time, and more cost-effective than conventional survey methods. The search term “improvement of investigative authority” was used to gather and analyze online news articles and comments: a crawler collected the unstructured text data, and a KoBERT neural network model was trained for sentiment classification. The numbers of news articles and comments showed a strong positive correlation (0.8634), indicating that online news articles and comments rise and fall together; in other words, news articles of high social interest generate stronger online opinions as the number of comments increases. The accuracy of the KoBERT-based text sentiment classification model exceeded 90%. Sentiment classification was performed on the 6,465 news articles containing the word “Police” in their content; 3,164 (49%) were positive and 3,301 (51%) negative, with no significant difference between the two. Among the analyzed comments, 118,362 (81%) were classified as negative and 27,351 (19%) as positive. Owing to the artificial intelligence model, this sentiment classifier for online public opinion is faster and more accurate than existing approaches that require the researcher's direct classification. When word2vec was used to identify words highly associated with “Police” in the article content, “National Police Agency” (0.7511), “Police commissioner” (0.7162), “Police officers” (0.6993), “Investigation” (0.6845), and “Police station” (0.6480) had the highest values. Similarly, words highly related to “Police” were extracted from news comments, such as “Closing a case” (0.7657), “Police officers” (0.7459), and “Police precinct” (0.7058).
Related keywords with relatively low frequencies, which may have been excluded in previous studies focused solely on term frequency, were also captured. Using word2vec, we can identify words that would otherwise go unnoticed because of their low frequency of occurrence, even though they significantly affect online opinion formation. Consequently, online public opinion can be analyzed quickly and accurately by collecting and analyzing online news articles and comments with KoBERT and word2vec. This study aimed to develop a specialized sentiment analyzer for online opinion analysis in Korean using the word2vec algorithm. However, the accuracy of the word2vec model is unknown because no validation data were available during its training. In addition, word2vec has limitations in sentence-level natural language processing because it does not consider sentence context. To overcome these limitations, document-level embedding algorithms such as doc2vec should be combined with it.

Furthermore, future research should collect news articles and comments from portals other than Naver and analyze the differences between portal sites, leading to a more accurate understanding of online public opinion.

Fig. 1.

Data collection and analysis process.


Fig. 2.

News article table.


Fig. 3.

News comment table.


Fig. 4.

Number of news articles and comments per month.


Fig. 5.

Model training result.


Fig. 6.

Model training accuracy and loss.


Fig. 7.

News article content sentiment analysis.


Fig. 8.

News comment sentiment analysis.


Fig. 9.

News article content WordCloud.


Fig. 10.

News comments WordCloud.


Table 1. Word2vec hyperparameters.

Parameter | Value
Minimum frequency | 5
Layer size | 100
Learning rate | 0.15
Iteration | 100
Window size | 3
Sub-sampling | 10^-2
Negative sampling | 15

Table 2. Number of news articles and comments per month.

Year.Month | Articles | Comments | Year.Month | Articles | Comments
2017.05 | 979 | 77,490 | 2019.12 | 2,357 | 275,419
2017.06 | 349 | 22,979 | 2020.01 | 2,285 | 258,179
2017.07 | 1,028 | 57,043 | 2020.02 | 765 | 152,093
2017.08 | 611 | 22,361 | 2020.03 | 165 | 44,353
2017.09 | 299 | 14,271 | 2020.04 | 0 | 570
2017.10 | 361 | 23,407 | 2020.05 | 297 | 23,221
2017.11 | 290 | 12,689 | 2020.06 | 401 | 42,001
2017.12 | 237 | 8,248 | 2020.07 | 0 | 638
2018.01 | 806 | 69,150 | 2020.08 | 726 | 92,750
2018.02 | 161 | 15,865 | 2020.09 | 651 | 51,418
2018.03 | 797 | 71,034 | 2020.10 | 0 | 141
2018.04 | 404 | 50,118 | 2020.11 | 299 | 38,804
2018.05 | 243 | 11,142 | 2020.12 | 1,546 | 235,718
2018.06 | 1,651 | 59,189 | 2021.01 | 0 | 594
2018.07 | 408 | 11,988 | 2021.02 | 0 | 27
2018.08 | 183 | 9,788 | 2021.03 | 2,021 | 184,579
2018.09 | 100 | 6,035 | 2021.04 | 0 | 223
2018.10 | 243 | 7,662 | 2021.05 | 0 | 28
2018.11 | 462 | 31,200 | 2021.06 | 732 | 52,591
2018.12 | 288 | 44,974 | 2021.07 | 0 | 131
2019.01 | 439 | 26,601 | 2021.08 | 0 | 22
2019.02 | 749 | 69,626 | 2021.09 | 240 | 12,317
2019.03 | 1,827 | 229,259 | 2021.10 | 0 | 149
2019.04 | 1,908 | 239,146 | 2021.11 | 0 | 16
2019.05 | 3,664 | 288,550 | 2021.12 | 207 | 12,855
2019.06 | 1,123 | 104,195 | 2022.01 | 0 | 781
2019.07 | 1,219 | 80,566 | 2022.02 | 0 | 19
2019.08 | 739 | 162,134 | 2022.03 | 548 | 77,496
2019.09 | 1,317 | 376,174 | 2022.04 | 0 | 380
2019.10 | 2,827 | 567,929 | 2022.05 | 0 | 28
2019.11 | 658 | 81,329 | 2022.06 | 87 | 17,192

Table 3. Model performance.

Result | Value
Accuracy | 0.9853
Loss | 0.0474
Validation accuracy | 0.9051
Validation loss | 0.2278
Recall | 0.8529
Precision | 0.8981
F1 | 0.8750

Table 4. News article word similarity and frequencies.

Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency
1 | 경찰청 (National Police Agency) | 0.7511 | 2.0024 | 16.0236 | 3.9065 | 18,251
2 | 청장 (Police commissioner) | 0.7162 | 1.9857 | 16.2410 | 3.9301 | 24,731
3 | 경찰관 (Police officers) | 0.6993 | 2.4249 | 20.2057 | 4.7727 | 6,307
4 | 수사 (Investigation) | 0.6845 | 1.8044 | 13.9328 | 3.5361 | 317,777
5 | 경찰서 (Police station) | 0.6480 | 2.6114 | 20.7929 | 5.0790 | 3,442
6 | 지구대 (Police precinct) | 0.6445 | 3.3773 | 27.9833 | 6.6381 | 590
7 | 자치경찰 (Autonomous police) | 0.6413 | 2.6127 | 21.1340 | 5.1316 | 22,247
8 | 국수본 (National Investigation Headquarters) | 0.6225 | 2.6505 | 20.4286 | 5.2038 | 2,526
9 | 관서 (a government office) | 0.6093 | 3.3562 | 27.0328 | 6.5523 | 373
10 | 치안 (Policing) | 0.5937 | 3.0960 | 24.5606 | 6.0993 | 4,520

Table 5. News article word similarity (2-year cycle).

Rank | Word (2017–2018) | Vector | Word (2019–2020) | Vector | Word (2021–2022) | Vector
1 | 청장 (Police commissioner) | 0.6871 | 경찰청 (National Police Agency) | 0.7754 | 종결 (Closing a case) | 0.6683
2 | 수사 (Investigation) | 0.6764 | 자치경찰 (Autonomous police) | 0.7309 | 수사 (Investigation) | 0.6506
3 | 경찰청 (National Police Agency) | 0.6568 | 청장 (Police commissioner) | 0.7275 | 경찰청 (National Police Agency) | 0.6373
4 | 검찰 (Prosecutor) | 0.6449 | 경찰관 (Police officers) | 0.6738 | 경찰관 (Police officers) | 0.5923
5 | 경찰관 (Police officers) | 0.6135 | 수사 (Investigation) | 0.6729 | 청장 (Police commissioner) | 0.5692
6 | 경찰서 (Police station) | 0.5890 | 국수본 (National Investigation Headquarters) | 0.6418 | 경찰대 (Korea National Police University) | 0.5678
7 | 서장 (Police chief) | 0.5788 | 지구대 (Police precinct) | 0.6412 | 치안 (Policing) | 0.5616
8 | 이** (Lee**) | 0.5593 | 치안 (Policing) | 0.6225 | 검경 (Prosecutors and police) | 0.5583
9 | 일선 (Front line) | 0.5519 | 민** (Min**) | 0.6211 | 자치경찰 (Autonomous police) | 0.5567
10 | 지구대 (Police precinct) | 0.5461 | 경찰대 (Korea National Police University) | 0.6158 | 경찰서 (Police station) | 0.5518

Table 6. Important news article titles.

Related word | News article title | Number of interests
경찰청 (National Police Agency) | 경찰청 전직원에 “검찰 조국수사 비판 與보고서 읽어라” (To all National Police Agency employees: “Read the ruling-party report criticizing the prosecution’s Cho Kuk investigation”) | 22,191
청장 (Police commissioner) | 정권 수사한 ‘***참모들’ 모두 유배 보내버렸다 (All ‘*** staffers’ who investigated the regime were sent into exile) | 23,562
경찰관 (Police officers) | 대통령 비판 전단 돌리던 50대 주부... 경찰 신분증 없다고 등뒤로 수갑 채워 (A housewife in her 50s handing out flyers criticizing President *... handcuffed behind her back for lacking a police ID) | 52,028
수사 (Investigation) | 대통령 비판 전단 돌리던 50대 주부... 경찰 신분증 없다고 등뒤로 수갑 채워 (A housewife in her 50s handing out flyers criticizing President *... handcuffed behind her back for lacking a police ID) | 52,028
경찰서 (Police station) | 여당 윤석열 검찰과 제2의 전쟁 나섰다 (Ruling party launches a second war against Yoon Suk-yeol’s prosecution) | 8,385
지구대 (Police precinct) | ‘**논란’ 들끓는데...* 대통령 연설문에 “공정과 정의” (As the ‘** controversy’ rages, President *’s speech invokes “fairness and justice”) | 3,210
자치경찰 (Autonomous police) | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” (** “The *** government has brought South Korea into the ranks of developed countries”) | 19,865
국수본 (National Investigation Headquarters) | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” (** “The *** government has brought South Korea into the ranks of developed countries”) | 19,865
관서 (a government office) | 관서 “檢개혁요구 커지는 현실 성찰... 수사관행 개혁돼야” (종합) (President *: “Reflecting on the growing demand for prosecution reform... investigative practices should be reformed” (Roundup)) | 4,903
치안 (Policing) | *** 검찰총장의 응수 “검찰 개혁 반대한 적 없다” (Prosecutor General *** responds, “I have never opposed prosecution reform”) | 11,102

Table 7. News comment vector values and frequencies.

Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency
1 | 종결 (Closing a case) | 0.7657 | 2.0360 | 16.0372 | 4.2283 | 7,862
2 | 경찰관 (Police officers) | 0.7459 | 2.2646 | 17.5176 | 4.7085 | 2,926
3 | 지구대 (Police precinct) | 0.7058 | 2.7853 | 21.7033 | 5.7627 | 784
4 | 파출소 (Police box) | 0.6941 | 2.6570 | 21.9616 | 5.5045 | 798
5 | 시기상조 (Premature) | 0.6814 | 2.4236 | 19.3799 | 5.0183 | 524
6 | 일선 (Frontline) | 0.6573 | 2.2796 | 17.9550 | 4.7403 | 1,664
7 | 검찰 (Prosecutor) | 0.6427 | 2.2701 | 17.6364 | 4.7098 | 466,911
8 | 수사력 (Investigative power) | 0.6412 | 2.5455 | 20.8251 | 5.2700 | 542
9 | 치안 (Policing) | 0.6378 | 2.7753 | 21.1061 | 5.7414 | 2,538
10 | 자치경찰 (Autonomous police) | 0.6261 | 2.7440 | 22.5492 | 5.6740 | 12,491

Table 8. News comment word similarity (2-year cycle).

Rank | Word (2017–2018) | Vector | Word (2019–2020) | Vector | Word (2021–2022) | Vector
1 | 수사 (Investigation) | 0.7536 | 종결 (Closing a case) | 0.7750 | 종결 (Closing a case) | 0.7360
2 | 검찰 (Prosecutor) | 0.7414 | 경찰관 (Police officers) | 0.7692 | 수사 (Investigation) | 0.6835
3 | 지구대 (Police precinct) | 0.6379 | 짭새 (Jjapsae, slang for police) | 0.7287 | 검찰 (Prosecutor) | 0.6442
4 | 종결 (Closing a case) | 0.6298 | 파출소 (Police box) | 0.7273 | 역량 (Capability) | 0.6210
5 | 일선 (Frontline) | 0.6279 | 지구대 (Police precinct) | 0.7256 | 수사관 (Investigator) | 0.6040
6 | 경찰관 (Police officers) | 0.6098 | 경찰서 (Police station) | 0.7007 | 검경 (Prosecutors and police) | 0.5911
7 | 독립 (Independence) | 0.6036 | 경찰권 (Police rights) | 0.6846 | 현장 (Field) | 0.5909
8 | 기소 (Prosecution) | 0.6003 | 자치경찰 (Autonomous police) | 0.6709 | 조정 (Adjustment) | 0.5895
9 | 자치경찰 (Autonomous police) | 0.5978 | 치안 (Policing) | 0.6618 | 치안 (Policing) | 0.5824
10 | 검사 (Prosecutor) | 0.5966 | 시기상조 (Premature) | 0.6504 | 경찰관 (Police officers) | 0.5811

Table 9. Top related-word news comments.

Related word | Comment | Number of interests
종결 (Closing a case) | 경찰이 1차수사권을 가진 이상 사법경찰의 그 많은 사건 수사 중 경찰이 임의로 덮어버려서 묻히는 사건들이 꽤 많아질 겁니다... (As long as the police hold the right of primary investigation, quite a few cases will be buried because the police arbitrarily cover them up. In the past, all non-prosecution opinions were handed over to the prosecutor's office, investigated once more, and then closed, so illegal or unfair handling could still be found; in the future even that will be difficult. This part needs more oversight) | 1,314
경찰관 (Police officers) | 대한민국 피의자 인권은 과할 정도입니다. 지금 중요한 건 매맞고 힘빠진 경찰관의 법집행력과 경찰관을 포함한 공무원의 기본적 인권을 지켜줄 때입니다 (Suspects' rights in South Korea are excessive; what matters now is strengthening the law enforcement power of beaten, demoralized police officers and protecting the basic human rights of public servants, including the police) | 1,441
지구대 (Police precinct) | 역삼지구대 3년 근무하면 강남에 30평대 아파트 현금으로 산다면서요. 사실인가요 (I heard that if you work at the Yeoksam precinct for 3 years, you can buy a 30-pyeong apartment in Gangnam in cash. Is it true?) | 626
파출소 (Police box) | 문제는 과다한 업무에 있습니다. 실종사건이 하루에도 지구대나 파출소별로 여러 건 하달되는데, 실종사건만 전담할 수 없고 다른 112 신고사건도 처리해 가면서 실종사건을 처리해야 하는 데서 문제가 발생합니다... (The problem is the excessive workload: several missing-person cases come down to each precinct or police box every day, but officers cannot work on them exclusively and must handle other 112 calls as well. In the end, more police personnel should be assigned to frontline precincts and to the women and youth and criminal divisions, and police stations should be downsized rather than multiplied to secure senior posts) | 267
시기상조 (Premature) | 수사와 기소 분리가 검찰개혁의 요체인 건 맞지. 근데 경찰, 아니 견찰 보니까 수사권 떼어주는 것은 아직 시기상조다 싶다... (Separating investigation and prosecution is indeed the core of prosecution reform, but handing investigative power to the police still seems premature; given their competence and organizational structure, they are bound to be beholden to the regime. I firmly oppose the Corruption Investigation Office, an omnipotent presidential guard that monopolizes both prosecution and investigative powers and can even take over cases handled by the prosecution) | 904
일선 (Frontline) | 나도 검사였다면 범죄 소탕 위해 일선에서 목숨 다해 일했을 텐데 ㅠㅠ 검사님들 좀만 더 힘내주세요. 국민이 함께합니다 (If I were a prosecutor, I too would have worked on the front lines with my life to fight crime. Prosecutors, please hang in there; the people are with you) | 2,905
검찰 (Prosecutor) | 검찰 응원합니다 (Cheers to the prosecution) | 25,129
수사력 (Investigative power) | 검찰개혁의 목적이 수사권 독립으로 살아있는 권력도 수사하라는 것인데, 지금의 행태는 살아있는 권력을 수사하면 아웃... (The purpose of prosecution reform was to make investigations independent so that even living power could be investigated, but the current behavior is that investigating living power gets you removed; abusing personnel authority excessively suppresses the prosecution's investigative power and destroys judicial order) | 163
치안 (Policing) | 아직도 *** 씨가 민주주의 대통령이라 생각하시나요... (Do you still think *** is a democratic president? He will be recorded as the worst public official, worse than *** who destroyed democracy. Compared with now, *** did better in appointments, the war on crime, and revitalizing the economy; back then anyone who worked hard could buy a house in Seoul, public safety was secure, and criminals could be executed. It is the first time since democratization that I wish for a military coup) | 2,621
치안(Policing)아직도 *** 씨가 민주주의 대통령이라 생각하시나요 이자는 *** 와 더불어 민주주의를 박살낸자 *** 보다 못한 최악의 공직자로 기록될 겁니다 *** 이 *** 보다 나은점은 올바른 인사 범죄와의 전쟁 경제활성화 등 지금과 비교하면 한 20 년 집권했다면 하는 생각이 듭니다 누구나 서울에 노력하면 집한채 살수 있었고 도 없던시절 이었지만 치안안전했고 문재인씨같은 범죄자는 사형시킬수 있었던 그런시절 이었습니다 군사쿠데타가 일어나길 바라는건 민주화이후 처음입니다(Do you still think *** is a democratic president” He will be recorded as the worst public official, worse than *** and the man who destroyed democracy, ***. The only thing *** did better than *** was to fight the right personnel crimes and revitalize the economy. I think if he had been in power for 20 years compared to now, anyone could buy a house in Seoul if they tried, and it was a time when there were no roads, but security was safe and criminals like *** could be executed. It is the first time since democratization that I want a military coup to happen)2,621
자치경찰(Autonomous police)자치경찰이라니 지금도 지역 유지랑 짝짜꿍해서 전라도 염전노예는 탈출해도 경찰이잡아다 주인에게 바치는 수준인데자치경찰로 바뀌면 지역 유지는 거의 지역에서 신으로 군림할듯도대체 미국처럼 땅이 워낙 커서 자치제를 할 수 밖에는 없는 경우도 아니고손바닥만한 나라에서 무슨 지역을 그리 따지는지 암튼 자치경찰 하면 지역 토호의 온갖 범죄와 갑질을 막을 방법은 없을듯(It’s an autonomous police force, so even if a Jeolla Province salt slave escapes, the police catch it and give it to the owner, but if it changes to an autonomous police force, the local maintenance will almost reign as a god in the region. Also, like the United States, the land is so large that it can only be self-governed, and what region is so important in a country the size of a palmAnyway, if you do an autonomous police force, there is no way to prevent all kinds of crimes and robberies in the local toho.2,646

  1. Jo, EK (2012). The current state of affairs of the sentiment analysis and case study based on corpus. The Journal of Linguistic Science, 259-282.
  2. Gai, XT (). Analysis of microblog public opinion based on BERT model, 1-6. https://doi.org/10.1145/3529299.3531496
  3. Cho, K, Van Merrienboer, B, Gulcehre, C, Bahdanau, D, Bougares, F, Schwenk, H, and Bengio, Y. (2014) . Learning phrase representations using RNN encoder-decoder for statistical machine translation. Available: https://arxiv.org/abs/1406.1078
  4. Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, AN, Kaiser, L, and Polosukhin, I (2017). Attention is all you need. Advances in Neural Information Processing Systems. 30, 5998-6008.
  5. Fernandes, J, Giurcanu, M, Bowers, KW, and Neely, JC (2010). The writing on the wall: a content analysis of college students’ Facebook groups for the 2008 presidential election. Mass Communication and Society. 13, 653-675. https://doi.org/10.1080/15205436.2010.516865
  6. Marcoccia, M (2004). On-line polylogues: conversation structure and participation framework in internet newsgroups. Journal of Pragmatics. 36, 115-145. https://doi.org/10.1016/S0378-2166(03)00038-9
  7. Noelle-Neumann, E (1974). The spiral of silence a theory of public opinion. Journal of Communication. 24, 43-51. https://doi.org/10.1111/j.1460-2466.1974.tb00367.x
  8. Tsagkias, M, Weerkamp, W, and De Rijke, M (2010). News comments: exploring, modeling, and online prediction. Advances in Information Retrieval. Heidelberg, Germany: Springer, pp. 191-203 https://doi.org/10.1007/978-3-642-12275-0_19
  9. Preiss, J, Stevenson, M, and Gaizauskas, R (2015). Exploring relation types for literature-based discovery. Journal of the American Medical Informatics Association. 22, 987-992. https://doi.org/10.1093/jamia/ocv002
  10. Chakraborty, G, Pagolu, M, and Garla, S (2014). Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. Cary, NC: SAS Institute
  11. Lee, G, Jeong, J, Seo, S, Kim, C, and Kang, P. (2017) . Sentiment classification with word attention based on weakly supervised learning with a convolutional neural network. Available: https://arxiv.org/abs/1709.09885
  12. Kristiyanti, DA, Umam, AH, Wahyudi, M, Amin, R, and Marlinda, L . Comparison of SVM & naïve Bayes algorithm for sentiment analysis toward west java governor candidate period 2018–2023 based on public opinion on twitter., Proceedings of 2018 6th International Conference on Cyber and IT Service Management (CITSM), 2018, Parapat, Indonesia, Array, pp.1-6. https://doi.org/10.1109/CITSM.2018.8674352
  13. Haihong, E, Hu, Y, Peng, H, Zhao, W, Xiao, S, and Niu, P (2019). Theme and sentiment analysis model of public opinion dissemination based on generative adversarial network. Chaos, Solitons & Fractals. 121, 160-167. https://doi.org/10.1016/j.chaos.2018.11.036
  14. Li, S, Liu, Z, and Li, Y (2020). Temporal and spatial evolution of online public sentiment on emergencies. Information Processing & Management. 57. article no 102177
  15. Kim, D, and Kang, P (2022). Cross-modal distillation with audio– text fusion for fine-grained emotion classification using BERT and Wav2vec 2.0. Neurocomputing. 506, 168-183. https://doi.org/10.1016/j.neucom.2022.07.035
  16. Yang, Z, Yang, D, Dyer, C, He, X, Smola, A, and Hovy, E . Hierarchical attention networks for document classification., Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, San Diego, CA, pp.1480-1489.
  17. Vinyals, O, Kaiser, L, Koo, T, Petrov, S, Sutskever, I, and Hinton, G (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems. 28, 2773-2781.
  18. Bahdanau, D, Cho, K, and Bengio, Y. (2014) . Neural machine translation by jointly learning to align and translate. Available: https://arxiv.org/abs/1409.0473
  19. Luong, MT, Pham, H, and Manning, CD. (2015) . Effective approaches to attention-based neural machine translation. Available: https://arxiv.org/abs/1508.04025
  20. Bania, RK (2020). COVID-19 public tweets sentiment analysis using TF-IDF and inductive learning models. INFOCOMP Journal of Computer Science. 19, 23-41.
  21. Rachman, FH . Twitter sentiment analysis of Covid-19 using term weighting TF-IDF and logistic regression., Proceedings of 2020 6th Information Technology International Seminar (ITIS), 2020, Surabaya, Indonesia, Array, pp.238-242. https://doi.org/10.1109/ITIS50118.2020.9320958
  22. Antonio, VD, Efendi, S, and Mawengkang, H (2022). Sentiment analysis for covid-19 in Indonesia on Twitter with TF-IDF featured extraction and stochastic gradient descent. International Journal of Nonlinear Analysis and Applications. 13, 1367-1373. https://doi.org/10.22075/ijnaa.2021.5735
  23. Devlin, J, Chang, MW, Lee, K, and Toutanova, K. (2018) . BERT: pre-training of deep bidirectional transformers for language understanding. Available: https://arxiv.org/abs/1810.04805
  24. Gao, Z, Feng, A, Song, X, and Wu, X (2019). Target-dependent sentiment classification with BERT. IEEE Access. 7, 154290-154299. https://doi.org/10.1109/ACCESS.2019.2946594
  25. Nugroho, KS, Sukmadewa, AY, and Yudistira, N . Large-scale news classification using BERT language model: Spark NLP approach., Proceedings of the 6th International Conference on Sustainable Information Engineering and Technology, 2021, Malang, Indonesia, Array, pp.240-246. https://doi.org/10.1145/3479645.3479658
  26. Adhikari, A, Ram, A, Tang, R, and Lin, J. (2019) . DocBERT: BERT for document classification. Available: https://arxiv.org/abs/1904.08398
  27. SK Telecom. (2021) . KoBERT: SKT open-source. Available: https://sktelecom.github.io/project/kobert/
  28. Mikolov, T, Chen, K, Corrado, G, and Dean, J. (2013) . Efficient estimation of word representations in vector space. Available: https://arxiv.org/abs/1301.3781
  29. Mikolov, T, Sutskever, I, Chen, K, Corrado, GS, and Dean, J (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. 26, 3111-3119.
  30. Rong, X. (2014) . word2vec parameter learning explained. Available: https://arxiv.org/abs/1411.2738
  31. Riva, M. (2023) . Word embeddings: CBOW vs Skip-Gram. Available: https://www.baeldung.com/cs/word-embeddings-cbow-vs-skip-gram
  32. SKTBrain. (2022) . KoBERT. Available: https://github.com/SKTBrain/KoBERT
  33. Park, J. (2020) . KoBERT-nsmc. Available: https://github.com/monologg/KoBERT-nsmc
  34. Valente, A, Tudisca, V, Pelliccia, A, Cerbara, L, and Caruso, MG (2023). Comparing liberal and conservative newspapers: diverging narratives in representing migrants?. Journal of Immigrant & Refugee Studies. 21, 411-427. https://doi.org/10.1080/15562948.2021.1985200
  35. Stafford, T. (2014) . Psychology: why bad news dominates the headlines. Available: https://www.bbc.com/future/article/20140728-why-is-all-the-news-bad
  36. Christopherson, KM (2007). The positive and negative implications of anonymity in Internet social interactions: “On the Internet, Nobody Knows You’re a Dog”. Computers in Human Behavior. 23, 3038-3056. https://doi.org/10.1016/j.chb.2006.09.001
  37. Lavin, M (2019). Analyzing documents with TF-IDF. Available: http://dx.doi.org/10.46430/phen0082

Changwon Baek is a Ph.D. candidate in criminology at the Korean National Police University. He is currently affiliated with the Korea Institute of Science and Technology. His research interests are data science and NLP. E-mail: baekcw@kist.re.kr

Jiho Kang received his Ph.D. degree in Industrial Management Engineering from Korea University, South Korea. His research interests include data science, patent data mining, and natural language understanding. E-mail: kangmae@korea.ac.kr

SangSoo Choi received a B.S. degree in Electronics Engineering from Korea Aerospace University, Korea, in 1999 and the M.S. and Ph.D. degrees in Information and Communications from the Gwangju Institute of Science and Technology (GIST), Korea, in 2001 and 2004, respectively. From 2004 to 2021, he was a Principal Research Engineer with the Production Engineering Research Institute, LG Electronics Inc., Korea. Since 2021, he has been affiliated with the Korea Institute of Science and Technology (KIST), Korea, where he is currently the Head of the Technological Convergence Center. His current research interests include smart automation, artificial intelligence, inspection, convergence technology, and other applications. E-mail: schoi@kist.re.kr


Abstract

Online news articles and comments play a vital role in shaping public opinion. Numerous studies have conducted online opinion analyses using these as raw data. Bidirectional encoder representations from transformer (BERT)-based sentiment analysis of public opinion have recently attracted significant attention. However, owing to its limited linguistic versatility and low accuracy in domains with insufficient learning data, the application of BERT to Korean is challenging. Conventional public opinion analysis focuses on term frequency; hence, low-frequency words are likely to be excluded because their importance is underestimated. This study aimed to address these issues and facilitate the analysis of public opinion regarding Korean news articles and comments. We propose a method for analyzing public opinion using word2vec to increase the word-frequency-centered analytical limit in conjunction with KoBERT, which is optimized for Korean language by improving BERT. Naver news articles and comments were analyzed using a sentiment classification model developed on the KoBERT framework. The experiment demonstrated a sentiment classification accuracy of over 90%. Thus, it yields faster and more precise results than conventional methods. Words with a low frequency of occurrence, but high relevance, can be identified using word2vec.

Keywords: KoBERT, Word2vec, Public opinion analysis, Sentiment classification

1. Introduction

With the development of information and communication technology, online media has become an essential means of disseminating information. The digital divide has also narrowed as anyone can easily access online news. Readers not only read news articles but also actively express their opinions by leaving comments. Such online comments enable the exchange of various viewpoints on an issue, which has a lasting effect on public opinion. Public opinion formed by the exchange of ideas in online comments influences administrative, legislative, and judicial policy decisions in various ways, including through public petitions [1]. Therefore, many studies have been conducted to collect online comments and analyze public opinion. Recently, a neural network model using bidirectional encoder representations from transformers (BERT) was applied to these studies [2]. However, BERT is an English-language-based model that is not well suited to Korean [3, 4]. Applying BERT to Korean therefore leads to inaccurate analysis results owing to errors in Korean morphological analysis and word embedding. In addition, previous studies have used several algorithms, including term frequency-inverse document frequency (TF-IDF), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), to analyze comments. Although TF-IDF is simple to implement and provides a relatively accurate analysis based on word frequency, it does not consider the order or context of words, rendering it difficult to fully understand their meaning. By contrast, LSA considers the semantic similarity between words and identifies meaning more accurately, and LDA analyzes the topics to which a comment belongs, enabling it to cover multiple topics. However, both LSA and LDA have high computational costs and are difficult to interpret. Word2vec calculates the semantic similarity between words by representing them as vectors and considering their order and context, yielding a more accurate analysis.
This method is cost-effective and less likely to exclude important words because of their low frequency. It also helps ensure that a large volume of comments from a particular group is not mistaken for mainstream public opinion, so that decisions informed by public opinion rest on a sounder basis [5–7]. Various research methods have attempted to address these issues, such as using supervised and unsupervised learning to remove duplicate comments or utilizing comment cluster analysis [8].

The main contributions of this paper can be summarized as follows:

  • The objective of this study is to develop a specialized sentiment analyzer for online opinion analysis in Korean. To achieve this, we created an online public opinion sentiment analysis model using KoBERT, a pretrained deep learning model optimized for the Korean language. The model analyzes issues determined as positive or negative by the researcher. To overcome the linguistic limitations of BERT, we employed KoBERT, a transformer language model pretrained on a large Korean corpus.

  • Online public opinion analysis is often limited by its reliance on word frequency analysis, which may exclude important but infrequently used opinion words. To overcome this limitation, we used the word embedding algorithm word2vec, which captures the correlation between words by locating each word in a high-dimensional vector space and identifying words that are highly correlated with the target word. Using this algorithm, we aim to improve the accuracy and comprehensiveness of our online opinion analyses.

  • Developing a highly accurate sentiment analyzer can significantly enhance the efficiency and speed of online public opinion analyses. Such analysis can offer valuable insights for shaping policies related to politics, economics, and social issues. Hence, policymakers and researchers who aim to make informed decisions must prioritize the development of an accurate sentiment analyzer.

Sections 2–5 of this paper present a literature review, data collection and analysis, experimental results, and conclusions.

2. Literature Review

2.1 News Comment and Online Opinion

Social media websites provide opportunities for active participation in democratic processes [5]. Online news is an example of a social media platform in which readers express their opinions and feelings through comments. News comments can be considered a form of multilateral dialogue [6]. Public opinion is formed by observing, hearing, and perceiving others’ opinions [7]. In other words, an online zone in which anyone can easily communicate with a diverse group of individuals and recognize, accept, and critique the viewpoints of others can be viewed as an ideal environment for forming public opinion. Therefore, news and comments are important data for public opinion analysis, and the number of studies on developing algorithms or tools to analyze them is increasing [8].

2.2 Text Mining

The extraction of useful information from textual data is known as text mining. It is an effective research method for deriving meaningful and valuable information and knowledge by discovering hidden relationships or patterns in data and analyzing the morphemes contained in unstructured text data [9]. It can be used to identify the main latent content in a text [10]. Its applications include document classification, clustering, and summarization, as well as information extraction and sentiment analysis. Decision trees, recurrent neural networks (RNNs), and BERT are text analysis methods that employ supervised learning. Clustering, LDA, the dynamic multitopic model, and the expressed agenda model are examples of unsupervised learning methods. Recently, methodologies such as attentive long short-term memory (LSTM) and tensor fusion networks have been studied [11]. In particular, sentiment analysis is an important method for understanding online public opinion and has been used in numerous studies [12–15]. Natural language processing (NLP) models include document classification [16], syntax analysis [17], and machine translation [18, 19], which focus on words or phrases. TF-IDF, typically used in text mining, including online opinion analysis, is an algorithm that weights terms by their frequency and inverse document frequency [20, 21]. Because high-frequency words are given more weight, low-frequency words are often excluded from analysis. To overcome these limitations, several studies have been conducted to improve the accuracy of algorithms that use probabilistic methods [22].
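The TF-IDF weighting described above can be sketched in a few lines. This is a minimal illustration using the plain log(N/df) form (libraries such as scikit-learn use a smoothed variant), showing how a term that appears in every document receives zero weight while a rare but distinctive term does not:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus: "police" appears in every document, so its IDF (and hence
# its weight) collapses to zero, while the rare "precinct" keeps a
# non-zero weight despite its low frequency.
corpus = [
    ["police", "reform", "investigation"],
    ["police", "precinct", "investigation"],
    ["police", "reform", "budget"],
]
w = tf_idf(corpus)
```

This is exactly the behavior the paragraph criticizes: frequency-based weighting can zero out or bury terms, which motivates the word2vec-based approach used later in the paper.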

2.3 KoBERT and word2vec

Although existing deep learning methodologies initialize weights to random values when creating a model, pretraining leverages weights obtained from learning on other problems as initial values. A problem that uses pretrained weights in this manner is called a downstream task. BERT improves downstream task performance by using language models for pretraining [23]. It has demonstrated excellent performance in sentence classification for various applications, including sentence sentiment analysis [24], major news topic classification [25], and document classification [26]. In this study, we used KoBERT, developed by the Korean company SKT to improve Google BERT’s performance on Korean; a case study demonstrated a 2.6% performance improvement compared with BERT [27]. word2vec places words in a high-dimensional vector space, enabling the associations between different words to be evaluated based on their proximity. It has two model architectures: continuous bag-of-words (CBOW) and skip-gram (SG). The CBOW model predicts a target word based on its surrounding words, whereas the SG model predicts the surrounding words based on a target word [28, 29].

3. Data Collection and Analysis

This study followed the procedure depicted in Figure 1 for data collection, preprocessing, and analysis of online comments.

3.1 Data Collection

The data used in this study were collected from Naver News, a major Korean web portal. We developed a Python-based web crawler to collect and store the data. Using the BeautifulSoup library, we searched the Naver News webpage for “improvement of investigative authority.” After parsing the webpages with HTML and CSS parsers, the title, content, and comments were collected. As shown in Figures 2 and 3, the collected data were classified into news article and comment tables and stored in SQLite, a relational database.
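The parse-and-store step described above can be sketched with the standard library alone. The snippet below is an illustration, not the authors’ crawler: the HTML fragment, class names, and table schema are hypothetical stand-ins for the actual Naver News markup, and Python’s built-in `html.parser` replaces BeautifulSoup so the example is self-contained:

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a fetched news page.
PAGE = """
<div class="title">Sample headline</div>
<div class="comment">First comment</div>
<div class="comment">Second comment</div>
"""

class NewsParser(HTMLParser):
    """Collects text from class-tagged <div> elements."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.title = None
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.current = dict(attrs).get("class") if tag == "div" else None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current == "title":
            self.title = text
        elif self.current == "comment":
            self.comments.append(text)

parser = NewsParser()
parser.feed(PAGE)

# Two tables, mirroring the article/comment split described in the paper.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE comments (article_id INTEGER, body TEXT)")
cur = db.execute("INSERT INTO articles (title) VALUES (?)", (parser.title,))
article_id = cur.lastrowid
db.executemany("INSERT INTO comments VALUES (?, ?)",
               [(article_id, c) for c in parser.comments])
db.commit()
```

Storing articles and comments in separate tables keyed by article ID, as sketched here, is what later allows article-level and comment-level sentiment to be compared.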

3.2 Morpheme Analysis

The unstructured text collected by the crawler must be segmented into morphemes before the comments can be analyzed. However, morphological analysis of Korean comments presents significant challenges. First, most morpheme analysis libraries were developed for English and adapt poorly to Korean. Second, Hangul spacing rules, including those set by the National Institute of Korean Language, have many exceptions. Third, because endings and postpositions attach to and change with words, splitting text on spaces does not yield proper words, making accurate analysis impossible. To address these issues, this study analyzed morphemes using the “Mecab-ko library” of the “Eunjeon Hannip Project,” a morpheme analyzer widely used in Korea.
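The third problem, attached particles, can be illustrated with a toy particle stripper. This is only a sketch of the failure mode that a proper analyzer such as Mecab-ko resolves correctly; the particle list is illustrative, and real Korean morphology is far more involved:

```python
# Korean postpositions (josa) attach directly to nouns, so naive
# space-based tokenization treats 경찰이, 경찰을, and 경찰의 as three
# different words. A toy suffix stripper makes the point:
JOSA = ("이", "가", "을", "를", "은", "는", "의", "에", "에서", "으로", "와", "과")

def strip_josa(token):
    # Try longer particles first so "에서" wins over "에".
    for josa in sorted(JOSA, key=len, reverse=True):
        if len(token) > len(josa) and token.endswith(josa):
            return token[:-len(josa)]
    return token

tokens = "경찰이 수사권을 경찰의".split()
stems = [strip_josa(t) for t in tokens]
```

A real morphological analyzer uses a dictionary and a trained model rather than suffix matching, which is why the study relies on Mecab-ko instead of heuristics like this.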

3.3 word2vec Modeling

For a computer to analyze text data, the data must first be converted into a mathematically processable form through a process known as “embedding.” In this study, word2vec was used to embed the text data. word2vec, a library developed by Google, learns quickly by placing highly related words (those that appear in similar contexts) at nearby positions in the vector space. The hyperparameters used to generate the word2vec model are listed in Table 1. Although both CBOW and SG are representative learning models, we used SG in this study because of its excellent performance in many cases [30, 31].
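The skip-gram objective can be illustrated by the (target, context) training pairs it extracts from a sentence within the window hyperparameter of Table 1. This is a conceptual sketch, not the study’s actual training pipeline (a library such as gensim handles pair generation internally):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as the skip-gram model sees them.

    Skip-gram predicts each context word from the target; CBOW would
    instead predict the target from the set of its context words.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["police", "investigative", "authority", "reform"]
pairs = skipgram_pairs(sentence, window=1)
# With window=1, "authority" yields the pairs
# ("authority", "investigative") and ("authority", "reform").
```

Because every co-occurrence within the window becomes a training pair, even low-frequency words receive informative updates, which underpins the paper’s argument that word2vec surfaces relevant but infrequent terms.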

3.4 Preprocessing Synonyms

When attempting to select a target word and extract highly relevant surrounding words using word2vec, accurate results are difficult to obtain at the initial stage. This is because the morpheme analyzer (Mecab-ko) does not, by itself, appropriately handle stop words, synonyms, compound nouns, and neologisms. Therefore, unnecessary “josa” and other stop words must be eliminated, typos corrected, and synonyms integrated. In addition, neologisms were added to the user dictionary of the morpheme analyzer. In this study, preprocessing was applied only to the surrounding words that the word2vec model, trained on all comments, identified as highly relevant to the target word. Extracting and refining only the words surrounding the target word reduces the time required for preprocessing, thereby improving efficiency.
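The preprocessing pass described above (stop-word removal, synonym integration, and typo correction) can be sketched as a dictionary lookup. The word lists below are illustrative stand-ins, not the study’s actual dictionaries:

```python
# Minimal normalization pass: drop stop words, then map synonyms and
# common misspellings onto a canonical form. All entries are examples.
STOP_WORDS = {"것", "등", "수"}
CANONICAL = {
    "검찰청": "검찰",   # synonym -> canonical term
    "겅찰": "경찰",     # typo -> corrected term
}

def normalize(tokens):
    out = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue              # remove stop words
        out.append(CANONICAL.get(tok, tok))  # merge synonyms/typos
    return out

cleaned = normalize(["경찰", "것", "검찰청", "겅찰"])
```

Applying this mapping only to the candidate neighbor words returned by word2vec, rather than to the full comment corpus, is the efficiency gain the paragraph describes.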

3.5 Related Word Analysis

The final word2vec-based analysis model extracts, from the preprocessed morpheme list, keywords related to the target word. The extracted words are represented in the vector space, and the degree of association between the target word and a specific neighboring word is calculated as the cosine similarity of the corresponding vectors. The cosine similarity is the dot product of two vectors divided by the product of their magnitudes, as shown in Eq. (1).

$$\text{Similarity} = \cos(\theta) = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}. \tag{1}$$

Here, θ is the angle between the two vectors A and B, and A·B is their dot product. Instead of interpreting the meaning of the extracted keywords on their own, we analyzed readers’ interest in the news articles containing those words. Naver enables readers to express their feelings about news articles with icons such as “like,” “warm,” “sad,” “angry,” and “I want a follow-up article.” We can therefore objectively determine whether a news article attracts great interest based on the number of emotional expressions.
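Eq. (1) translates directly into code. The sketch below ranks hypothetical neighbors of a target word by cosine similarity, the same ranking that gensim’s `most_similar()` performs; the three-component vectors are toy values, as real word2vec vectors have far more dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity, computed exactly as in Eq. (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: "precinct" points in nearly the same direction as
# "police", while "budget" points elsewhere.
vectors = {
    "police":   [0.9, 0.1, 0.2],
    "precinct": [0.8, 0.2, 0.3],
    "budget":   [0.1, 0.9, 0.1],
}
target = vectors["police"]
ranked = sorted(
    ((w, cosine(target, v)) for w, v in vectors.items() if w != "police"),
    key=lambda wv: wv[1], reverse=True,
)
```

Because cosine similarity depends on vector direction rather than raw counts, a rare word like “Police Precinct” can still rank near the top, which is the key property the analysis exploits.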

4. Results

4.1 Time Series Analysis

We searched for the phrase “improvement of investigative authority” in Naver News from May 9, 2017, to June 6, 2022. A total of 39,697 news articles and 4,426,875 comments were collected using a crawler, including information such as “news title,” “news posting date,” “media company,” and “comments.” The number of monthly news articles and comments showed similar trends, as illustrated in Table 2 and Figure 4. The correlation coefficient between the two variables is 0.8634, indicating a strong positive correlation. This is consistent with other studies, indicating that news articles and comments have the same tendency and potential to shape online public opinion [6].
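The reported correlation (0.8634) is a Pearson coefficient. A minimal sketch of the computation, with hypothetical monthly counts standing in for the actual Table 2 data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented monthly counts: comment volume rises and falls with article
# volume, producing a strong positive correlation as in the study.
articles = [120, 340, 560, 410, 980, 760]
comments = [9000, 30000, 51000, 36000, 95000, 70000]
r = pearson(articles, comments)
```

In practice the same value can be obtained with `numpy.corrcoef`; the explicit form above simply mirrors the definition behind the reported 0.8634.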

4.2 Development of a KoBERT Classification Model

A dataset of 233,352 samples was used, of which 175,012 formed the training set and 58,340 the test set. The optimal stopping point to prevent overfitting during model training was determined using the EarlyStopping callback function. As shown in Figures 5 and 6, the optimal epoch, validation accuracy, and validation loss were 4, 0.9051, and 0.2278, respectively. The performance of the proposed model is presented in Table 3; it performed similarly to or slightly better than the KoBERT baselines on GitHub (0.9010, 0.8963) [32, 33]. In this study, the KoBERT-based learning model classified the sentiment (positive or negative) of all 4,426,875 comments with more than 90% accuracy without researcher intervention, indicating that the sentiment of online public opinion can be grasped quickly and accurately.
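The EarlyStopping behavior can be sketched as a loop that halts when validation loss stops improving for a set patience and keeps the best epoch. The loss values below are invented, chosen only so that the best epoch is 4 with loss 0.2278, matching the figures reported above:

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, best_loss), stopping after `patience`
    consecutive epochs without improvement in validation loss."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss

# Loss improves through epoch 4, then rises, so training halts early.
losses = [0.61, 0.38, 0.27, 0.2278, 0.241, 0.252, 0.263]
best_epoch, best_loss = early_stopping(losses, patience=2)
```

Framework callbacks such as Keras’s `EarlyStopping` additionally restore the best weights; the loop above only captures the stopping criterion itself.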

4.3 News Article Content Sentiment Analysis

News articles containing the target word “Police” were extracted. Of the 39,697 articles collected, 6,465 satisfied this condition. After applying the KoBERT-based sentiment classification model to these police-related articles, 3,164 (49%) were classified as positive and 3,301 (51%) as negative. As illustrated in Figure 7, the numbers of positive and negative responses to articles on the “improvement of investigative authority” containing the word “police” are therefore comparable.

4.4 Comment Sentiment Analysis

After excluding comments removed for policy violations or deleted by their authors, 2,903,069 comments were analyzed. As illustrated in Figure 8, negative comments constituted an overwhelming majority, accounting for 86% of all comments analyzed, whereas positive comments constituted only 14%.

4.5 Comparing News Article and Comment Sentiment Analysis

The sentiment analysis of news articles on the “improvement of investigative authority” yielded 49% positive and 51% negative results. News articles are written from both conservative and liberal perspectives; similar proportions of positive and negative sentiment suggest that conservative and liberal media outlets publish similar numbers of articles [34]. However, when we analyzed the comments, we found that the majority were negative (19% positive and 81% negative). This is consistent with the argument that readers are more interested in negativity and that anonymity drives negative comments [35, 36].

4.6 Top Related Word News Article Analysis

The results of using word2vec to identify words closely related to “Police” in the content of the collected articles are illustrated in Tables 4 and 5 and Figure 9. Based on the cosine similarity values, “Police” correlates most strongly with the following terms: “National Police Agency” (0.7511), “Police Commissioner” (0.7162), “Police Officers” (0.6993), “Investigation” (0.6845), and “Police Station” (0.6480). In particular, “Police Precinct” appeared only 590 times yet had a cosine similarity of 0.6445, ranking sixth. This distinguishes the current study’s approach to understanding public opinion from the term-frequency-based approach used in previous studies [37].

4.7 Analysis of Important News Articles

The titles of news articles that are of high interest to the reader among the news content, including the keywords in Table 4, are shown in Table 6. These articles are considered to have a significant influence on shaping online public opinion. Naver readers express their interest in an article by selecting emoticons at the bottom of the news content. In this study, the number of emoticons was used as a measure of interest to identify the most popular articles. Consequently, important news articles that were previously subjectively evaluated by researchers can now be extracted using objective criteria.

4.8 Top Related Word Comments Analysis

With “Police” as the target word, word2vec was used to identify highly relevant keywords, and for each keyword, the comment containing it with the most sympathetic reactions was extracted. As illustrated in Table 7 and Figure 10, the term frequency of “Frontline” (1,664) is less than half that of “Autonomous police” (12,491). Nevertheless, as listed in Table 9, the most sympathized comment for “Autonomous police” (2,646 reactions) received fewer reactions than that for “Frontline” (2,905). Employing word2vec thus enables more accurate identification of words that strongly influence online public opinion despite a low frequency of occurrence.

5. Conclusion

Online opinion analysis is faster and more cost-effective than conventional survey methods and can be performed in near real time. In this study, online news articles and comments were collected using the search term “improvement of investigative authority”: a crawler gathered the unstructured text data, and a KoBERT neural network model was trained for sentiment classification.

The monthly numbers of news articles and comments showed a strong positive correlation (0.8634); that is, articles and comments increased and decreased together. In other words, news articles on topics of high social interest tended to generate stronger online opinions as the number of comments increased.

The KoBERT text-sentiment classification model achieved an accuracy above 90%. Applying it to the 6,465 news articles containing the word “Police,” 3,164 (49%) were classified as positive and 3,301 (51%) as negative, with no significant difference between the two. Of the comments, 118,362 (81%) were classified as negative and 27,351 (19%) as positive. Because classification is performed by the artificial intelligence model rather than by the researcher, sentiment analysis of online public opinion is faster and more accurate than in existing studies that rely on manual labeling.

When word2vec was used to identify words highly associated with “Police” in the article content, “National Police Agency” (0.7511), “Police commissioner” (0.7162), “Police officers” (0.6993), “Investigation” (0.6845), and “Police station” (0.6480) had the highest similarity values. Similarly, the words most related to “Police” in the news comments included “Closing a case” (0.7657), “Police officers” (0.7459), and “Police precinct” (0.7058).
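The article-comment correlation cited in the conclusion is an ordinary Pearson coefficient over the monthly counts. As a sketch, the six monthly counts below are the 2017.05-2017.10 rows of Table 2; the paper's full-series value is 0.8634.

```python
# Pearson correlation between monthly article counts and comment counts.
import numpy as np

articles_per_month = np.array([979, 349, 1028, 611, 299, 361])
comments_per_month = np.array([77490, 22979, 57043, 22361, 14271, 23407])

r = float(np.corrcoef(articles_per_month, comments_per_month)[0, 1])
print(round(r, 4))  # strongly positive on this subsample as well
```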
Related keywords with relatively low frequencies, which would have been excluded from previous studies that focused solely on term frequency, were also captured. Word2vec makes it possible to identify words that would otherwise go unnoticed because of their low frequency of occurrence, even though they strongly influence online opinion formation. Consequently, online public opinion can be analyzed quickly and accurately by collecting and analyzing online news articles and comments with KoBERT and word2vec. This study aimed to develop a specialized sentiment analyzer for online opinion analysis in Korean using the word2vec algorithm. However, the accuracy of the word2vec model could not be verified because no validation data were available during training. In addition, word2vec is limited in sentence-level natural language processing because it does not consider sentence context. To overcome these limitations, document-level embedding algorithms such as doc2vec should be combined with it.

Furthermore, future research should collect news articles and comments from portals beyond Naver and analyze the differences between portal sites. This will lead to a more accurate understanding of online public opinion.

Figure 1. Data collection and analysis process.

The International Journal of Fuzzy Logic and Intelligent Systems 2023; 23: 244-258https://doi.org/10.5391/IJFIS.2023.23.3.244

Figure 2. News article table.

Figure 3. News comment table.

Figure 4. Number of news articles and comments per month.

Figure 5. Model training result.

Figure 6. Model training accuracy and loss.

Figure 7. News article content sentiment analysis.

Figure 8. News comment sentiment analysis.

Figure 9. News article content WordCloud.

Figure 10. News comments WordCloud.

Table 1. Word2vec hyperparameters.

Parameter          | Value
Minimum frequency  | 5
Layer size         | 100
Learning rate      | 0.15
Iterations         | 100
Window size        | 3
Sub-sampling       | 10^-2
Negative sampling  | 15

Table 2. Number of news articles and comments per month.

Year.Month | Articles | Comments || Year.Month | Articles | Comments
2017.05 | 979 | 77,490 || 2019.12 | 2,357 | 275,419
2017.06 | 349 | 22,979 || 2020.01 | 2,285 | 258,179
2017.07 | 1,028 | 57,043 || 2020.02 | 765 | 152,093
2017.08 | 611 | 22,361 || 2020.03 | 165 | 44,353
2017.09 | 299 | 14,271 || 2020.04 | 0 | 570
2017.10 | 361 | 23,407 || 2020.05 | 297 | 23,221
2017.11 | 290 | 12,689 || 2020.06 | 401 | 42,001
2017.12 | 237 | 8,248 || 2020.07 | 0 | 638
2018.01 | 806 | 69,150 || 2020.08 | 726 | 92,750
2018.02 | 161 | 15,865 || 2020.09 | 651 | 51,418
2018.03 | 797 | 71,034 || 2020.10 | 0 | 141
2018.04 | 404 | 50,118 || 2020.11 | 299 | 38,804
2018.05 | 243 | 11,142 || 2020.12 | 1,546 | 235,718
2018.06 | 1,651 | 59,189 || 2021.01 | 0 | 594
2018.07 | 408 | 11,988 || 2021.02 | 0 | 27
2018.08 | 183 | 9,788 || 2021.03 | 2,021 | 184,579
2018.09 | 100 | 6,035 || 2021.04 | 0 | 223
2018.10 | 243 | 7,662 || 2021.05 | 0 | 28
2018.11 | 462 | 31,200 || 2021.06 | 732 | 52,591
2018.12 | 288 | 44,974 || 2021.07 | 0 | 131
2019.01 | 439 | 26,601 || 2021.08 | 0 | 22
2019.02 | 749 | 69,626 || 2021.09 | 240 | 12,317
2019.03 | 1,827 | 229,259 || 2021.10 | 0 | 149
2019.04 | 1,908 | 239,146 || 2021.11 | 0 | 16
2019.05 | 3,664 | 288,550 || 2021.12 | 207 | 12,855
2019.06 | 1,123 | 104,195 || 2022.01 | 0 | 781
2019.07 | 1,219 | 80,566 || 2022.02 | 0 | 19
2019.08 | 739 | 162,134 || 2022.03 | 548 | 77,496
2019.09 | 1,317 | 376,174 || 2022.04 | 0 | 380
2019.10 | 2,827 | 567,929 || 2022.05 | 0 | 28
2019.11 | 658 | 81,329 || 2022.06 | 87 | 17,192

Table 3. Model performance.

Metric              | Value
Accuracy            | 0.9853
Loss                | 0.0474
Validation accuracy | 0.9051
Validation loss     | 0.2278
Recall              | 0.8529
Precision           | 0.8981
F1                  | 0.8750
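Table 3's F1 score is internally consistent with its precision and recall: F1 is their harmonic mean, F1 = 2PR / (P + R), which can be verified directly.

```python
# Verify that the reported F1 follows from the reported precision and recall.
precision, recall = 0.8981, 0.8529

f1 = 2 * precision * recall / (precision + recall)
assert abs(f1 - 0.8750) < 1e-3  # matches Table 3 to rounding
print(round(f1, 4))
```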

Table 4. News article word similarity and frequencies.

Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency
1 | 경찰청 (National Police Agency) | 0.7511 | 2.0024 | 16.0236 | 3.9065 | 18,251
2 | 청장 (Police commissioner) | 0.7162 | 1.9857 | 16.2410 | 3.9301 | 24,731
3 | 경찰관 (Police officers) | 0.6993 | 2.4249 | 20.2057 | 4.7727 | 6,307
4 | 수사 (Investigation) | 0.6845 | 1.8044 | 13.9328 | 3.5361 | 317,777
5 | 경찰서 (Police station) | 0.6480 | 2.6114 | 20.7929 | 5.0790 | 3,442
6 | 지구대 (Police precinct) | 0.6445 | 3.3773 | 27.9833 | 6.6381 | 590
7 | 자치경찰 (Autonomous police) | 0.6413 | 2.6127 | 21.1340 | 5.1316 | 22,247
8 | 국수본 (National Investigation Headquarters) | 0.6225 | 2.6505 | 20.4286 | 5.2038 | 2,526
9 | 관서 (Government office) | 0.6093 | 3.3562 | 27.0328 | 6.5523 | 373
10 | 치안 (Policing) | 0.5937 | 3.0960 | 24.5606 | 6.0993 | 4,520
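The four similarity/distance columns of Tables 4 and 7 can be computed from a pair of word vectors as sketched below. The random vectors are stand-ins for the trained 100-dimensional embeddings, and the interpretation of "Scaled Euclidean" as per-dimension variance standardization (scipy's `seuclidean` convention) is our assumption, since the paper does not state its definition.

```python
# Cosine, Euclidean, Manhattan, and (assumed) scaled-Euclidean measures
# between two word vectors drawn from a stand-in vocabulary matrix.
import numpy as np

rng = np.random.default_rng(42)
vocab_vectors = rng.normal(size=(50, 100))  # 50 hypothetical words, dim 100
v_police, v_agency = vocab_vectors[0], vocab_vectors[1]

# Cosine similarity (the "Vector" column).
cosine = float(v_police @ v_agency
               / (np.linalg.norm(v_police) * np.linalg.norm(v_agency)))

# Plain Euclidean and Manhattan distances.
euclidean = float(np.linalg.norm(v_police - v_agency))
manhattan = float(np.abs(v_police - v_agency).sum())

# Scaled Euclidean: each dimension standardized by its variance across
# the vocabulary (assumed convention).
variances = vocab_vectors.var(axis=0, ddof=1)
scaled_euclidean = float(np.sqrt(((v_police - v_agency) ** 2 / variances).sum()))

print(cosine, euclidean, manhattan, scaled_euclidean)
```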

Table 5. News article word similarity (2-year cycle).

Rank | 2017–2018 | 2019–2020 | 2021–2022
1 | 청장 (Police commissioner) 0.6871 | 경찰청 (National Police Agency) 0.7754 | 종결 (Closing a case) 0.6683
2 | 수사 (Investigation) 0.6764 | 자치경찰 (Autonomous police) 0.7309 | 수사 (Investigation) 0.6506
3 | 경찰청 (National Police Agency) 0.6568 | 청장 (Police commissioner) 0.7275 | 경찰청 (National Police Agency) 0.6373
4 | 검찰 (Prosecutor) 0.6449 | 경찰관 (Police officers) 0.6738 | 경찰관 (Police officers) 0.5923
5 | 경찰관 (Police officers) 0.6135 | 수사 (Investigation) 0.6729 | 청장 (Police commissioner) 0.5692
6 | 경찰서 (Police station) 0.5890 | 국수본 (National Investigation Headquarters) 0.6418 | 경찰대 (Korea National Police University) 0.5678
7 | 서장 (Police chief) 0.5788 | 지구대 (Police precinct) 0.6412 | 치안 (Policing) 0.5616
8 | 이** (Lee**) 0.5593 | 치안 (Policing) 0.6225 | 검경 (Prosecutors and Police) 0.5583
9 | 일선 (Frontline) 0.5519 | 민** (Min**) 0.6211 | 자치경찰 (Autonomous police) 0.5567
10 | 지구대 (Police precinct) 0.5461 | 경찰대 (Korea National Police University) 0.6158 | 경찰서 (Police station) 0.5518

Table 6. Important news article titles.

Related word | News article title | Number of interests
경찰청 (National Police Agency) | 경찰청 전직원에 “검찰 조국수사 비판 與보고서 읽어라” (“Criticize the prosecutor’s investigation of the National Police Agency and read the report” to former employees of the National Police Agency) | 22,191
청장 (Police commissioner) | 정권 수사한 ‘***참모들’ 모두 유배 보내버렸다 (All ‘*** staffers’ who investigated the regime were sent into exile) | 23,562
경찰관 (Police officers) | 대통령 비판 전단 돌리던 50대 주부... 경찰 신분증 없다고 등뒤로 수갑 채워 (50s housewife handing out flyers criticizing President *... handcuffed behind her back for lack of police ID) | 52,028
수사 (Investigation) | 대통령 비판 전단 돌리던 50대 주부... 경찰 신분증 없다고 등뒤로 수갑 채워 (50s housewife handing out flyers criticizing President *... handcuffed behind her back for lack of police ID) | 52,028
경찰서 (Police station) | 여당 윤석열 검찰과 제2의 전쟁 나섰다 (Ruling party’s prosecutor *** launches second war against prosecutors) | 8,385
지구대 (Police precinct) | ‘**논란’ 들끓는데...* 대통령 연설문에 “공정과 정의” (‘** controversy’ rages... President *’s speech includes “fairness and justice”) | 3,210
자치경찰 (Autonomous police) | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” (** “The *** government has brought South Korea into the ranks of developed countries”) | 19,865
국수본 (National Investigation Headquarters) | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” (** “The *** government has brought South Korea into the ranks of developed countries”) | 19,865
관서 (Government office) | 관서 “檢개혁요구 커지는 현실 성찰... 수사관행 개혁돼야” (종합) (President * “reflection on the growing demand for reform... Investigation practices should be reformed” (Roundup)) | 4,903
치안 (Policing) | *** 검찰총장의 응수 “검찰 개혁 반대한 적 없다” (Prosecutor General *** responds, “I have never opposed prosecution reform.”) | 11,102

Table 7. News comments vector values and frequencies.

Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency
1 | 종결 (Closing a case) | 0.7657 | 2.0360 | 16.0372 | 4.2283 | 7,862
2 | 경찰관 (Police officers) | 0.7459 | 2.2646 | 17.5176 | 4.7085 | 2,926
3 | 지구대 (Police precinct) | 0.7058 | 2.7853 | 21.7033 | 5.7627 | 784
4 | 파출소 (Police box) | 0.6941 | 2.6570 | 21.9616 | 5.5045 | 798
5 | 시기상조 (Premature) | 0.6814 | 2.4236 | 19.3799 | 5.0183 | 524
6 | 일선 (Frontline) | 0.6573 | 2.2796 | 17.9550 | 4.7403 | 1,664
7 | 검찰 (Prosecutor) | 0.6427 | 2.2701 | 17.6364 | 4.7098 | 466,911
8 | 수사력 (Investigative power) | 0.6412 | 2.5455 | 20.8251 | 5.2700 | 542
9 | 치안 (Policing) | 0.6378 | 2.7753 | 21.1061 | 5.7414 | 2,538
10 | 자치경찰 (Autonomous police) | 0.6261 | 2.7440 | 22.5492 | 5.6740 | 12,491

Table 8. News comments word similarity (2-year cycle).

Rank | 2017–2018 | 2019–2020 | 2021–2022
1 | 수사 (Investigation) 0.7536 | 종결 (Closing a case) 0.7750 | 종결 (Closing a case) 0.7360
2 | 검찰 (Prosecutor) 0.7414 | 경찰관 (Police officers) 0.7692 | 수사 (Investigation) 0.6835
3 | 지구대 (Police precinct) 0.6379 | 짭새 (Jjapsae) 0.7287 | 검찰 (Prosecutor) 0.6442
4 | 종결 (Closing a case) 0.6298 | 파출소 (Police box) 0.7273 | 역량 (Capability) 0.6210
5 | 일선 (Frontline) 0.6279 | 지구대 (Police precinct) 0.7256 | 수사관 (Investigator) 0.6040
6 | 경찰관 (Police officers) 0.6098 | 경찰서 (Police station) 0.7007 | 검경 (Prosecutors and Police) 0.5911
7 | 독립 (Independent) 0.6036 | 경찰권 (Police rights) 0.6846 | 현장 (Field) 0.5909
8 | 기소 (Prosecution) 0.6003 | 자치경찰 (Autonomous police) 0.6709 | 조정 (Adjustment) 0.5895
9 | 자치경찰 (Autonomous police) 0.5978 | 치안 (Policing) 0.6618 | 치안 (Policing) 0.5824
10 | 검사 (Prosecutor) 0.5966 | 시기상조 (Premature) 0.6504 | 경찰관 (Police officers) 0.5811

Table 9. Top related words in news comments.

Related word | Comment | Number of interests
종결 (Closing a case) | 경찰이 1 차수사권을 가진 이상사법경찰의 그 많은 사건 수사중 경찰이임의로 덮어버려서 묻히는 사건들이 꽤 많아질겁니다 말로는 인권침해법령위반 관계인의 이의제기 등의 단서를 달아놨지만 저기 시골같은데서 경찰이 아무도 모르게 사건 덮어버리면 알 수가 없죠 검찰로 보내는송부자료도 조작해서 결제만 해 보내면 어찌 알겠어요예전에야 불기소 의견도 모두 검찰에 넘기고 다시 한번 조사를 받고 종결되서 그나마 위법 부당한 처리를 찾을 수 있었지 앞으로는 그것도 어려울듯 이건 좀더 통제가 필요한 부분임 (As long as the police have the right of primary investigation, there will be quite a few cases that are buried because the police arbitrarily cover them up during the investigation of all those cases. In the past, all the non-prosecution opinions were handed over to the prosecutor’s office, investigated once again, and closed, so we were able to find illegal unfair treatment. In the future, it will be difficult to do so) | 1,314
경찰관 (Police officers) | 대한민국 피의자 인권은 괴할정도입니다 지금 중요한건 매맞고 힘빠진 경찰관의 법집행력과경찰관을 포함 공무원의 기본적 인권을 지켜줄 때입니다 (The human rights of detainees in South Korea are abysmal, and it’s time to strengthen law enforcement and protect the basic human rights of public servants, including police officers) | 1,441
지구대 (Police precinct) | 역삼지구대 3 년 근무하면 강남에 30 평대 아파트 현금으로 산다면서요 사실인가요 (I heard that if you work in Yeoksam District for 3 years, you can get a 30-pyeong apartment in Gangnam for cash. Is it true?) | 626
파출소 (Police box) | 문제는 과다한 업무에 있습니다실종사건이 하루에도 지구대나 파출소 별로 여러 건이 하달되는데 실종사건만 전담해서 일을 할 수 없고 다른 112 신고사건도 처리해 가면서 실종사건을 처리해야 하는데서 문제가 발생합니다결국은 경찰인원을일선 지구대나 여청계 형사계 등에 많이 배치해야 하지요 쓸데없이윗자리 확보하려는 경찰서만 자꾸 만들지 말고경찰서를 오히려 축소하여야 합니다 (The problem is that there are several missing persons cases per district or police station per day, but they cannot work exclusively on missing persons cases and have to deal with missing persons cases while handling other 112 calls. In the end, police personnel should be assigned to frontline districts and women’s police departments) | 267
시기상조 (Premature) | 수사와 기소 분리가검찰개혁의 요체인건 맞지 근데 경찰 아니 견찰 보니까 수사권 때어주는 것은 아직 시기상조다 싶다 직무능력도 그렇고 조직 구조상 정권에 딸랑거릴 수밖에없거든 공수처는단호히 반대한다 기소권 수사권 모두 독점 검찰이 맡은 사건도 빼앗을 수 있는 무소 불위의 대통령 친위대 절대 반대다 (The separation of investigation and prosecution is the key to prosecutorial reform, but it’s still premature to hand over investigative powers to the police. They are bound to be beholden to the regime due to their competence and organizational structure. The Public Prosecutor’s Office is adamantly opposed to both prosecution and investigation powers. The President’s Guard, which is an omnipotent force that can take over even cases handled by monopoly prosecutors, is absolutely opposed) | 904
일선 (Frontline) | 나도 검사였다면 범죄 소탕위해 일선에서 목숨다해 일했을텐데 ㅠㅠ 검사님들 좀만 더 힘내주세요 국민이 함께합니다 (If I were a prosecutor, I would have worked my ass off on the front lines to fight crime, too.) | 2,905
검찰 (Prosecutor) | 검찰 응원합니다 (Cheers to the prosecution) | 25,129
수사력 (Investigative power) | 검찰개혁의 목적이 수사권 독립으로 산 권력도 수사해라 인데 지금의 행태는 산 권력 수사하면 아웃 인사권 횡포로 검찰수사력 과도한 억제 내지는 사법질서 파괴 (The purpose of the prosecution reform is to investigate the living power with the independence of the investigation power, but the current behavior is that if the living power is investigated, it is out; personnel power abuse excessively suppresses the prosecution’s investigative power and destroys the judicial order.) | 163
치안 (Policing) | 아직도 *** 씨가 민주주의 대통령이라 생각하시나요 이자는 *** 와 더불어 민주주의를 박살낸자 *** 보다 못한 최악의 공직자로 기록될 겁니다 *** 이 *** 보다 나은점은 올바른 인사 범죄와의 전쟁 경제활성화 등 지금과 비교하면 한 20 년 집권했다면 하는 생각이 듭니다 누구나 서울에 노력하면 집한채 살수 있었고 도 없던시절 이었지만 치안안전했고 문재인씨같은 범죄자는 사형시킬수 있었던 그런시절 이었습니다 군사쿠데타가 일어나길 바라는건 민주화이후 처음입니다 (Do you still think *** is a democratic president? He will be recorded as the worst public official, worse than *** and the man who destroyed democracy, ***. The only thing *** did better than *** was to fight the right personnel crimes and revitalize the economy. I think if he had been in power for 20 years compared to now, anyone could buy a house in Seoul if they tried, and it was a time when there were no roads, but security was safe and criminals like *** could be executed. It is the first time since democratization that I want a military coup to happen) | 2,621
자치경찰 (Autonomous police) | 자치경찰이라니 지금도 지역 유지랑 짝짜꿍해서 전라도 염전노예는 탈출해도 경찰이잡아다 주인에게 바치는 수준인데자치경찰로 바뀌면 지역 유지는 거의 지역에서 신으로 군림할듯도대체 미국처럼 땅이 워낙 커서 자치제를 할 수 밖에는 없는 경우도 아니고손바닥만한 나라에서 무슨 지역을 그리 따지는지 암튼 자치경찰 하면 지역 토호의 온갖 범죄와 갑질을 막을 방법은 없을듯 (It’s an autonomous police force, so even if a Jeolla Province salt slave escapes, the police catch it and give it to the owner, but if it changes to an autonomous police force, the local maintenance will almost reign as a god in the region. Also, unlike the United States, where the land is so large that it can only be self-governed, what region is so important in a country the size of a palm? Anyway, if you do an autonomous police force, there is no way to prevent all kinds of crimes and robberies by the local toho.) | 2,646

References

  1. Jo, EK (2012). The current state of affairs of the sentiment analysis and case study based on corpus. The Journal of Linguistic Science. 2012, 259-282.
  2. Gai, XT (). Analysis of microblog public opinion based on BERT model, 1-6. https://doi.org/10.1145/3529299.3531496
  3. Cho, K, Van Merrienboer, B, Gulcehre, C, Bahdanau, D, Bougares, F, Schwenk, H, and Bengio, Y. (2014) . Learning phrase representations using RNN encoder-decoder for statistical machine translation. Available: https://arxiv.org/abs/1406.1078
  4. Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, AN, Kaiser, L, and Polosukhin, I (2017). Attention is all you need. Advances in Neural Information Processing Systems. 30, 5998-6008.
  5. Fernandes, J, Giurcanu, M, Bowers, KW, and Neely, JC (2010). The writing on the wall: a content analysis of college students’ Facebook groups for the 2008 presidential election. Mass Communication and Society. 13, 653-675. https://doi.org/10.1080/15205436.2010.516865
  6. Marcoccia, M (2004). On-line polylogues: conversation structure and participation framework in internet newsgroups. Journal of Pragmatics. 36, 115-145. https://doi.org/10.1016/S0378-2166(03)00038-9
  7. Noelle-Neumann, E (1974). The spiral of silence a theory of public opinion. Journal of Communication. 24, 43-51. https://doi.org/10.1111/j.1460-2466.1974.tb00367.x
  8. Tsagkias, M, Weerkamp, W, and De Rijke, M (2010). News comments: exploring, modeling, and online prediction. Advances in Information Retrieval. Heidelberg, Germany: Springer, pp. 191-203 https://doi.org/10.1007/978-3-642-12275-0_19
  9. Preiss, J, Stevenson, M, and Gaizauskas, R (2015). Exploring relation types for literature-based discovery. Journal of the American Medical Informatics Association. 22, 987-992. https://doi.org/10.1093/jamia/ocv002
  10. Chakraborty, G, Pagolu, M, and Garla, S (2014). Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. Cary, NC: SAS Institute
  11. Lee, G, Jeong, J, Seo, S, Kim, C, and Kang, P. (2017) . Sentiment classification with word attention based on weakly supervised learning with a convolutional neural network. Available: https://arxiv.org/abs/1709.09885
  12. Kristiyanti, DA, Umam, AH, Wahyudi, M, Amin, R, and Marlinda, L . Comparison of SVM & naïve Bayes algorithm for sentiment analysis toward west java governor candidate period 2018–2023 based on public opinion on twitter., Proceedings of 2018 6th International Conference on Cyber and IT Service Management (CITSM), 2018, Parapat, Indonesia, Array, pp.1-6. https://doi.org/10.1109/CITSM.2018.8674352
  13. Haihong, E, Hu, Y, Peng, H, Zhao, W, Xiao, S, and Niu, P (2019). Theme and sentiment analysis model of public opinion dissemination based on generative adversarial network. Chaos, Solitons & Fractals. 121, 160-167. https://doi.org/10.1016/j.chaos.2018.11.036
  14. Li, S, Liu, Z, and Li, Y (2020). Temporal and spatial evolution of online public sentiment on emergencies. Information Processing & Management. 57, article no. 102177.
  15. Kim, D, and Kang, P (2022). Cross-modal distillation with audio– text fusion for fine-grained emotion classification using BERT and Wav2vec 2.0. Neurocomputing. 506, 168-183. https://doi.org/10.1016/j.neucom.2022.07.035
  16. Yang, Z, Yang, D, Dyer, C, He, X, Smola, A, and Hovy, E . Hierarchical attention networks for document classification., Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, San Diego, CA, pp.1480-1489.
  17. Vinyals, O, Kaiser, L, Koo, T, Petrov, S, Sutskever, I, and Hinton, G (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems. 28, 2773-2781.
  18. Bahdanau, D, Cho, K, and Bengio, Y. (2014) . Neural machine translation by jointly learning to align and translate. Available: https://arxiv.org/abs/1409.0473
  19. Luong, MT, Pham, H, and Manning, CD. (2015) . Effective approaches to attention-based neural machine translation. Available: https://arxiv.org/abs/1508.04025
  20. Bania, RK (2020). COVID-19 public tweets sentiment analysis using TF-IDF and inductive learning models. INFOCOMP Journal of Computer Science. 19, 23-41.
  21. Rachman, FH . Twitter sentiment analysis of Covid-19 using term weighting TF-IDF and logistic regression., Proceedings of 2020 6th Information Technology International Seminar (ITIS), 2020, Surabaya, Indonesia, Array, pp.238-242. https://doi.org/10.1109/ITIS50118.2020.9320958
  22. Antonio, VD, Efendi, S, and Mawengkang, H (2022). Sentiment analysis for covid-19 in Indonesia on Twitter with TF-IDF featured extraction and stochastic gradient descent. International Journal of Nonlinear Analysis and Applications. 13, 1367-1373. https://doi.org/10.22075/ijnaa.2021.5735
  23. Devlin, J, Chang, MW, Lee, K, and Toutanova, K. (2018) . BERT: pre-training of deep bidirectional transformers for language understanding. Available: https://arxiv.org/abs/1810.04805
  24. Gao, Z, Feng, A, Song, X, and Wu, X (2019). Target-dependent sentiment classification with BERT. IEEE Access. 7, 154290-154299. https://doi.org/10.1109/ACCESS.2019.2946594
  25. Nugroho, KS, Sukmadewa, AY, and Yudistira, N . Large-scale news classification using BERT language model: Spark NLP approach., Proceedings of the 6th International Conference on Sustainable Information Engineering and Technology, 2021, Malang, Indonesia, Array, pp.240-246. https://doi.org/10.1145/3479645.3479658
  26. Adhikari, A, Ram, A, Tang, R, and Lin, J. (2019) . DocBERT: BERT for document classification. Available: https://arxiv.org/abs/1904.08398
  27. SK Telecom. (2021) . KoBERT: SKT open-source. Available: https://sktelecom.github.io/project/kobert/
  28. Mikolov, T, Chen, K, Corrado, G, and Dean, J. (2013) . Efficient estimation of word representations in vector space. Available: https://arxiv.org/abs/1301.3781
  29. Mikolov, T, Sutskever, I, Chen, K, Corrado, GS, and Dean, J (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. 26, 3111-3119.
  30. Rong, X. (2014) . word2vec parameter learning explained. Available: https://arxiv.org/abs/1411.2738
  31. Riva, M. (2023) . Word embeddings: CBOW vs Skip-Gram. Available: https://www.baeldung.com/cs/word-embeddings-cbow-vs-skip-gram
  32. SKTBrain. (2022) . KoBERT. Available: https://github.com/SKTBrain/KoBERT
  33. Park, J. (2020) . KoBERT-nsmc. Available: https://github.com/monologg/KoBERT-nsmc
  34. Valente, A, Tudisca, V, Pelliccia, A, Cerbara, L, and Caruso, MG (2023). Comparing liberal and conservative newspapers: diverging narratives in representing migrants?. Journal of Immigrant & Refugee Studies. 21, 411-427. https://doi.org/10.1080/15562948.2021.1985200
  35. Stafford, T. (2014) . Psychology: why bad news dominates the headlines. Available: https://www.bbc.com/future/article/20140728-why-is-all-the-news-bad
  36. Christopherson, KM (2007). The positive and negative implications of anonymity in Internet social interactions: “On the Internet, Nobody Knows You’re a Dog”. Computers in Human Behavior. 23, 3038-3056. https://doi.org/10.1016/j.chb.2006.09.001
  37. Lavin, M (2019). Analyzing documents with TF-IDF. Available: http://dx.doi.org/10.46430/phen0082