International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(3): 244-258
Published online September 25, 2023
https://doi.org/10.5391/IJFIS.2023.23.3.244
© The Korean Institute of Intelligent Systems
Changwon Baek1, Jiho Kang2, and SangSoo Choi1
1Technological Convergence Center, Korea Institute of Science and Technology (KIST), Seoul, Korea
2Institute of Engineering Research, Korea University, Seoul, Korea
Correspondence to: Changwon Baek (baekcw@kist.re.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Online news articles and comments play a vital role in shaping public opinion. Numerous studies have conducted online opinion analyses using these as raw data. Bidirectional encoder representations from transformers (BERT)-based sentiment analysis of public opinion has recently attracted significant attention. However, owing to its limited linguistic versatility and low accuracy in domains with insufficient training data, applying BERT to Korean is challenging. Conventional public opinion analysis focuses on term frequency; hence, low-frequency words are likely to be excluded because their importance is underestimated. This study aimed to address these issues and facilitate the analysis of public opinion in Korean news articles and comments. We propose a method for analyzing public opinion that uses word2vec to overcome the limits of word-frequency-centered analysis, in conjunction with KoBERT, a version of BERT optimized for the Korean language. Naver news articles and comments were analyzed using a sentiment classification model developed on the KoBERT framework. The experiment demonstrated a sentiment classification accuracy of over 90%, yielding faster and more precise results than conventional methods. Words with a low frequency of occurrence but high relevance can be identified using word2vec.
Keywords: KoBERT, Word2vec, Public opinion analysis, Sentiment classification
With the development of information and communication technology, online media has become an essential means of disseminating information. The digital divide has also narrowed, as anyone can easily access online news. Readers not only read news articles but also actively express their opinions by leaving comments. Such online comments enable the exchange of various viewpoints on an issue and have a lasting effect on public opinion. Public opinion formed by the exchange of ideas in online comments influences administrative, legislative, and judicial policy decisions in various ways, including through public petitions [1]. Therefore, many studies have been conducted to collect online comments and analyze public opinion. Recently, a neural network model using bidirectional encoder representations from transformers (BERT) was applied to these studies [2]. BERT is an English-language-based model that is not well suited to the Korean language [3, 4]. Therefore, applying BERT to Korean leads to inaccurate analysis results owing to errors in Korean morphological analysis and word embedding. In addition, previous studies have used several algorithms, including term frequency-inverse document frequency (TF-IDF), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), to analyze comments. Although TF-IDF is simple to implement and provides a relatively accurate analysis based on word frequency, it does not consider the order or context of words, rendering it difficult to fully understand their meaning. By contrast, LSA considers the semantic similarity between words and identifies meaning accurately. LDA analyzes the topic to which a comment belongs, enabling it to cover multiple topics. However, both LSA and LDA have high computational costs and are difficult to interpret. Word2vec calculates the semantic similarity between words by representing them as vectors and considering their order and context for a more accurate analysis.
This method is cost-effective and makes it less likely that important words are excluded because of their low frequency. It also helps ensure that a large volume of comments from a particular group is not mistaken for mainstream public opinion, which matters because policymakers make informed decisions based on public opinion [5–7]. Various research methods have attempted to address these issues, such as using supervised and unsupervised learning to remove duplicate comments or utilizing comment cluster analysis [8].
The main contributions of this paper can be summarized as follows:
• The objective of this study is to develop a specialized sentiment analyzer for online opinion analysis in Korean. To achieve this, we created an online public opinion sentiment analysis model using KoBERT, a pretrained deep learning model optimized for the Korean language. The model analyzes issues determined as positive or negative by the researcher. To overcome the linguistic limitations of BERT, we employed KoBERT, which is a transformer language model pretrained on a large Korean corpus.
• Online public opinion analysis is often limited by the use of word frequency analysis, which may exclude important but infrequently used opinion words. To overcome this limitation, we used a word embedding algorithm called word2vec. This algorithm considers the correlation between words by locating each word in a high-dimensional vector space and identifying words that are highly correlated with the target word. Using this algorithm, we aim to improve the accuracy and comprehensiveness of our online opinion analyses.
• Developing a highly accurate sentiment analyzer can significantly enhance the efficiency and speed of online public opinion analyses. Such analysis can offer valuable insights into shaping policies related to politics, economics, and social issues. Hence, policymakers and researchers who aim to make informed decisions must prioritize the development of an accurate sentiment analyzer.
Sections 2–5 of this paper present a literature review, data collection and analysis, experimental results, and conclusions.
Social media websites provide opportunities for active participation in democratic processes [5]. Online news is an example of a social media platform in which readers express their opinions and feelings through comments. News comments can be considered a form of multilateral dialogue [6]. Public opinion is formed by observing, hearing, and perceiving others’ opinions [7]. In other words, an online zone in which anyone can easily communicate with a diverse group of individuals and recognize, accept, and critique the viewpoints of others can be viewed as an ideal environment for forming public opinion. Therefore, news and comments are important data for public opinion analysis, and the number of studies on developing algorithms or tools to analyze them is increasing [8].
The extraction of useful information from textual data is known as text mining. It is an effective research method for deriving meaningful and valuable information and knowledge by discovering hidden relationships or patterns in data and analyzing the morphemes contained in unstructured text [9]. It can be used to identify the main latent content in a text [10]. Its applications include document classification, clustering, and summarization, as well as information extraction and sentiment analysis. Decision trees, recurrent neural networks (RNNs), and BERT are text analysis methods that employ supervised learning. Clustering, LDA, the dynamic multitopic model, and the expressed agenda model are examples of unsupervised learning methods. Recently, methodologies such as attentive long short-term memory (LSTM) and tensor fusion networks have been studied [11]. In particular, sentiment analysis is an important method for understanding online public opinion and has been used in numerous studies [12–15]. Natural language processing (NLP) applications include document classification [16], syntax analysis [17], and machine translation [18, 19], which focus on words or phrases. TF-IDF, typically used in text mining including online opinion analysis, is an algorithm that weights terms by word frequency and inverse document frequency [20, 21]. Because high-frequency words are given more weight, low-frequency words are often excluded from analysis. To overcome these limitations, several studies have improved algorithm accuracy using probabilistic methods [22].
Whereas existing deep learning methodologies initialize weights to random values when creating a model, pretraining leverages weights learned on other problems as initial values. A task that uses pretrained weights in this manner is called a downstream task. BERT improves downstream task performance by using language models for pretraining [23]. It has demonstrated excellent performance in sentence classification for various applications, including sentence sentiment analysis [24], major news topic classification [25], and document classification [26]. In this study, we used KoBERT, developed by a Korean company (SKT), to improve Google BERT’s Korean language performance. A case study demonstrated a 2.6% performance improvement compared with BERT [27]. Word2vec places words in a high-dimensional vector space, enabling the associations between words to be evaluated based on their proximity. It has two model architectures: continuous bag-of-words (CBOW) and skip-gram (SG). The CBOW model predicts a target word based on its surrounding words, whereas the SG model predicts the surrounding words based on a target word [28, 29].
This study followed the procedure depicted in Figure 1 for data collection, preprocessing, and analysis of online comments.
The data used in this study were collected from News Naver, a major Korean web portal. We also developed a Python-based web crawler program to collect and store the data. Using the BeautifulSoup library, we searched the Naver News webpage for “improvement of investigative authority.” After parsing the webpage with HTML and CSS parsers, the title, content, and comments were collected. As shown in Figures 2 and 3, the collected data are classified into news article and comment tables and stored in SQLite, a relational database.
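The parse-and-store step can be sketched with only the Python standard library. The markup, CSS class names, and table schemas below are simplified stand-ins for illustration, not Naver's actual page structure or the study's exact schema (the study used BeautifulSoup against the live pages):

```python
import sqlite3
from html.parser import HTMLParser

# Minimal stand-in for the BeautifulSoup parsing step: pull the article
# title and comments out of simplified, hypothetical markup.
class ArticleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = None
        self.comments = []
        self._field = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls == "title":
            self._field = "title"
        elif cls == "comment":
            self._field = "comment"

    def handle_data(self, data):
        if self._field == "title":
            self.title = data.strip()
        elif self._field == "comment":
            self.comments.append(data.strip())
        self._field = None

html = """<div class="title">Sample headline</div>
<p class="comment">First comment</p>
<p class="comment">Second comment</p>"""

parser = ArticleParser()
parser.feed(html)

# Store articles and comments in separate tables, mirroring Figures 2 and 3.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE comments (article_id INTEGER, body TEXT)")
aid = db.execute("INSERT INTO articles (title) VALUES (?)",
                 (parser.title,)).lastrowid
db.executemany("INSERT INTO comments VALUES (?, ?)",
               [(aid, c) for c in parser.comments])
db.commit()
```

Splitting articles and comments into separate tables keyed by article ID, as above, is what allows the later per-article sentiment and correlation analyses.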
The crawler’s unstructured text data must be segmented into morphemes to analyze the comments. However, morphological analysis presents significant challenges for Korean comments. First, adapting most morpheme-analysis libraries to Korean is challenging because they were developed for English. Second, many exceptions exist to the spacing rules of Hangul, including those set by the National Institute of Korean Language. Third, because endings and postpositions change, accurate analysis is impossible when text is split purely on spacing, since a space-delimited unit is not necessarily a word. To address these issues, this study analyzes morphemes using the “Mecab-ko” library of the “Eunjeon Hannip Project,” a Korean morpheme analyzer widely used in Korea.
For a computer to analyze text data, they must first be converted into mathematically processable forms through a process known as “embedding.” In this study, word2vec was used to embed the text data. word2vec, a library developed by Google, is an algorithm with the benefit of fast learning. It places highly related words in similar positions in the vector space based on the frequency with which they appear in similar contexts. The hyperparameters used to generate the word2vec model are listed in Table 1. Although both CBOW and SG are representative learning models, we used SG in this study because of its excellent performance in many cases [30, 31].
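To make the SG model concrete, the sketch below enumerates the (target, context) training pairs that skip-gram learns from for a given window size. This mirrors the window-based pair generation inside word2vec rather than any library's API; the tokens and window value are illustrative (the study used a window size of 3, per Table 1):

```python
def skipgram_pairs(tokens, window=3):
    """(target, context) pairs the skip-gram model is trained on:
    for each position, every word within `window` of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# With window=1 for brevity: each word predicts its immediate neighbors.
print(skipgram_pairs(["police", "investigation", "reform"], window=1))
# → [('police', 'investigation'), ('investigation', 'police'),
#    ('investigation', 'reform'), ('reform', 'investigation')]
```

Words that repeatedly appear as each other's contexts across many such pairs end up close together in the learned vector space, which is what the later similarity analysis exploits.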
Accurate results are difficult to obtain at the initial stage when determining a target word and extracting highly relevant surrounding words using word2vec. This is because the morpheme analyzer (Mecab-ko) does not by itself handle stop words, synonyms, compound nouns, and neologisms. Therefore, eliminating unnecessary josa (postpositions) and other stop words, correcting typos, and integrating synonyms are required. In addition, neologisms were added to the user dictionary of the morpheme analyzer. In this study, preprocessing was applied only to the surrounding words highly relevant to the target word, using a word2vec model trained on all comments. This reduces the time required for preprocessing by extracting and refining only the words surrounding the target word, thereby improving efficiency.
The word2vec-based final analysis model extracts keywords related to the preprocessed morpheme list associated with the target word. The extracted words were expressed in the vector space. The degree of association between a target word and a specific neighboring word was calculated according to the cosine similarity of the corresponding vectors. The cosine similarity is the dot product of the two vectors divided by the product of their magnitudes:

similarity(A, B) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) √(Σᵢ Bᵢ²)).

Here, A and B denote the embedding vectors of the target word and a neighboring word, respectively, and ‖·‖ denotes the vector magnitude.
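The cosine similarity between two embedding vectors can be computed directly; a minimal sketch:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal)
```

Because it depends only on the angle between vectors, cosine similarity is insensitive to vector magnitude, so frequent and rare words are compared on equal footing.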
We searched for the phrase “improvement of investigative authority” in Naver News from May 9, 2017, to June 6, 2022. A total of 39,697 news articles and 4,426,875 comments were collected using a crawler, including information such as “news title,” “news posting date,” “media company,” and “comments.” The number of monthly news articles and comments showed similar trends, as illustrated in Table 2 and Figure 4. The correlation coefficient between the two variables is 0.8634, indicating a strong positive correlation. This is consistent with other studies, indicating that news articles and comments have the same tendency and potential to shape online public opinion [6].
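The reported coefficient (0.8634) is the Pearson correlation between the monthly article counts and comment counts of Table 2. It can be computed as, for example (the series below is a toy illustration, not the paper's data):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly proportional monthly counts give a coefficient of 1.0.
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # → 1.0
```

A value near +1, as observed here, means months with many articles also tend to have many comments.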
A total of 233,352 datasets were used, of which 175,012 were used as the training set, and 58,340 were used as the test set. The optimal value to prevent overfitting during model training was calculated using a callback function called EarlyStopping. As shown in Figures 5 and 6, the optimal values of the epoch, val accuracy (validation set accuracy), and val loss (loss of validation set) are 4, 0.9051, and 0.2278, respectively. The performance of the proposed model is presented in Table 3. The model performed similarly to or slightly better than KoBERT on GitHub (0.9010, 0.8963) [32,33]. In this study, the KoBERT-based learning model classified emotions (positive or negative) from a total of 4,426,875 comments with more than 90% accuracy without the researcher’s intervention. This indicates that the emotions of online public opinion can be grasped quickly and accurately.
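The behavior of the EarlyStopping callback can be sketched framework-independently: stop training once validation loss stops improving and keep the best epoch's result. The patience value and loss series below are illustrative assumptions, not the study's exact configuration:

```python
def train_with_early_stopping(val_losses, patience=2):
    """Stop when validation loss has not improved for `patience` epochs;
    return the best epoch index and its loss (as with restore-best-weights)."""
    best_epoch, best_loss, wait = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, wait = epoch, loss, 0
        else:
            wait += 1
            if wait >= patience:
                break  # overfitting onset: validation loss keeps rising
    return best_epoch, best_loss

# Simulated per-epoch validation losses; the fourth epoch is the minimum.
print(train_with_early_stopping([0.60, 0.40, 0.30, 0.2278, 0.25, 0.26]))
# → (3, 0.2278)
```

This is the mechanism by which the optimal epoch (4) and validation loss (0.2278) reported above were selected without training to a fixed, possibly overfitted, epoch count.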
News articles containing the target word “Police” were extracted. Of the 39,697 articles in total, 6,465 satisfied this condition. After applying the KoBERT-based sentiment classification model to the police-related articles, 3,164 (49%) were classified as positive and 3,301 (51%) as negative. Thus, the numbers of positive and negative responses to articles on the “improvement of investigative authority” containing the word “police” are comparable, as illustrated in Figure 7.
After excluding comments that were deleted or removed for policy violations, 2,903,069 comments were analyzed. As illustrated in Figure 8, negative comments constituted an overwhelming majority, accounting for 86% of all comments analyzed, whereas positive comments constituted only 14%.
The sentiment analysis of news articles on the “improvement of investigative authority” was positive (49%) and negative (51%). Currently, news articles are written from conservative and liberal perspectives. Similar proportions of positive and negative sentiment suggest that conservative and liberal media outlets write similar numbers of articles [34]. However, when we analyzed the comments, we found that the majority were negative (19% positive and 81% negative). This is consistent with the argument that readers are more interested in negativity and that anonymity drives negative comments [35, 36].
The results of using word2vec to identify words closely related to “Police” in the content of the collected articles are presented in Tables 4 and 5 and Figure 9. Based on cosine similarity, the word “Police” correlates strongly with the following terms: “National Police Agency” (0.7511), “Police Commissioner” (0.7162), “Police Officers” (0.6993), “Investigation” (0.6845), and “Police Station” (0.6480). Notably, “Police Precinct” occurred only 590 times, a low frequency, yet ranked sixth with a cosine similarity of 0.6445. This distinguishes the current study’s approach to understanding public opinion from the term-frequency-based approach used in previous studies [37].
Table 6 shows the titles of the news articles of highest reader interest whose content includes the keywords in Table 4. These articles are considered to have a significant influence on shaping online public opinion. Naver readers express their interest in an article by selecting emoticons at the bottom of the news content. In this study, the number of emoticons was used as a measure of interest to identify the most popular articles. Consequently, important news articles that were previously evaluated subjectively by researchers can now be extracted using objective criteria.
With “Police” as the target word, word2vec was used to analyze highly relevant keywords. For each keyword, the comment containing that word with the most reader endorsements was extracted, as illustrated in Table 7.
Online opinion analysis has the advantage of being faster, closer to real time, and more cost-effective than conventional survey methods. The search term “improvement of investigative authority” was used to gather and analyze online news articles and comments. A crawler was used to collect unstructured data in text format, and a KoBERT neural network model was used for sentiment classification learning. The numbers of news articles and comments showed a strong positive correlation (0.8634), indicating that increases and decreases in online news articles tracked those in comments. In other words, news articles with high social interest were more likely to generate stronger online opinions as the number of comments increased. In this study, the accuracy of the text–sentiment classification model using KoBERT exceeded 90%. Sentiment classification was performed with the model on the 6,465 news articles that contained the word “Police” in their content. Consequently, 3,164 (49%) were positive and 3,301 (51%) were negative, with no significant difference between the two. After analyzing the comments, 118,362 (81%) were classified as negative and 27,351 (19%) as positive. The sentiment classifier for online public opinion is faster and more accurate than existing approaches that require the researcher’s direct classification, owing to the artificial intelligence model. The words “National Police Agency” (0.7511), “Police Commissioner” (0.7162), “Police Officers” (0.6993), “Investigation” (0.6845), and “Police Station” (0.6480) had the highest values when the word2vec algorithm was used to identify words highly associated with the term “Police” in the article content. Similarly, words highly related to “Police” were extracted from news comments, such as “Closing a case” (0.7657), “Police Officers” (0.7459), and “Police Precinct” (0.7058).
Related keywords with relatively low frequencies, which may have been excluded in previous studies that focused solely on term frequency, were also included. By using word2vec, we can identify words that may have gone unnoticed because of their low frequency of occurrence, even though they significantly affect online opinion formation. Consequently, online public opinion can be analyzed quickly and accurately by collecting and analyzing online news articles and comments using KoBERT and word2vec. This study aimed to develop a specialized sentiment analyzer for online opinion analysis in Korean using the word2vec algorithm. However, the accuracy of the model is unknown because of the lack of validation data during training. In addition, the word2vec model has limitations in sentence-level natural language processing because it does not consider sentence context. To overcome these limitations, a document-level embedding algorithm such as doc2vec should be combined with it.
Furthermore, future research should collect news articles and comments from more diverse portals in addition to Naver and analyze the differences between portal sites. This will lead to a more accurate understanding of the online public opinion.
No potential conflict of interest relevant to this article was reported.
Table 1. Word2vec hyperparameter.
Parameter | Value |
---|---|
Minimum frequency | 5 |
Layer size | 100 |
Learning rate | 0.15 |
Iteration | 100 |
Window size | 3 |
Sub-sampling | 10⁻² |
Negative sampling | 15 |
Table 2. Number of news articles and comments per month.
Year Month | Number of articles | Number of comments | Year Month | Number of articles | Number of comments |
---|---|---|---|---|---|
2017.05 | 979 | 77,490 | 2019.12 | 2,357 | 275,419 |
2017.06 | 349 | 22,979 | 2020.01 | 2,285 | 258,179 |
2017.07 | 1,028 | 57,043 | 2020.02 | 765 | 152,093 |
2017.08 | 611 | 22,361 | 2020.03 | 165 | 44,353 |
2017.09 | 299 | 14,271 | 2020.04 | 0 | 570 |
2017.10 | 361 | 23,407 | 2020.05 | 297 | 23,221 |
2017.11 | 290 | 12,689 | 2020.06 | 401 | 42,001 |
2017.12 | 237 | 8,248 | 2020.07 | 0 | 638 |
2018.01 | 806 | 69,150 | 2020.08 | 726 | 92,750 |
2018.02 | 161 | 15,865 | 2020.09 | 651 | 51,418 |
2018.03 | 797 | 71,034 | 2020.10 | 0 | 141 |
2018.04 | 404 | 50,118 | 2020.11 | 299 | 38,804 |
2018.05 | 243 | 11,142 | 2020.12 | 1,546 | 235,718 |
2018.06 | 1,651 | 59,189 | 2021.01 | 0 | 594 |
2018.07 | 408 | 11,988 | 2021.02 | 0 | 27 |
2018.08 | 183 | 9,788 | 2021.03 | 2,021 | 184,579 |
2018.09 | 100 | 6,035 | 2021.04 | 0 | 223 |
2018.10 | 243 | 7,662 | 2021.05 | 0 | 28 |
2018.11 | 462 | 31,200 | 2021.06 | 732 | 52,591 |
2018.12 | 288 | 44,974 | 2021.07 | 0 | 131 |
2019.01 | 439 | 26,601 | 2021.08 | 0 | 22 |
2019.02 | 749 | 69,626 | 2021.09 | 240 | 12,317 |
2019.03 | 1,827 | 229,259 | 2021.10 | 0 | 149 |
2019.04 | 1,908 | 239,146 | 2021.11 | 0 | 16 |
2019.05 | 3,664 | 288,550 | 2021.12 | 207 | 12,855 |
2019.06 | 1,123 | 104,195 | 2022.01 | 0 | 781 |
2019.07 | 1,219 | 80,566 | 2022.02 | 0 | 19 |
2019.08 | 739 | 162,134 | 2022.03 | 548 | 77,496 |
2019.09 | 1,317 | 376,174 | 2022.04 | 0 | 380 |
2019.10 | 2,827 | 567,929 | 2022.05 | 0 | 28 |
2019.11 | 658 | 81,329 | 2022.06 | 87 | 17,192 |
Table 3. Model performance.
Metric | Result value |
---|---|
Accuracy | 0.9853 |
Loss | 0.0474 |
Validation accuracy | 0.9051 |
Validation loss | 0.2278 |
Recall | 0.8529 |
Precision | 0.8981 |
F1 | 0.8750 |
Table 4. News article word similarity and frequencies.
Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency |
---|---|---|---|---|---|---|
1 | 경찰청 | 0.7511 | 2.0024 | 16.0236 | 3.9065 | 18,251 |
2 | 청장 | 0.7162 | 1.9857 | 16.2410 | 3.9301 | 24,731 |
3 | 경찰관 | 0.6993 | 2.4249 | 20.2057 | 4.7727 | 6,307 |
4 | 수사 | 0.6845 | 1.8044 | 13.9328 | 3.5361 | 317,777 |
5 | 경찰서 | 0.6480 | 2.6114 | 20.7929 | 5.0790 | 3,442 |
6 | 지구대 | 0.6445 | 3.3773 | 27.9833 | 6.6381 | 590 |
7 | 자치경찰 | 0.6413 | 2.6127 | 21.1340 | 5.1316 | 22,247 |
8 | 국수본 | 0.6225 | 2.6505 | 20.4286 | 5.2038 | 2,526 |
9 | 관서 | 0.6093 | 3.3562 | 27.0328 | 6.5523 | 373 |
10 | 치안 | 0.5937 | 3.0960 | 24.5606 | 6.0993 | 4,520 |
Table 5. News article word similarity (2-year cycle).
Rank | 2017–2018 | 2019–2020 | 2021–2022 | |||
---|---|---|---|---|---|---|
Word | Vector | Word | Vector | Word | Vector | |
1 | 청장 | 0.6871 | 경찰청 | 0.7754 | 종결 | 0.6683 |
2 | 수사 | 0.6764 | 자치경찰 | 0.7309 | 수사 | 0.6506 |
3 | 경찰청 | 0.6568 | 청장 | 0.7275 | 경찰청 | 0.6373 |
4 | 검찰 | 0.6449 | 경찰관 | 0.6738 | 경찰관 | 0.5923 |
5 | 경찰관 | 0.6135 | 수사 | 0.6729 | 청장 | 0.5692 |
6 | 경찰서 | 0.5890 | 국수본 | 0.6418 | 경찰대 | 0.5678 |
7 | 서장 | 0.5788 | 지구대 | 0.6412 | 치안 | 0.5616 |
8 | 이 ** | 0.5593 | 치안 | 0.6225 | 검경 | 0.5583 |
9 | 일선 | 0.5519 | 민** | 0.6211 | 자치경찰 | 0.5567 |
10 | 지구대 | 0.5461 | 경찰대 | 0.6158 | 경찰서 | 0.5518 |
Table 6. Important news article title.
Related word | News article title | Number of interests |
---|---|---|
경찰청 | 경찰청 전직원에 “검찰 조국수사 비판 與보고서 읽어라” | 22,191 |
청장 | 정권 수사한 ‘***참모들’ 모두 유배 보내버렸다 | 23,562 |
경찰관 | 대통령 비판 전단 돌리던 50 대 주부. . . 경찰 신분증 없다고 등뒤로 수갑 채워 | 52,028 |
수사 | 대통령 비판 전단 돌리던 50대 주부. . . 경찰 신분증 없다고 등뒤로 수갑 채워 | 52,028 |
경찰서 | 여당 윤석열 검찰과 제 2 의 전쟁 나섰다 | 8,385 |
지구대 | ‘**논란’ 들끓는데...* 대통령 연설문에 “공정과 정의” | 3,210 |
자치경찰 | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” | 19,865 |
국수본 | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” | 19,865 |
관서 | 관서 “檢개혁요구 커지는 현실 성찰. . . 수사관행 개혁돼야” ( 종합) | 4,903 |
치안 | *** 검찰총장의 응수 “검찰 개혁 반대한 적 없다” | 11,102 |
Table 7. News comments vector values and frequency.
Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency |
---|---|---|---|---|---|---|
1 | 종결 | 0.7657 | 2.0360 | 16.0372 | 4.2283 | 7,862 |
2 | 경찰관 | 0.7459 | 2.2646 | 17.5176 | 4.7085 | 2,926 |
3 | 지구대 | 0.7058 | 2.7853 | 21.7033 | 5.7627 | 784 |
4 | 파출소 | 0.6941 | 2.6570 | 21.9616 | 5.5045 | 798 |
5 | 시기상조 | 0.6814 | 2.4236 | 19.3799 | 5.0183 | 524 |
6 | 일선 | 0.6573 | 2.2796 | 17.9550 | 4.7403 | 1,664 |
7 | 검찰 | 0.6427 | 2.2701 | 17.6364 | 4.7098 | 466,911 |
8 | 수사력 | 0.6412 | 2.5455 | 20.8251 | 5.2700 | 542 |
9 | 치안 | 0.6378 | 2.7753 | 21.1061 | 5.7414 | 2,538 |
10 | 자치경찰 | 0.6261 | 2.7440 | 22.5492 | 5.6740 | 12,491 |
Table 8. News comments word similarity (2-year cycle).
Rank | 2017–2018 | 2019–2020 | 2021–2022 | |||
---|---|---|---|---|---|---|
Word | Vector | Word | Vector | Word | Vector | |
1 | 수사 | 0.7536 | 종결 | 0.7750 | 종결 | 0.7360 |
2 | 검찰 | 0.7414 | 경찰관 | 0.7692 | 수사 | 0.6835 |
3 | 지구대 | 0.6379 | 짭새 | 0.7287 | 검찰 | 0.6442 |
4 | 종결 | 0.6298 | 파출소 | 0.7273 | 역량 | 0.6210 |
5 | 일선 | 0.6279 | 지구대 | 0.7256 | 수사관 | 0.6040 |
6 | 경찰관 | 0.6098 | 경찰서 | 0.7007 | 검경 | 0.5911 |
7 | 독립 | 0.6036 | 경찰권 | 0.6846 | 현장 | 0.5909 |
8 | 기소 | 0.6003 | 자치경찰 | 0.6709 | 조정 | 0.5895 |
9 | 자치경찰 | 0.5978 | 치안 | 0.6618 | 치안 | 0.5824 |
10 | 검사 | 0.5966 | 시기상조 | 0.6504 | 경찰관 | 0.5811 |
Table 9. Top related words news comments.
Related word | Comment | Number of interests |
---|---|---|
종결 | 경찰이 1 차수사권을 가진 이상사법경찰의 그 많은 사건 수사중 경찰이임의로 덮어버려서 묻히는 사건들이 꽤 많아질겁니다 말로는 인권침해법령위반 관계인의 이의제기 등의 단서를 달아놨지만 저기 시골같은데서 경찰이 아무도 모르게 사건 덮어버리면 알 수가 없죠 검찰로 보내는송부자료도 조작해서 결제만 해 보내면 어찌 알겠어요예전에야 불기소 의견도 모두 검찰에 넘기고 다시 한번 조사를 받고 종결되서 그나마 위법 부당한 처리를 찾을 수 있었지 앞으로는 그것도 어려울듯 이건 좀더 통제가 필요한 부분임 | 1,314 |
경찰관 | 대한민국 피의자 인권은 괴할정도입니다 지금 중요한건 매맞고 힘빠진 경찰관의 법집행력과경찰관을 포함 공무원의 기본적 인권을 지켜줄 때입니다 | 1,441 |
지구대 | 역삼지구대 3 년 근무하면 강남에 30 평대 아파트 현금으로 산다면서요 사실인가요 | 626 |
파출소 | 문제는 과다한 업무에 있습니다실종사건이 하루에도 지구대나 파출소 별로 여러 건이 하달되는데 실종사건만 전담해서 일을 할 수 없고 다른 112 신고사건도 처리해 가면서 실종사건을 처리해야 하는데서 문제가 발생합니다결국은 경찰인원을일선 지구대나 여청계 형사계 등에 많이 배치해야 하지요 쓸데없이윗자리 확보하려는 경찰서만 자꾸 만들지 말고경찰서를 오히려 축소하여야 합니다 | 267 |
시기상조 | 수사와 기소 분리가검찰개혁의 요체인건 맞지 근데 경찰 아니 견찰 보니까 수사권 때어주는 것은 아직 시기상조다 싶다 직무능력도 그렇고 조직 구조상 정권에 딸랑거릴 수밖에없거든 공수처는단호히 반대한다 기소권 수사권 모두 독점 검찰이 맡은 사건도 빼앗을 수 있는 무소 불위의 대통령 친위대 절대 반대다 | 904 |
일선 | 나도 검사였다면 범죄 소탕위해 일선에서 목숨다해 일했을텐데 ㅠㅠ 검사님들 좀만 더 힘내주세요 국민이 함께합니다 | 2,905 |
검찰 | 검찰 응원합니다 | 25,129 |
수사력 | 검찰개혁의 목적이 수사권 독립으로 산 권력도 수사해라 인데 지금의 행태는 산 권력 수사하면 아웃 인사권 횡포로 검찰수사력 과도한 억제 내지는 사법질서 파괴 | 163 |
치안 | 아직도 *** 씨가 민주주의 대통령이라 생각하시나요 이자는 *** 와 더불어 민주주의를 박살낸자 *** 보다 못한 최악의 공직자로 기록될 겁니다 *** 이 *** 보다 나은점은 올바른 인사 범죄와의 전쟁 경제활성화 등 지금과 비교하면 한 20 년 집권했다면 하는 생각이 듭니다 누구나 서울에 노력하면 집한채 살수 있었고 도 없던시절 이었지만 치안안전했고 문재인씨같은 범죄자는 사형시킬수 있었던 그런시절 이었습니다 군사쿠데타가 일어나길 바라는건 민주화이후 처음입니다 | 2,621 |
자치경찰 | 자치경찰이라니 지금도 지역 유지랑 짝짜꿍해서 전라도 염전노예는 탈출해도 경찰이잡아다 주인에게 바치는 수준인데자치경찰로 바뀌면 지역 유지는 거의 지역에서 신으로 군림할듯도대체 미국처럼 땅이 워낙 커서 자치제를 할 수 밖에는 없는 경우도 아니고손바닥만한 나라에서 무슨 지역을 그리 따지는지 암튼 자치경찰 하면 지역 토호의 온갖 범죄와 갑질을 막을 방법은 없을듯 | 2,646 |
International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(3): 244-258
Published online September 25, 2023 https://doi.org/10.5391/IJFIS.2023.23.3.244
Copyright © The Korean Institute of Intelligent Systems.
Changwon Baek1, Jiho Kang2, and SangSoo Choi1
1Technological Convergence Center, Korea Institute of Science and Technology (KIST), Seoul, Korea
2Institute of Engineering Research, Korea University, Seoul, Korea
Correspondence to:Changwon Baek (baekcw@kist.re.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Online news articles and comments play a vital role in shaping public opinion. Numerous studies have conducted online opinion analyses using these as raw data. Bidirectional encoder representations from transformer (BERT)-based sentiment analysis of public opinion have recently attracted significant attention. However, owing to its limited linguistic versatility and low accuracy in domains with insufficient learning data, the application of BERT to Korean is challenging. Conventional public opinion analysis focuses on term frequency; hence, low-frequency words are likely to be excluded because their importance is underestimated. This study aimed to address these issues and facilitate the analysis of public opinion regarding Korean news articles and comments. We propose a method for analyzing public opinion using word2vec to increase the word-frequency-centered analytical limit in conjunction with KoBERT, which is optimized for Korean language by improving BERT. Naver news articles and comments were analyzed using a sentiment classification model developed on the KoBERT framework. The experiment demonstrated a sentiment classification accuracy of over 90%. Thus, it yields faster and more precise results than conventional methods. Words with a low frequency of occurrence, but high relevance, can be identified using word2vec.
Keywords: KoBERT, Word2vec, Public opinion analysis, Sentiment classification
With the development of information and communication technology, online media has become an essential means of disseminating information. The digital divide has also narrowed as anyone can easily access online news. Readers not only read news articles but also actively express their opinions by leaving comments. Such online comments enable the exchange of various viewpoints on an issue that has a lasting effect on public opinion. Public opinion formed by the exchange of ideas in online comments influence administrative, legislative, and judicial policy decisions in various ways, including through public petitions [1]. Therefore, many studies have been conducted to collect online comments and analyze public opinion. Recently, a neural network model using bidirectional encoder representations from transformers (BERT) was applied to these studies [2]. BERT is an English-language-based model that is not suitable for the Korean language [3, 4]. Therefore, applying BERT to Korean leads to inaccurate analysis results owing to errors in the Korean morphological analysis and word embedding. In addition, previous studies have used several algorithms, including term frequency-inverse document frequency (TF-IDF), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), to analyze comments. Although TF-IDF is simple to implement and provides a relatively accurate analysis based on word frequency, it does not consider the order or context of words, rendering it difficult to fully understand their meaning. By contrast, LSA considers the semantic similarity between words and accurately identifies meaning. LDA analyzes the topic to which a comment belongs, enabling it to cover multiple topics. However, both the LSA and LDA have high computational costs and are difficult to interpret. Word2vec calculates the semantic similarity between words by representing them as vectors and considering their order and context for a more accurate analysis. 
This method is also computationally cost-effective and makes it less likely for important words to be excluded because of their low frequency. It likewise helps ensure that a large volume of comments from a particular group is not mistaken for representative public opinion, so that informed decisions can be made based on actual public opinion [5–7]. Various research methods have attempted to address these issues, such as using supervised and unsupervised learning to remove duplicate comments or utilizing comment cluster analysis [8].
The main contributions of this paper can be summarized as follows:
• The objective of this study is to develop a specialized sentiment analyzer for online opinion analysis in Korean. To achieve this, we created an online public opinion sentiment analysis model using KoBERT, a transformer language model pretrained on a large Korean corpus that overcomes the linguistic limitations of BERT. The model classifies opinions on issues selected by the researcher as positive or negative.
• Online public opinion analysis is often limited by word-frequency analysis, which may exclude important but infrequently used opinion words. To overcome this limitation, we used the word embedding algorithm word2vec. This algorithm captures the correlation between words by locating each word in a multidimensional vector space and identifying the words most highly correlated with a target word. Using this algorithm, we aim to improve the accuracy and comprehensiveness of online opinion analysis.
• Developing a highly accurate sentiment analyzer can significantly enhance the efficiency and speed of online public opinion analyses. Such analysis can offer valuable insights into shaping policies related to politics, economics, and social issues. Hence, policymakers and researchers who aim to make informed decisions must prioritize the development of an accurate sentiment analyzer.
Sections 2–5 of this paper present a literature review, data collection and analysis, experimental results, and conclusions.
Social media websites provide opportunities for active participation in democratic processes [5]. Online news is an example of a social media platform in which readers express their opinions and feelings through comments. News comments can be considered a form of multilateral dialogue [6]. Public opinion is formed by observing, hearing, and perceiving others’ opinions [7]. In other words, an online zone in which anyone can easily communicate with a diverse group of individuals and recognize, accept, and critique the viewpoints of others can be viewed as an ideal environment for forming public opinion. Therefore, news and comments are important data for public opinion analysis, and the number of studies on developing algorithms or tools to analyze them is increasing [8].
The extraction of useful information from textual data is known as text mining. It is an effective research method for deriving meaningful and valuable information and knowledge by discovering hidden relationships or patterns in data and analyzing the morphemes included in unstructured text data [9]. It can be used to identify the main latent content of a text [10]. Its applications include document classification, clustering, and summarization, as well as information extraction and sentiment analysis. Decision trees, recurrent neural networks (RNNs), and BERT are text analysis methods that employ supervised learning. Clustering, LDA, the dynamic multitopic model (DMA), and the expressed agenda model are examples of unsupervised learning methods. Recently, methodologies such as attentive long short-term memory (LSTM) and tensor fusion networks have been studied [11]. In particular, sentiment analysis is an important method for understanding online public opinion and has been used in numerous studies [12–15]. Natural language processing (NLP) models are applied to tasks such as document classification [16], syntax analysis [17], and machine translation [18, 19], which focus on words or phrases. TF-IDF, which is typically used in text mining, including online opinion analysis, is an algorithm that combines word frequency with inverse document frequency [20, 21]. Because high-frequency words are given more weight, low-frequency words are often excluded from analysis. To overcome these limitations, several studies have attempted to improve the accuracy of such algorithms using probabilistic methods [22].
Although existing deep learning methodologies initialize weights to random values when creating a model, pretraining leverages weights obtained by learning on other problems as initial values. A problem that uses pretrained weights in this manner is called a downstream task. BERT improves downstream-task performance by using language models for pretraining [23]. It has demonstrated excellent performance in sentence classification for various applications, including sentence sentiment analysis [24], major news topic classification [25], and document classification [26]. In this study, we used KoBERT, developed by the Korean company SKT to improve Google BERT’s Korean language performance. A case study demonstrated a 2.6% performance improvement compared to BERT [27]. Word2vec, in turn, is a word embedding algorithm that places words in a multidimensional vector space, enabling us to evaluate the associations between different words based on their proximity. It has two model architectures: continuous bag-of-words (CBOW) and skip-gram (SG). The CBOW model predicts a target word based on its surrounding words, whereas the SG model predicts the surrounding words based on a target word [28, 29].
This study followed the procedure depicted in Figure 1 for data collection, preprocessing, and analysis of online comments.
The data used in this study were collected from Naver News, a major Korean web portal. We developed a Python-based web crawler program to collect and store the data. Using the BeautifulSoup library, we searched the Naver News webpage for “improvement of investigative authority.” After parsing the webpage with HTML and CSS parsers, the title, content, and comments were collected. As shown in Figures 2 and 3, the collected data are classified into news article and comment tables and stored in SQLite, a relational database.
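The storage step can be illustrated with a minimal stand-alone sketch using Python’s built-in sqlite3 module; the table and column names below are illustrative assumptions, not the paper’s exact schema.

```python
import sqlite3

# In-memory database with one table per entity, mirroring the
# article/comment split shown in Figures 2 and 3 (schema is illustrative).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE news_article (
    article_id INTEGER PRIMARY KEY,
    title TEXT, content TEXT, press TEXT, posted_at TEXT)""")
cur.execute("""CREATE TABLE news_comment (
    comment_id INTEGER PRIMARY KEY,
    article_id INTEGER REFERENCES news_article(article_id),
    body TEXT, likes INTEGER)""")

# A crawler would insert one article row, then its comment rows.
cur.execute("INSERT INTO news_article VALUES (1, 'title', 'content', 'press', '2022-06-06')")
cur.execute("INSERT INTO news_comment VALUES (1, 1, 'a comment', 0)")
conn.commit()
count = cur.execute("SELECT COUNT(*) FROM news_comment").fetchone()[0]
```

Keeping comments in a separate table with a foreign key to the article allows the per-article and per-month aggregations reported later to be computed with simple SQL joins.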
The unstructured text data gathered by the crawler must be segmented into morphemes before the comments can be analyzed. However, morphological analysis presents significant challenges for Korean comments. First, most morpheme analysis libraries were developed for English and are difficult to adapt to Korean. Second, many exceptions exist to the spacing rules of Hangul, including those set by the National Institute of Korean Language. Third, because endings and postpositions change, accurate analysis is impossible when text is simply split on spacing, since a space-delimited unit is not necessarily a word. To address these issues, this study analyzes morphemes using the Mecab-ko library of the “Eunjeon Hannip Project,” a Korean morphological analyzer widely used in Korea.
For a computer to analyze text data, the data must first be converted into a mathematically processable form through a process known as “embedding.” In this study, word2vec was used to embed the text data. word2vec, a library developed by Google, is an algorithm with the benefit of a fast learning speed. It places highly related words at similar positions in the vector space based on the frequency with which they appear in similar contexts. The hyperparameters required to generate the word2vec model are listed in Table 1. Although both CBOW and SG are representative learning models, we used SG in this study because of its excellent performance in many cases [30, 31].
When attempting to determine a target word and extract highly relevant surrounding words using word2vec, accurate results are difficult to obtain at the initial stage. This is because the morpheme analyzer (Mecab-ko) does not, by itself, appropriately preprocess stop words, synonyms, compound nouns, and neologisms. Therefore, it is necessary to eliminate unnecessary “josa” (postpositional particles) and other stop words, correct typos, and integrate synonyms. In addition, neologisms were added to the user dictionary of the morpheme analyzer. In this study, preprocessing was performed only on the surrounding words highly relevant to the target word, using the word2vec model trained on all comments. This reduces the time required for preprocessing by extracting and refining only the words surrounding the target word, thereby improving efficiency.
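The token-refinement step (stop-word removal and synonym integration) can be sketched as follows; the word lists here are illustrative assumptions, not the study’s actual dictionaries.

```python
# Illustrative stop-word list of josa (postpositional particles).
STOPWORDS = {"이", "가", "은", "는", "을", "를"}

# Illustrative synonym map: variant forms are merged into one canonical form.
SYNONYMS = {"경찰들": "경찰"}

def refine(tokens):
    """Drop stop words and map each remaining token to its canonical form."""
    out = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        out.append(SYNONYMS.get(tok, tok))
    return out
```

Applying this only to the neighbors returned for the target word, rather than to the full 4.4-million-comment corpus, is what keeps the preprocessing cost low.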
The word2vec-based final analysis model extracts keywords related to the target word from the preprocessed morpheme list. The extracted words are expressed in the vector space, and the degree of association between a target word and a specific neighboring word is calculated according to the cosine similarity of the corresponding vectors. The cosine similarity is the dot product of two vectors divided by the product of their magnitudes:

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}.$$

Here, $A$ and $B$ denote the embedding vectors of the target word and the neighboring word, respectively, and $\theta$ is the angle between them.
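The cosine similarity computation can be expressed in a few lines of stand-alone Python, independent of any particular embedding library:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical directions give 1.0 and orthogonal directions give 0.0, which is why values such as 0.75 in Table 4 indicate strongly related words.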
We searched for the phrase “improvement of investigative authority” in Naver News from May 9, 2017, to June 6, 2022. A total of 39,697 news articles and 4,426,875 comments were collected using a crawler, including information such as “news title,” “news posting date,” “media company,” and “comments.” The number of monthly news articles and comments showed similar trends, as illustrated in Table 2 and Figure 4. The correlation coefficient between the two variables is 0.8634, indicating a strong positive correlation. This is consistent with other studies, indicating that news articles and comments have the same tendency and potential to shape online public opinion [6].
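The reported correlation between monthly article and comment counts can be reproduced from Table 2 with a standard Pearson computation; a minimal stand-alone sketch:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

Feeding the 62 monthly (articles, comments) pairs from Table 2 into `pearson` yields the coefficient of 0.8634 reported above.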
A total of 233,352 datasets were used, of which 175,012 were used as the training set and 58,340 as the test set. The optimal stopping point to prevent overfitting during model training was determined using the EarlyStopping callback function. As shown in Figures 5 and 6, the optimal epoch, validation accuracy, and validation loss are 4, 0.9051, and 0.2278, respectively. The performance of the proposed model is presented in Table 3. The model performed similarly to or slightly better than the KoBERT baselines on GitHub (0.9010, 0.8963) [32, 33]. In this study, the KoBERT-based learning model classified the sentiment (positive or negative) of a total of 4,426,875 comments with more than 90% accuracy without the researcher’s intervention. This indicates that the sentiment of online public opinion can be grasped quickly and accurately.
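The early-stopping logic can be sketched as a minimal stand-alone class (the study used a framework callback; the loss values below are illustrative except for the reported optimum of 0.2278 at epoch 4):

```python
class EarlyStopping:
    """Stop training when validation loss stops improving (minimal sketch)."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.best_epoch, self.wait = float("inf"), 0, 0

    def step(self, epoch, val_loss):
        # Return True when training should stop.
        if val_loss < self.best - self.min_delta:
            self.best, self.best_epoch, self.wait = val_loss, epoch, 0
            return False
        self.wait += 1
        return self.wait >= self.patience

# Illustrative per-epoch validation losses; the minimum matches the paper.
losses = [0.5, 0.4, 0.3, 0.2278, 0.24, 0.26]
es = EarlyStopping(patience=2)
for epoch, loss in enumerate(losses, start=1):
    if es.step(epoch, loss):
        break
```

With these losses the monitor stops after two non-improving epochs and records epoch 4 as the optimum, mirroring the result in Figures 5 and 6.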
News articles containing the target word “Police” were extracted. Of the 39,697 articles in total, 6,465 satisfied this condition. After applying the KoBERT-based sentiment classification model to the police-related articles, 3,164 (49%) were classified as positive and 3,301 (51%) as negative. Thus, the numbers of positive and negative responses to articles on the “improvement of investigative authority” containing the word “police” are comparable, as illustrated in Figure 7.
For the sentiment analysis of comments, 2,903,069 comments remained after preprocessing to exclude deleted comments and those removed for policy violations. As illustrated in Figure 8, negative comments constituted an overwhelming majority, accounting for 86% of all comments analyzed, whereas positive comments constituted only 14%.
The sentiment of news articles on the “improvement of investigative authority” was nearly balanced: 49% positive and 51% negative. News articles are written from both conservative and liberal perspectives, and the similar proportions of positive and negative sentiment suggest that conservative and liberal media outlets published similar numbers of articles [34]. However, when we analyzed the comments, we found that the majority were negative (19% positive and 81% negative). This is consistent with the argument that readers are more interested in negativity and that anonymity drives negative comments [35, 36].
The results of using word2vec to identify words closely related to “Police” in the content of the collected articles are presented in Tables 4 and 5 and Figure 9. Based on cosine similarity, the word “Police” correlates most strongly with “National Police Agency” (0.7511), “Police Commissioner” (0.7162), “Police Officers” (0.6993), “Investigation” (0.6845), and “Police Station” (0.6480). Notably, “Police Precinct” occurred only 590 times, yet its similarity of 0.6445 placed it sixth. This distinguishes the current study’s approach to understanding public opinion from the term-frequency-based approaches used in previous studies [37].
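The neighbor-ranking step can be illustrated with a small stand-alone sketch: neighbors of a target word are ranked purely by cosine similarity, so a rare word such as “Police Precinct” can still rank high. The vectors below are toy values, not those of the trained model.

```python
from math import sqrt

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy 3-word vocabulary; real vectors come from the trained word2vec model.
vectors = {
    "Police":          [0.9, 0.1, 0.2],
    "Police Precinct": [0.8, 0.2, 0.3],  # rare in the corpus, close in meaning
    "Weather":         [0.1, 0.9, 0.1],  # frequent but unrelated
}

target = vectors["Police"]
ranked = sorted(
    ((w, cos(target, v)) for w, v in vectors.items() if w != "Police"),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because frequency never enters the ranking, the low-frequency but semantically close word sorts above the unrelated one, which is exactly how “Police Precinct” reaches rank 6 in Table 4 despite only 590 occurrences.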
Table 6 lists the titles of the news articles, among those containing the keywords in Table 4, that attracted the greatest reader interest. These articles are considered to have a significant influence on shaping online public opinion. Naver readers express their interest in an article by selecting emoticons at the bottom of the news content. In this study, the number of emoticons was used as a measure of interest to identify the most popular articles. Consequently, important news articles that were previously evaluated subjectively by researchers can now be extracted using objective criteria.
With “Police” as the target word, word2vec was used to analyze highly relevant keywords. For each keyword, the comments containing that word with the most sympathy votes were extracted, as illustrated in Table 7.
Online opinion analysis has the advantage of being faster, closer to real time, and more cost-effective than conventional survey methods. The search term “improvement of investigative authority” was used to gather and analyze online news articles and comments. A crawler was used to collect unstructured data in text format, and a KoBERT neural network model was trained for sentiment classification. The numbers of news articles and comments showed a strong positive correlation (0.8634), indicating that increases and decreases in online news articles were mirrored by comments. In other words, news articles with high social interest were more likely to generate stronger online opinions as the number of comments increased. In this study, the accuracy of the text sentiment classification model using KoBERT exceeded 90%. Sentiment classification was performed using the model on the 6,465 news articles that contained the word “Police” in their content: 3,164 (49%) were positive and 3,301 (51%) were negative, with no significant difference between the two. Among the analyzed comments, 118,362 (81%) were classified as negative and 27,351 (19%) as positive. Owing to the artificial intelligence model, this sentiment classifier for online public opinion is faster and more accurate than existing approaches that require the researcher’s direct classification. The words “National Police Agency” (0.7511), “Police Commissioner” (0.7162), “Police Officers” (0.6993), “Investigation” (0.6845), and “Police Station” (0.6480) had the highest values when the word2vec algorithm was used to identify words highly associated with the term “Police” in the article content. Similarly, words highly related to “Police” were extracted from news comments, such as “Closing a case” (0.7657), “Police Officers” (0.7459), and “Police Precinct” (0.7058).
Related keywords with relatively low frequencies, which may have been excluded in previous studies that focused solely on term frequencies, were also included. By using word2vec, we can identify words that may have been overlooked because of their low frequency of occurrence, even though they have a significant impact on online opinion formation. Consequently, online public opinion can be analyzed quickly and accurately by collecting and analyzing online news articles and comments using KoBERT and word2vec. This study aimed to develop a specialized sentiment analyzer for online opinion analysis in Korean using the word2vec algorithm. However, the accuracy of the model is difficult to verify fully because of the lack of separate validation data during training. In addition, the word2vec model has limitations in sentence-level natural language processing because it does not consider the context of a whole sentence. To overcome these limitations, document-level embedding algorithms such as doc2vec should be combined with the present approach.
Furthermore, future research should collect news articles and comments from more diverse portals in addition to Naver and analyze the differences between portal sites. This will lead to a more accurate understanding of the online public opinion.
Data collection and analysis process.
News article table.
News comment table.
Number of news articles and comments per month.
Model training result.
Model training accuracy and loss.
News article content sentiment analysis.
News comment sentiment analysis.
News article content WordCloud.
News comments WordCloud.
Table 1. Word2vec hyperparameters.
Parameter | Value |
---|---|
Minimum frequency | 5 |
Layer size | 100 |
Learning rate | 0.15 |
Iteration | 100 |
Window size | 3 |
Sub-sampling | 10^−2 |
Negative sampling | 15 |
Table 2. Number of news articles and comments per month.
Year Month | Number of articles | Number of comments | Year Month | Number of articles | Number of comments |
---|---|---|---|---|---|
2017.05 | 979 | 77,490 | 2019.12 | 2,357 | 275,419 |
2017.06 | 349 | 22,979 | 2020.01 | 2,285 | 258,179 |
2017.07 | 1,028 | 57,043 | 2020.02 | 765 | 152,093 |
2017.08 | 611 | 22,361 | 2020.03 | 165 | 44,353 |
2017.09 | 299 | 14,271 | 2020.04 | 0 | 570 |
2017.10 | 361 | 23,407 | 2020.05 | 297 | 23,221 |
2017.11 | 290 | 12,689 | 2020.06 | 401 | 42,001 |
2017.12 | 237 | 8,248 | 2020.07 | 0 | 638 |
2018.01 | 806 | 69,150 | 2020.08 | 726 | 92,750 |
2018.02 | 161 | 15,865 | 2020.09 | 651 | 51,418 |
2018.03 | 797 | 71,034 | 2020.10 | 0 | 141 |
2018.04 | 404 | 50,118 | 2020.11 | 299 | 38,804 |
2018.05 | 243 | 11,142 | 2020.12 | 1,546 | 235,718 |
2018.06 | 1,651 | 59,189 | 2021.01 | 0 | 594 |
2018.07 | 408 | 11,988 | 2021.02 | 0 | 27 |
2018.08 | 183 | 9,788 | 2021.03 | 2,021 | 184,579 |
2018.09 | 100 | 6,035 | 2021.04 | 0 | 223 |
2018.10 | 243 | 7,662 | 2021.05 | 0 | 28 |
2018.11 | 462 | 31,200 | 2021.06 | 732 | 52,591 |
2018.12 | 288 | 44,974 | 2021.07 | 0 | 131 |
2019.01 | 439 | 26,601 | 2021.08 | 0 | 22 |
2019.02 | 749 | 69,626 | 2021.09 | 240 | 12,317 |
2019.03 | 1,827 | 229,259 | 2021.10 | 0 | 149 |
2019.04 | 1,908 | 239,146 | 2021.11 | 0 | 16 |
2019.05 | 3,664 | 288,550 | 2021.12 | 207 | 12,855 |
2019.06 | 1,123 | 104,195 | 2022.01 | 0 | 781 |
2019.07 | 1,219 | 80,566 | 2022.02 | 0 | 19 |
2019.08 | 739 | 162,134 | 2022.03 | 548 | 77,496 |
2019.09 | 1,317 | 376,174 | 2022.04 | 0 | 380 |
2019.10 | 2,827 | 567,929 | 2022.05 | 0 | 28 |
2019.11 | 658 | 81,329 | 2022.06 | 87 | 17,192 |
Table 3. Model performance.
Metric | Value |
---|---|
Accuracy | 0.9853 |
Loss | 0.0474 |
Validation accuracy | 0.9051 |
Validation loss | 0.2278 |
Recall | 0.8529 |
Precision | 0.8981 |
F1 | 0.8750 |
Table 4. News article word similarities and frequencies.
Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency |
---|---|---|---|---|---|---|
1 | 경찰청 | 0.7511 | 2.0024 | 16.0236 | 3.9065 | 18,251 |
2 | 청장 | 0.7162 | 1.9857 | 16.2410 | 3.9301 | 24,731 |
3 | 경찰관 | 0.6993 | 2.4249 | 20.2057 | 4.7727 | 6,307 |
4 | 수사 | 0.6845 | 1.8044 | 13.9328 | 3.5361 | 317,777 |
5 | 경찰서 | 0.6480 | 2.6114 | 20.7929 | 5.0790 | 3,442 |
6 | 지구대 | 0.6445 | 3.3773 | 27.9833 | 6.6381 | 590 |
7 | 자치경찰 | 0.6413 | 2.6127 | 21.1340 | 5.1316 | 22,247 |
8 | 국수본 | 0.6225 | 2.6505 | 20.4286 | 5.2038 | 2,526 |
9 | 관서 | 0.6093 | 3.3562 | 27.0328 | 6.5523 | 373 |
10 | 치안 | 0.5937 | 3.0960 | 24.5606 | 6.0993 | 4,520 |
Table 5. News article word similarity (2-year cycle).
Rank | 2017–2018 | 2019–2020 | 2021–2022 | |||
---|---|---|---|---|---|---|
Word | Vector | Word | Vector | Word | Vector | |
1 | 청장 | 0.6871 | 경찰청 | 0.7754 | 종결 | 0.6683 |
2 | 수사 | 0.6764 | 자치경찰 | 0.7309 | 수사 | 0.6506 |
3 | 경찰청 | 0.6568 | 청장 | 0.7275 | 경찰청 | 0.6373 |
4 | 검찰 | 0.6449 | 경찰관 | 0.6738 | 경찰관 | 0.5923 |
5 | 경찰관 | 0.6135 | 수사 | 0.6729 | 청장 | 0.5692 |
6 | 경찰서 | 0.5890 | 국수본 | 0.6418 | 경찰대 | 0.5678 |
7 | 서장 | 0.5788 | 지구대 | 0.6412 | 치안 | 0.5616 |
8 | 이 ** | 0.5593 | 치안 | 0.6225 | 검경 | 0.5583 |
9 | 일선 | 0.5519 | 민** | 0.6211 | 자치경찰 | 0.5567 |
10 | 지구대 | 0.5461 | 경찰대 | 0.6158 | 경찰서 | 0.5518 |
Table 6. Important news article titles.
Related word | News article title | Number of interests |
---|---|---|
경찰청 | 경찰청 전직원에 “검찰 조국수사 비판 與보고서 읽어라” | 22,191 |
청장 | 정권 수사한 ‘***참모들’ 모두 유배 보내버렸다 | 23,562 |
경찰관 | 대통령 비판 전단 돌리던 50 대 주부. . . 경찰 신분증 없다고 등뒤로 수갑 채워 | 52,028 |
수사 | 대통령 비판 전단 돌리던 50대 주부. . . 경찰 신분증 없다고 등뒤로 수갑 채워 | 52,028 |
경찰서 | 여당 윤석열 검찰과 제 2 의 전쟁 나섰다 | 8,385 |
지구대 | ‘**논란’ 들끓는데...* 대통령 연설문에 “공정과 정의” | 3,210 |
자치경찰 | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” | 19,865 |
국수본 | ** “***정부는 대한민국을 선진국 대열에 진입시킨 정부” | 19,865 |
관서 | 관서 “檢개혁요구 커지는 현실 성찰. . . 수사관행 개혁돼야” ( 종합) | 4,903 |
치안 | *** 검찰총장의 응수 “검찰 개혁 반대한 적 없다” | 11,102 |
Table 7. News comment vector values and frequencies.
Rank | Word | Vector | Euclidean | Manhattan | Scaled Euclidean | Frequency |
---|---|---|---|---|---|---|
1 | 종결 | 0.7657 | 2.0360 | 16.0372 | 4.2283 | 7,862 |
2 | 경찰관 | 0.7459 | 2.2646 | 17.5176 | 4.7085 | 2,926 |
3 | 지구대 | 0.7058 | 2.7853 | 21.7033 | 5.7627 | 784 |
4 | 파출소 | 0.6941 | 2.6570 | 21.9616 | 5.5045 | 798 |
5 | 시기상조 | 0.6814 | 2.4236 | 19.3799 | 5.0183 | 524 |
6 | 일선 | 0.6573 | 2.2796 | 17.9550 | 4.7403 | 1,664 |
7 | 검찰 | 0.6427 | 2.2701 | 17.6364 | 4.7098 | 466,911 |
8 | 수사력 | 0.6412 | 2.5455 | 20.8251 | 5.2700 | 542 |
9 | 치안 | 0.6378 | 2.7753 | 21.1061 | 5.7414 | 2,538 |
10 | 자치경찰 | 0.6261 | 2.7440 | 22.5492 | 5.6740 | 12,491 |
Table 8. News comment word similarity (2-year cycle).
Rank | 2017–2018 | 2019–2020 | 2021–2022 | |||
---|---|---|---|---|---|---|
Word | Vector | Word | Vector | Word | Vector | |
1 | 수사 | 0.7536 | 종결 | 0.7750 | 종결 | 0.7360 |
2 | 검찰 | 0.7414 | 경찰관 | 0.7692 | 수사 | 0.6835 |
3 | 지구대 | 0.6379 | 짭새 | 0.7287 | 검찰 | 0.6442 |
4 | 종결 | 0.6298 | 파출소 | 0.7273 | 역량 | 0.6210 |
5 | 일선 | 0.6279 | 지구대 | 0.7256 | 수사관 | 0.6040 |
6 | 경찰관 | 0.6098 | 경찰서 | 0.7007 | 검경 | 0.5911 |
7 | 독립 | 0.6036 | 경찰권 | 0.6846 | 현장 | 0.5909 |
8 | 기소 | 0.6003 | 자치경찰 | 0.6709 | 조정 | 0.5895 |
9 | 자치경찰 | 0.5978 | 치안 | 0.6618 | 치안 | 0.5824 |
10 | 검사 | 0.5966 | 시기상조 | 0.6504 | 경찰관 | 0.5811 |
Table 9. Top related words in news comments.
Related word | Comment | Number of interests |
---|---|---|
종결 | 경찰이 1 차수사권을 가진 이상사법경찰의 그 많은 사건 수사중 경찰이임의로 덮어버려서 묻히는 사건들이 꽤 많아질겁니다 말로는 인권침해법령위반 관계인의 이의제기 등의 단서를 달아놨지만 저기 시골같은데서 경찰이 아무도 모르게 사건 덮어버리면 알 수가 없죠 검찰로 보내는송부자료도 조작해서 결제만 해 보내면 어찌 알겠어요예전에야 불기소 의견도 모두 검찰에 넘기고 다시 한번 조사를 받고 종결되서 그나마 위법 부당한 처리를 찾을 수 있었지 앞으로는 그것도 어려울듯 이건 좀더 통제가 필요한 부분임 | 1,314 |
경찰관 | 대한민국 피의자 인권은 괴할정도입니다 지금 중요한건 매맞고 힘빠진 경찰관의 법집행력과경찰관을 포함 공무원의 기본적 인권을 지켜줄 때입니다 | 1,441 |
지구대 | 역삼지구대 3 년 근무하면 강남에 30 평대 아파트 현금으로 산다면서요 사실인가요 | 626 |
파출소 | 문제는 과다한 업무에 있습니다실종사건이 하루에도 지구대나 파출소 별로 여러 건이 하달되는데 실종사건만 전담해서 일을 할 수 없고 다른 112 신고사건도 처리해 가면서 실종사건을 처리해야 하는데서 문제가 발생합니다결국은 경찰인원을일선 지구대나 여청계 형사계 등에 많이 배치해야 하지요 쓸데없이윗자리 확보하려는 경찰서만 자꾸 만들지 말고경찰서를 오히려 축소하여야 합니다 | 267 |
시기상조 | 수사와 기소 분리가검찰개혁의 요체인건 맞지 근데 경찰 아니 견찰 보니까 수사권 때어주는 것은 아직 시기상조다 싶다 직무능력도 그렇고 조직 구조상 정권에 딸랑거릴 수밖에없거든 공수처는단호히 반대한다 기소권 수사권 모두 독점 검찰이 맡은 사건도 빼앗을 수 있는 무소 불위의 대통령 친위대 절대 반대다 | 904 |
일선 | 나도 검사였다면 범죄 소탕위해 일선에서 목숨다해 일했을텐데 ㅠㅠ 검사님들 좀만 더 힘내주세요 국민이 함께합니다 | 2,905 |
검찰 | 검찰 응원합니다 | 25,129 |
수사력 | 검찰개혁의 목적이 수사권 독립으로 산 권력도 수사해라 인데 지금의 행태는 산 권력 수사하면 아웃 인사권 횡포로 검찰수사력 과도한 억제 내지는 사법질서 파괴 | 163 |
치안 | 아직도 *** 씨가 민주주의 대통령이라 생각하시나요 이자는 *** 와 더불어 민주주의를 박살낸자 *** 보다 못한 최악의 공직자로 기록될 겁니다 *** 이 *** 보다 나은점은 올바른 인사 범죄와의 전쟁 경제활성화 등 지금과 비교하면 한 20 년 집권했다면 하는 생각이 듭니다 누구나 서울에 노력하면 집한채 살수 있었고 도 없던시절 이었지만 치안안전했고 문재인씨같은 범죄자는 사형시킬수 있었던 그런시절 이었습니다 군사쿠데타가 일어나길 바라는건 민주화이후 처음입니다 | 2,621 |
자치경찰 | 자치경찰이라니 지금도 지역 유지랑 짝짜꿍해서 전라도 염전노예는 탈출해도 경찰이잡아다 주인에게 바치는 수준인데자치경찰로 바뀌면 지역 유지는 거의 지역에서 신으로 군림할듯도대체 미국처럼 땅이 워낙 커서 자치제를 할 수 밖에는 없는 경우도 아니고손바닥만한 나라에서 무슨 지역을 그리 따지는지 암튼 자치경찰 하면 지역 토호의 온갖 범죄와 갑질을 막을 방법은 없을듯 | 2,646 |