International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 373-381
Published online December 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.4.373
© The Korean Institute of Intelligent Systems
Yudi Priyadi1, Krishna Kusumahadi2, and Pramoedya Syachrizalhaq Lyanda1
1Department of Software Engineering, Telkom University, Bandung, Indonesia
2Department of Informatics Business, Telkom University, Bandung, Indonesia
Correspondence to: Yudi Priyadi (whyphi@telkomuniversity.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Systems thinking is a discipline for understanding wholeness and frameworks based on the changing patterns of the interconnectedness of a whole system. The storytelling of a system is a description of an individual's mental model of the state of the environment. Interpretations of a system description differ because each individual has a different level of systems thinking in terms of experience, learning process, insight, intuition, and assumptions in understanding system interactions. This study aims to extract data from the storytelling description of a systems thinking case by performing text mining and similarity measurement to identify variables for forming causal loop diagrams. The conclusions of this study are as follows. First, processing the five documents successfully identified the two documents with the highest similarity value, d1 and d3. Second, based on the cosine similarity calculation results, the highest similarity value between distinct documents is 0.0913166, which occurs at the d1 and d3 positions. Third, the study produces a variable approach in the form of a group of words, used in modeling systems thinking, based on a connectedness value greater than 0.50.
Keywords: Systems thinking, Storytelling, Text mining, Similarity, Causal loop diagrams
Systems thinking is a way for an individual to understand the interrelationship of interactions in the system as a whole [1–3]. According to Sterman [4], systems thinking is a discipline in understanding wholeness and frameworks based on the changing patterns of the interconnectedness of a whole system.
The storytelling of a system is a description of an individual's mental model of the state of the environment. Individuals interpret a storytelling description differently because each is influenced by their own mental model and systems thinking.
There are differences in the interpretation of the system description. This difference occurs because each individual has a different level of systems thinking in terms of experience, learning process, insight, intuition, and assumptions in understanding system interactions [5, 6]. As a result, given the same storyline, individuals will determine variables differently. These variables are the starting point for understanding systems thinking and serve as a reference in the next stage of modeling systems thinking or system dynamics [7, 8].
Causal loop diagrams present a language for articulating an understanding of the dynamic nature of interrelated systems [9]. In this diagram, sentences are interconnected with key variables, thereby showing a causal relationship. Through several loops, a logical interrelated story can be built regarding a particular problem. According to Kim [10], who referred to guidelines regarding the causal loop diagram design rules, when building this diagram, there is a design focus, such as selecting variable names and loop construction. There are differences in determining the causal loop variable. This difference depends on theme selection, time horizon, behavior over time charts, boundary issues, level of aggregation, and significant delays [11].
In text mining, the stages of text preprocessing can be adjusted depending on the type of text data and the results required. According to Octavially et al. [12], there is an extraction process for text preprocessing consisting of tokenization, stopword removal, and stemming. In addition, there is a semantic similarity measurement process through the WordNet Similarity for Java (WS4J) application. The results of the extraction process, combined with greedy algorithms, constitute an optimal-value solution approach. There is also a method for calculating similarities using the Wu-Palmer and Levenshtein methods [12, 18].
Referring to the description of the concepts above, identifying variables in the system is essential in determining and forming the model using the causal loop diagram. Variable identification is necessary because a single storytelling description can lead to different modeling focuses, owing to differences in individuals' understanding of the description. Through text mining, the storytelling description can be used to identify variables in the form of a collection of related words that can be represented in the formation of causal loops.
This study aims to extract data in the description of the storytelling of a systems thinking case by performing text preprocessing and similarity to identify and find a variable to be used to form causal loop diagrams.
The contributions and novelties of this study are as follows:
Through the Python NLTK and based on the text description of storytelling, this activity performs case folding, tokenization, stopword removal, and stemming.
Generate a text weighting value from the results of the document similarity activity.
Generate a variable approach in the form of a set of words, which will be used in modeling systems thinking.
This section explains all concepts related to the relationship between systems thinking and causal loops, the relationship between text mining and text preprocessing, and similarity.
Systems thinking is a discipline for understanding wholeness and frameworks based on the changing patterns of the interconnectedness of the whole system. Storytelling, in the form of text descriptions of a case of phenomena, produces different variables in interpreting the situation of a system [4].
A causal loop diagram (CLD) is used to represent the association of variables in the system dynamics. This diagram presents a language for articulating an understanding of the interrelationships between words that have a causal relationship. A story can be created through a series of multiple loops that are logically related to a problem [10]. Additionally, there is an available concept that follows the coding process to obtain the causal relationships of data explicitly using qualitative analysis software to make the relationship between the final causal map and data sources more transparent. The stages are as follows [15]:
Identifying concepts and discovering themes in the data.
Categorizing and aggregating themes into variables.
Identifying causal relationships.
Transforming the coding dictionary into causal diagrams.
The relationship between systems thinking and mental models that influence assumptions in understanding the system is related to the fifth discipline for understanding a system [16]. There are several ways to describe a system, among them textual descriptions. Text mining is one of the techniques used in data mining. In simple terms, mining refers to the process of using keywords from a set of words to identify meaningful patterns or to make predictions [17]. To support these activities, there is a method for determining the semantic similarity level between pairs of short texts [19]. This method is based on the similarity between word contexts built using word embeddings and on semantic linkages between concepts drawn from external sources of knowledge [19–21].
In text mining, implementation is required to prepare the irregular text data structure to become structured data. This implementation uses several activities for text preprocessing, including the following [12, 18]:
Case folding to convert text to lowercase.
Tokenization to break sentences into words.
Stopword removal to eliminate words that have no meaning in the text mining process.
Stemming/lemmatization to reduce words to their root form.
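As a sketch of these four steps, the pipeline below uses plain Python. The tiny stopword list and the regex tokenizer are stand-ins for NLTK's stopword corpus and `word_tokenize`, and the lemmatization step is left as a placeholder comment rather than a real `WordNetLemmatizer` call.

```python
import re

# Hypothetical mini stopword list; a real pipeline would use NLTK's
# English stopword corpus instead.
STOPWORDS = {"the", "from", "in", "and", "of", "has", "to", "as", "that", "on", "a"}

def preprocess(sentence):
    # 1. Case folding: convert everything to lowercase.
    text = sentence.lower()
    # 2. Tokenization: split into word tokens, dropping punctuation.
    tokens = re.findall(r"[a-z0-9']+", text)
    # 3. Stopword removal: drop tokens with no standalone meaning.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stemming/lemmatization: a real pipeline would apply NLTK's
    #    WordNetLemmatizer here; this sketch keeps tokens as-is.
    return tokens

print(preprocess("The death toll from severe flooding has risen to 66."))
# ['death', 'toll', 'severe', 'flooding', 'risen', '66']
```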
Through the term frequency-inverse document frequency (TF-IDF), we determine the weight of each word in each document. This method is used in natural language processing (NLP), text information retrieval, and text mining. A word's weight increases with its frequency within a document and decreases with the number of documents in which it appears. In determining the value of a word, this method uses two elements: TF, the term frequency of term i in document j, and IDF, the inverse document frequency of term i [22–24]. The weight is the product of the two: w(i, j) = tf(i, j) × idf(i), where idf(i) = log(N / df(i)), N is the number of documents, and df(i) is the number of documents containing term i.
To measure the similarity of the TF-IDF weighting results, we can use cosine similarity: cos(A, B) = (A · B) / (‖A‖ ‖B‖), where A and B are the TF-IDF weight vectors of two documents.
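As an illustration of the cosine measure, the function below computes it directly from the definition. The vectors are toy examples, not the paper's TF-IDF rows.

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an all-zero vector has no direction
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5 for these toy vectors
```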
The principle of semantic similarity calculation refers to the edge-counting method over a taxonomy of nodes with a root node R. For two nodes C1 and C2, the calculation is based on the distances N1 and N2 separating C1 and C2 from the root node, and the distance N separating their closest common ancestor CS from the root node R [13]: Sim(C1, C2) = 2N / (N1 + N2). This formulation by Wu and Palmer is the basis for measuring semantic similarity [18].
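A minimal sketch of this edge-counting measure over a hypothetical toy IS-A taxonomy (in practice, the taxonomy would come from WordNet via WS4J or NLTK's `wup_similarity`):

```python
# Toy IS-A taxonomy as child -> parent edges (hypothetical, for illustration).
PARENT = {"flood": "disaster", "storm": "disaster",
          "disaster": "event", "event": "root"}

def path_to_root(node):
    # The node followed by all its ancestors up to the root.
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    # Number of edges separating `node` from the root (N1, N2, or N).
    return len(path_to_root(node)) - 1

def wu_palmer(c1, c2):
    # Closest common ancestor CS: first ancestor of c2 also above c1.
    ancestors = set(path_to_root(c1))
    cs = next(n for n in path_to_root(c2) if n in ancestors)
    # Sim(C1, C2) = 2N / (N1 + N2), all depths measured from the root.
    return 2 * depth(cs) / (depth(c1) + depth(c2))

print(wu_palmer("flood", "storm"))  # 2*2 / (3+3) = 0.666...
```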
Based on Arora et al. [14], there is an NLP-based method for analyzing the impact of requirement changes. Detection considers the phrasal structure of a requirement statement, and the input is a requirements document containing NL requirement statements. The steps of this method are as follows:
Identify the requirement statement phrase.
Calculate the value of pairwise similarity for all tokens (words) appearing in the identified phrase.
In Figure 1, the storytelling from systems thinking, in the form of a text description of a phenomenon, produces different variables. This difference can occur because every individual understands the cause and effect occurring in an environment/system differently, depending on experience, learning process, insight, intuition, and assumptions.
The proposed method, called IdVar4CL, implements a process that uses the text-mining concept approach as a solution to the problem of differing variable identifications from storytelling. At the beginning of the formation of the causal loop diagram model, determining variables is important for understanding the dynamic system.
The data sources for this research consisted of five documents adopted from the paragraph of an article (see Figure 2).
This paragraph consists of five sentences. Therefore, in preparing the dataset, the paragraph was divided into five documents (see Table 1).
In practice, the contribution of this study is the combination of systems thinking (system dynamics) and data mining (text mining). The output is a set of words (variables) used in the causal loop diagram model in system dynamics.
In the extraction process, there is a text preprocessing activity for each document, consisting of case folding, tokenization, stopword removal, and stemming. To measure semantic similarity, the WordNet Similarity for Java (WS4J) application, which calculates similarities through the Wu-Palmer and Levenshtein methods, can be used as an alternative (see Figure 3).
This section describes the implementation of all the steps in the method. Some steps explain the processing of the dataset to be extracted using text preprocessing. Then, its similarity is measured using cosine similarity and its semantics through WS4J. After successfully obtaining the variables for the causal loop diagram, the final step was to test its validity.
Referring to the five documents as a dataset, in this activity, text processing is carried out using the natural language toolkit (NLTK). NLTK is a tool used in the field of natural language processing using Python-3 programming language. Text preprocessing activities include the following steps:
Case folding to change text into lowercase letters
Tokenization to break a sentence into words
Stemming/lemmatization to reduce words to their root form
Stopword removal to eliminate words that have no meaning in the text mining process
As shown in Figure 4, there is an example snippet of an article paragraph processed in text preprocessing using case folding. All text in this paragraph has been changed to lowercase. The paragraph consists of five sentences; the text files "beforecase.txt" and "aftercase.txt" are used for reading the input and writing the results of the case folding process.
In the paragraph resulting from the case folding process, there are five sentences, each of which becomes one of the five dataset documents: d1, d2, d3, d4, and d5. Figure 5 shows the process for the dataset documents.
After preparing the documents, a tokenization process is carried out to break the sentences into words, followed by a stemming/lemmatization process so that all results are reduced to root words. For lemmatization, NLTK's "wordnet" is used as an English semantic dictionary. The normalized text resulting from lemmatization of the five documents was previously tokenized and then indexed. Figure 6 presents a screenshot of this normalization and indexing of all word results.
The indexed lemmatization results were transformed into a term frequency (TF) matrix consisting of five documents with 68 words in the corpus. The Python feature "tf_matrix.shape", which produces the output "(5, 68)", can be used to verify the form of the matrix: five documents with 68 word indices. Figure 7 shows the TF matrix.
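A minimal sketch of building such a TF matrix in plain Python, using a hypothetical two-document corpus rather than the paper's five documents:

```python
# Two already-preprocessed documents (hypothetical token lists).
docs = [
    ["death", "toll", "severe", "flooding"],
    ["landslides", "flash", "floods"],
]

# Corpus vocabulary: the sorted word index shared by all documents.
vocab = sorted({w for doc in docs for w in doc})

# Term-frequency matrix: one row per document, one column per word.
tf_matrix = [[doc.count(w) for w in vocab] for doc in docs]

# Shape check, analogous to `tf_matrix.shape` in the paper: (docs, words).
print(len(tf_matrix), len(tf_matrix[0]))  # 2 7
```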
Using the Python library “sklearn” feature, the IDF value was calculated based on the TF matrix results. Based on Figure 8, the IDF values for words that appear in many documents can be determined. There were three different IDF values.
An IDF value of 2.09861229 for words that appear in one document
An IDF value of 1.69314718 for words that appear in two documents
An IDF value of 1.40546511 for words that appear in three documents
This value can be shown through the calculation results of the “import math” library from Python, as shown in Figure 8, regarding the inverse document frequency (IDF).
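The three IDF values above are consistent with scikit-learn's default smoothed formula, idf(t) = ln((1 + N) / (1 + df(t))) + 1 with N = 5 documents (assuming the paper used `TfidfTransformer`'s defaults, which its use of "sklearn" suggests). They can be reproduced with the standard library alone:

```python
import math

N = 5  # number of documents in the corpus

def smoothed_idf(df):
    # scikit-learn's default smoothed IDF: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + N) / (1 + df)) + 1

for df in (1, 2, 3):
    print(df, round(smoothed_idf(df), 8))
# 1 2.09861229
# 2 1.69314718
# 3 1.40546511
```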
Subsequently, we calculated the TF-IDF matrix (5 × 68). After calculating the TF and IDF values, the document value weights were processed using the transformation method. This method multiplies the TF matrix (5 × 68) by the IDF matrix (68 × 68, with the IDF of each word on the main diagonal) and divides the TF-IDF by the Euclidean norm. The results are shown in Figure 9 for the TF-IDF matrix data processing.
The cosine similarity formula can be used to calculate the similarity of objects. Through this process, we searched for similarities between documents. Based on the document value weights from the TF-IDF process, an experiment was carried out with the following steps:
The two documents with the highest similarity values were selected. This process applies the formulation of cosine similarity using Python. Figure 10 illustrates the process of calculating document similarities.
Analysis of similarity calculation results. Based on the cosine similarity calculation results, the highest similarity value between distinct documents is 0.0913166, which occurs at the d1 and d3 positions. Table 2 summarizes the similarity values between documents.
Perform the stopword removal process on documents d1 and d3. The purpose of this process is to eliminate words that have no meaning in the text mining process; the results can be analyzed using Python, as shown in Figure 11. The result for document d1 is "death toll severe flooding around Indonesian capital Jakarta risen 66 parts country continue reel heavy rain began New Year Eve". For document d3, the result is "worst floods Indonesia seen since 2013, at least 29 people died aftermath torrential rains".
Semantic linkages between words were calculated. The Wu-Palmer concept of similarity is used in the "WordNet Similarity for Java" application to calculate the semantics between words contained in documents d1 and d3. Figure 12 illustrates this process of calculating similarities and semantic relatedness.
All words used as variables were identified based on the similarity results. All words with a semantic value higher than 0.50 are identified and then paired based on that value. The results of this identification are presented in Table 3, which lists the variables that will be used in the causal loop diagram. For the 0.50 limit, this study used the midpoint between 0 and 1 so that the semantic values remain dynamic and balanced.
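The selection rule can be sketched as a simple filter. The scores below mix values taken from Table 3 with one hypothetical below-threshold pair, ("capital", "rain"), added for contrast:

```python
# Word-pair connectedness scores: three from Table 3, one hypothetical.
pair_scores = {
    ("death", "floods"): 0.7500,
    ("death", "people"): 0.5455,
    ("capital", "rain"): 0.3200,   # hypothetical, below threshold
    ("country", "people"): 0.9091,
}

THRESHOLD = 0.50  # midpoint of the [0, 1] similarity range

# Keep only pairs whose connectedness exceeds the threshold.
variables = [pair for pair, score in pair_scores.items() if score > THRESHOLD]
print(variables)
```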
Based on the results and discussion, data were extracted from the storytelling description of the systems thinking case. This extraction was performed through text preprocessing and similarity measurement, identifying variables to be used in forming causal loop diagrams. Three conclusions form the core of this research activity:
Through the Python NLTK and based on the text description of storytelling, this activity performs case folding, tokenization, stopword removal, and stemming/lemmatization, applied to five documents (i.e., d1, d2, d3, d4, and d5). Processing the five documents successfully identified the two documents with the highest similarity value:
d1 = “death toll severe flooding around Indonesian capital Jakarta risen 66 parts country continue reel heavy rain began New Year Eve.”
d3 = “worst floods Indonesia seen since 2013, at least 29 people died aftermath torrential rains.”
Generate a text weighting value from the results of the document similarity activity. Based on the cosine similarity calculation results, the highest similarity value between distinct documents is 0.0913166, which occurs at the d1 and d3 positions.
This produced a variable approach in the form of a group of words, which will be used in modeling systems thinking, based on a connectedness value greater than 0.50.
For further research, we plan to conduct the following to reach the implementation phase:
The validity and reliability of all variables generated by the prototype method named “IdVar4CL” will be investigated. This research will continue by applying the agreement coefficient concept to measure the veracity of all variables produced, employing the agreement of experts in text mining and system dynamics activities as a reference.
This variable identification method will be tested through implementation in a systems thinking case study that uses causal loop diagram modeling. It is expected that all variables produced in this study can follow the way systems analysts think when determining variables during modeling.
This activity can be used to initiate software that can be registered for intellectual property rights.
No potential conflict of interest relevant to this article was reported.
Table 1. Documents.
Document labeling | Document identification |
---|---|
d1 | The death toll from severe flooding in and around the Indonesian capital of Jakarta has risen to 66 as parts of the country continue to reel from heavy rain that began on New Year’s Eve. |
d2 | Landslides and flash floods have displaced more than 36,000 in Jakarta and the nearby provinces of West Java and Banten, according to the ASEAN Coordinating Center for Humanitarian Assistance (AHA). |
d3 | These are the worst floods Indonesia has seen since 2013, when at least 29 people died in the aftermath of torrential rains. |
d4 | The disaster, experts say, underscores the impacts of climate change in a country with a capital city that is sinking so quickly that officials are working to move it to another island. |
d5 | The floods are also threatening to exacerbate the already severe wealth inequality that plagues the Southeast Asian nation. |
Table 2. Document similarity values.
 | d1 | d2 | d3 | d4 | d5 |
---|---|---|---|---|---|
d1 | 1 | 0.036 | 0.091 | 0.079 | 0.049 |
d2 | 0.036 | 1 | 0.033 | 0 | 0.035 |
d3 | 0.091 | 0.033 | 1 | 0 | 0.045 |
d4 | 0.079 | 0 | 0 | 1 | 0 |
d5 | 0.049 | 0.035 | 0.045 | 0 | 1 |
Table 3. Identification of variables for causal loops.
Variable identification results | Value of connectedness | |
---|---|---|
Death | Floods | 0.7500 |
Death | People | 0.5455 |
Death | Aftermath | 0.7059 |
Death | Rains | 0.6316 |
Toll | Floods | 0.7059 |
Toll | Aftermath | 0.7059 |
Capital | Floods | 0.5217 |
Capital | People | 0.5333 |
Risen | Seen | 0.5714 |
Risen | Died | 0.6667 |
Parts | Floods | 0.6000 |
Parts | People | 0.6000 |
Country | People | 0.9091 |
Continue | Seen | 0.5714 |
Rain | Floods | 0.6316 |
Began | Seen | 0.6667 |
Year | Floods | 0.5333 |
Year | People | 0.6667 |
Eve | Floods | 0.5333 |
E-mail: whyphi@telkomuniversity.ac.id
International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 373-381
Published online December 25, 2022 https://doi.org/10.5391/IJFIS.2022.22.4.373
Copyright © The Korean Institute of Intelligent Systems.
Yudi Priyadi1 , Krishna Kusumahadi2, and Pramoedya Syachrizalhaq Lyanda1
1Department of Software Engineering, Telkom University, Bandung, Indonesia
2Department of Informatics Business, Telkom University, Bandung, Indonesia
Correspondence to:Yudi Priyadi (whyphi@telkomuniversity.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Systems thinking is a discipline for understanding wholeness and frameworks based on the changing patterns of the interconnectedness of the whole system. The storytelling of a system is a description of the mental model of an individual in describing the state of the environment. There are differences in the interpretation of the system description. This difference occurs because each individual has a different level of systems thinking in terms of experience, learning process, insight, intuition, and assumption in understanding system interactions. This study aims to extract data in the description of the storytelling of a systems thinking case by performing text mining and similarity to identify and find a variable to form causal loop diagrams. Based on the results of this study, there are results in the data extraction from the description of storytelling for the systems thinking case. The conclusions of this study are as follows: First, processing the five documents has successfully identified two documents with the highest similarity value, such as d1 and d3. Second, based on the cosine similarity calculation results and the results of the similarity value, there is a value closest to 1, such as 0.0913166. This value is at the d1 and d3 positions. Third, it produces a variable approach in the form of a group of words used in modeling thinking systems based on a connectedness value greater than 0.50.
Keywords: Systems thinking, Storytelling, Text mining, Similarity, Causal loop diagrams
Systems thinking is a way for an individual to understand the interrelationship of interactions in the system as a whole [1–3]. According to Sterman [4], systems thinking is a discipline in understanding wholeness and frameworks based on the changing patterns of the interconnectedness of a whole system.
The storytelling of a system is a description of an individual’s mental model in describing the state of the environment. An individual has a difference in interpreting the explanation description of storytelling, and this difference occurs because each individual is influenced by their mental model and systems thinking.
There are differences in the interpretation of the system description. This difference occurs because each individual has a different level of systems thinking in terms of experience, learning process, insight, intuition, and assumptions in understanding system interactions [5, 6]. As a result, if there is the same storyline, each individual will make a difference when determining the variable. This variable is the beginning of understanding systems thinking and is used as a reference in the next stage for modeling systems thinking or system dynamics [7, 8].
Causal loop diagrams present a language for articulating an understanding of the dynamic nature of interrelated systems [9]. In this diagram, sentences are interconnected with key variables, thereby showing a causal relationship. Through several loops, a logical interrelated story can be built regarding a particular problem. According to Kim [10], who referred to guidelines regarding the causal loop diagram design rules, when building this diagram, there is a design focus, such as selecting variable names and loop construction. There are differences in determining the causal loop variable. This difference depends on theme selection, time horizon, behavior over time charts, boundary issues, level of aggregation, and significant delays [11].
In text mining for the activities of text preprocessing, the stages of the process can be adjusted depending on the type of text data and the results required. According to Octavially et al. [12], there is an extraction process for text preprocessing consisting of tokenization, stopword removal, and stemming. In addition, there is a semantic similarity measurement process through the WordNet similarity for Java applications. The results of the extraction process, combined with greedy algorithms, constitute an optimal value solution approach. In addition, there is a method for calculating similarities using the Wu Palmer and Levenshtein method [12, 18].
Referring to the description of the concept above, identifying variables in the system is essential in determining and forming the model using the causal loop diagram. Variable identification must be made because the results of one storytelling description will result in the focus of a different modeling approach. This is because of the difference in an individual’s understanding of a storytelling description. Through text mining, the description of storytelling can be used to identify variables in the form of a collection of words that are related and can be represented in the formation of casual loop events.
This study aims to extract data in the description of the storytelling of a systems thinking case by performing text preprocessing and similarity to identify and find a variable to be used to form causal loop diagrams.
The contributions and novelties of this study are as follows:
Through the Python NLTK and based on the text description of storytelling, this activity produces case folding, tokenization, stopword removal, and stemming.
Generate a text weighting value from the results of the document similarity activity.
Generate a variable approach in the form of a set of words, which will be used in modeling systems thinking.
This section explains all concepts related to understanding the relationship between the systems thinking and causal loops, the relationship between text mining and text preprocessing, and similarity.
Systems thinking is a discipline for understanding wholeness and frameworks based on the changing patterns of the interconnectedness of the whole system. Storytelling, in the form of text descriptions of a case of phenomena, produces different variables in interpreting the situation of a system [4].
A causal loop diagram (CLD) is used to represent the association of variables in the system dynamics. This diagram presents a language for articulating an understanding of the interrelationships between words that have a causal relationship. A story can be created through a series of multiple loops that are logically related to a problem [10]. Additionally, there is an available concept that follows the coding process to obtain the causal relationships of data explicitly using qualitative analysis software to make the relationship between the final causal map and data sources more transparent. The stages are as follows [15]:
Identifying concepts and discovering themes in the data.
Categorizing and aggregating themes into variables.
Identifying causal relationships.
Transforming the coding dictionary into causal diagrams.
The relationship between systems thinking and mental models that influence assumptions in understanding the system is related to the fifth discipline for understanding a system [16]. There are several ways to describe the system. Among them is through textual descriptions. Text mining is one of the techniques used in data mining. In simple terms, mining refers to the process of using keywords from a set of words to identify patterns that have meaning or to make predictions [17]. Support for these activities. There is a method for determining the semantic similarity level between pairs of short texts [19]. This method is based on the similarity between word contexts that are built using word embeddings and semantic linkages between concepts based on external sources of knowledge [19–21].
In text mining, the implementation is required to prepare for the irregularity of the text data structure to become structured data. This implementation uses several activities for text preprocessing, including [12, 18]
Case folding for converting to lowercase.
Tokenization in breaking sentences into words.
Stopwords removal to eliminate words that have no meaning in the text mining process.
Stemming/lemmatization to make the root word.
Through the term frequency-inverse document frequency (TF-IDF), we determine the weight of each word in each document. This method is used in natural language processing (NLP), text information retrieval, and text mining. The more important a word is, the more it appears in a document. In determining the value of a word, this method uses two elements: TF - term frequency of the term “i” in document “j” and IDF - inverse document frequency of the term “i” [22–24]. The multiplication result between the TF and IDF uses the following formula:
To find the similarity of the TF-IDF weighting results, we can use the formula below.
The principle of semantic similarity calculation refers to the edge-counting method, which is a set of nodes and root nodes (R). For nodes C1 and C2, the similarity between two elements is calculated. The principle of similarity calculation is based on the distance (N1 and N2), which separates C1 and C2 nodes from the root node, and the distance (N) separated by CS between C1 and C2 from the root node R [13]. The basis for measuring semantic similarity is defined by the formulation of Wu and Palmer [18].
Based on Arora et al. [14], some requirement methods analyze the impact of NLP-based changes. In this method, detection is performed that considers the phrasal structure of the statement of a requirement. The input is in the form of a requirement document that contains the NL requirement statement. The steps of the requirements statement process for this method are as follows:
Identify the requirement statement phrase.
Calculate the value of pairwise similarity for all tokens (words) appearing in the identified phrase.
In Figure 1, the storytelling from systems thinking, in the form of a text description of a phenomenon, produces different variables. This difference can occur because each individual understands the causes and effects occurring in an environment/system differently, depending on experience, learning process, insight, intuition, and assumption.
The proposed method, called IdVar4CL, applies a text-mining approach to resolve these differing identifications of storytelling. At the beginning of the formation of the causal loop diagram model, determining the variables is important for understanding the dynamic system.
The data sources for this research consisted of five documents adopted from the paragraphs of articles (see Figure 2).
This paragraph consists of five sentences; therefore, in preparing the dataset, it was divided into five documents (see Table 1).
In practice, the contribution of this study is the combination of systems thinking (system dynamics) and data mining (text mining). The output is a set of words (variables) used in the causal loop diagram model in system dynamics.
In the extraction process, there is a text preprocessing activity for each document, consisting of case folding, tokenization, stopword removal, and stemming. To measure the semantic similarity, WordNet Similarity for Java (WS4J), which includes methods for calculating similarities through the Wu–Palmer and Levenshtein measures, can be used as an alternative (see Figure 3).
This section describes the implementation of all the steps in the method. Some steps explain the processing of the dataset to be extracted using text preprocessing. Then, its similarity is measured using cosine similarity and its semantics through WS4J. After successfully obtaining the variables for the causal loop diagram, the final step was to test its validity.
Referring to the five documents as a dataset, in this activity, text processing is carried out using the natural language toolkit (NLTK). NLTK is a tool used in the field of natural language processing using Python-3 programming language. Text preprocessing activities include the following steps:
Case folding to convert all text to lowercase
Tokenization to break sentences into words
Stemming/lemmatization to reduce words to their essential root forms
Stopword removal to eliminate words that carry no meaning for the text mining process
Figure 4 shows a snippet of the article paragraph processed by case folding during text preprocessing. All text in this paragraph has been converted to lowercase. The paragraph consists of five sentences; the text files "beforecase.txt" and "aftercase.txt" are used to read the input and write the results of the case folding process.
In the paragraph resulting from the case folding process, there are five sentences, each of which becomes one of the five dataset documents: d1, d2, d3, d4, and d5. Figure 5 shows the process for the dataset documents.
After preparing the documents, a tokenization process breaks the sentences into words, followed by a stemming/lemmatization process so that all results are reduced to their essential root forms. For lemmatization, NLTK's "wordnet" is used as an English semantic dictionary. The text normalized by lemmatization in the five previously tokenized documents is then indexed. Figure 6 presents a screenshot of this process of normalizing and indexing all resulting words.
The indexed lemmatization results were transformed into a TF matrix consisting of five documents and 68 words in the corpus. The Python expression "tf_matrix.shape", which produces the output "(5, 68)", can be used to verify the form of the matrix: five documents with 68 word indices. Figure 7 shows the term frequency (TF) matrix.
Using the Python library "sklearn", the IDF values were calculated from the TF matrix results. Based on Figure 8, the IDF values for words that appear in many documents can be determined. There were three distinct IDF values:
An IDF value of 2.09861229 for words that appear in one document
An IDF value of 1.69314718 for words that appear in two documents
An IDF value of 1.40546511 for words that appear in three documents
These values can be verified through calculations with Python's "math" library, as shown in Figure 8 regarding the inverse document frequency (IDF).
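The three reported values are consistent with scikit-learn's smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n = 5 documents; a quick check with Python's math library:

```python
import math

# Reproducing the three reported IDF values with scikit-learn's smoothed
# IDF formula: idf(t) = ln((1 + n) / (1 + df(t))) + 1, for n = 5 documents.
n = 5
for df in (1, 2, 3):
    print(df, math.log((1 + n) / (1 + df)) + 1)
# df=1 -> 2.0986..., df=2 -> 1.6931..., df=3 -> 1.4055...
```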
Subsequently, we calculated the TF-IDF matrix (5 × 68). After calculating the TF and IDF values, the document weights were computed using the transformation method: the TF matrix (5 × 68) is multiplied by the IDF matrix (68 × 68, with the IDF of each word on the main diagonal), and each TF-IDF row is divided by its Euclidean norm. The results are shown in Figure 9 for the TF-IDF matrix data processing.
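This transformation step can be sketched for a single document row with toy values (multiply the term counts by the per-word IDFs, then divide by the Euclidean norm so each row has unit length):

```python
import math

# Toy TF row and per-word IDF values (the diagonal of the 68x68 IDF matrix
# holds one IDF per word; the numbers here are made up for illustration).
tf_row = [2, 1, 0, 1]
idf = [1.4, 1.7, 2.1, 1.7]

weighted = [t * i for t, i in zip(tf_row, idf)]   # TF x IDF
norm = math.sqrt(sum(w * w for w in weighted))    # Euclidean norm
tfidf_row = [w / norm for w in weighted]          # unit-length row

print(sum(w * w for w in tfidf_row))  # -> 1.0 up to rounding
```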
The cosine similarity formula can calculate the similarity between objects; through this process, we searched for similarities between the documents. Based on the document weights from the TF-IDF process, an experiment was carried out with the following steps:
The two documents with the highest similarity values were selected. This process applies the formulation of cosine similarity using Python. Figure 10 illustrates the process of calculating document similarities.
Analysis of the similarity calculation results. Among all off-diagonal cosine similarity values, the value closest to 1 (i.e., the highest similarity between distinct documents) is 0.0913166, found at the d1-d3 position. Table 2 summarizes the similarity values between documents.
Perform the stopword removal process on documents d1 and d3 to eliminate words that have no meaning for the text mining process. The results can be obtained using Python, as shown in Figure 11, which illustrates the removal of meaningless words. The result for document d1 is "death toll severe flooding around Indonesian capital Jakarta risen 66 parts country continue reel heavy rain began New Year Eve". For document d3, the result is "worst floods Indonesia seen since 2013, least 29 people died aftermath torrential rains".
Semantic linkages between words were calculated. The Wu–Palmer similarity concept, as implemented in the "WordNet Similarity for Java" (WS4J) application, is used to calculate the semantic relatedness between words contained in documents d1 and d3. Figure 12 illustrates this calculation of similarities and semantic relatedness.
All words used as variables were identified based on the similarity results. All word pairs with a semantic value higher than 0.50 are identified and grouped by semantic value. The results of this identification are presented in Table 3, which lists the variables to be used in the causal loop diagram. The 0.50 limit is the midpoint between 0 and 1, chosen so that the semantic values remain dynamic and balanced.
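The two selection steps (picking the most similar document pair, then keeping only word pairs above the 0.50 midpoint) can be sketched with values from Tables 2 and 3; the final below-threshold pair is a hypothetical illustration:

```python
# Upper-triangle similarities from Table 2: pick the most similar pair.
sims = {("d1", "d2"): 0.036, ("d1", "d3"): 0.091, ("d1", "d4"): 0.079,
        ("d1", "d5"): 0.049, ("d2", "d3"): 0.033, ("d2", "d4"): 0.0,
        ("d2", "d5"): 0.035, ("d3", "d4"): 0.0, ("d3", "d5"): 0.045,
        ("d4", "d5"): 0.0}
closest_pair = max(sims, key=sims.get)
print(closest_pair, sims[closest_pair])  # ('d1', 'd3') 0.091

# A few relatedness scores in the style of Table 3; the last pair is a
# hypothetical below-threshold example. Keep only pairs above 0.50.
pairs = [("death", "floods", 0.7500), ("death", "people", 0.5455),
         ("capital", "floods", 0.5217), ("began", "rains", 0.4500)]
variables = [(a, b) for a, b, s in pairs if s > 0.50]
print(variables)
```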
Based on the results and discussion, the outcome is the data extraction from the description of storytelling for the systems thinking case. This extraction is performed through text preprocessing and similarity measurement, identifying variables to be used in forming causal loop diagrams. Three conclusions form the core of this research activity:
Through the Python NLTK, and based on the text description of the storytelling, this activity applied case folding, tokenization, stopword removal, and stemming/lemmatization to five documents (i.e., d1, d2, d3, d4, and d5). Processing the five documents successfully identified the two documents with the highest similarity value:
d1 = "death toll severe flooding around Indonesian capital Jakarta risen 66 parts country continue reel heavy rain began New Year Eve."
d3 = "worst floods Indonesia seen since 2013, at least 29 people died aftermath torrential rains."
Generate a text weighting value from the document similarity activity. Among all the cosine similarity results, the value closest to 1 (the highest similarity between distinct documents) is 0.0913166, found at the d1-d3 position.
This produced a variable approach in the form of a group of words to be used in modeling systems thinking, based on a connectedness value greater than 0.50.
For further research, we plan to conduct the following to reach the implementation phase:
The validity and reliability of all variables generated by the prototype method "IdVar4CL" will be investigated. This research will continue by applying the agreement coefficient concept to measure the veracity of all variables produced. This measurement can be carried out by employing the agreement of experts in text mining and system dynamics activities as a reference.
This variable identification method will be tested through implementation in a systems thinking case study using causal loop diagram modeling. It is expected that all variables produced in this study can follow the way systems analysts think when determining variables for modeling.
This activity can be used to initiate software that produces Intellectual Property Rights.
Fundamental ideas for the IdVar4CL method.
Paragraphs as dataset. Source:
Illustration of causal loop variable identification.
Case folding.
Dataset documents.
Normalization and index.
Term frequency (TF) matrix.
Inverse document frequency (IDF).
Preview of TF-IDF matrix results.
Cosine similarity.
Stopword removal process.
Measures semantic similarity/relatedness between words.
Table 1. Documents.
Document labeling | Document identification |
---|---|
d1 | The death toll from severe flooding in and around the Indonesian capital of Jakarta has risen to 66 as parts of the country continue to reel from heavy rain that began on New Year’s Eve. |
d2 | Landslides and flash floods have displaced more than 36,000 in Jakarta and the nearby provinces of West Java and Banten, according to the ASEAN Coordinating Center for Humanitarian Assistance (AHA). |
d3 | These are the worst floods Indonesia has seen since 2013, when at least 29 people died in the aftermath of torrential rains. |
d4 | The disaster, experts say, underscores the impacts of climate change in a country with a capital city that is sinking so quickly that officials are working to move it to another island. |
d5 | The floods are also threatening to exacerbate the already severe wealth inequality that plagues the Southeast Asian nation. |
Table 2. Document similarity values.
d1 | d2 | d3 | d4 | d5 | |
---|---|---|---|---|---|
d1 | 1 | 0.036 | 0.091 | 0.079 | 0.049 |
d2 | 0.036 | 1 | 0.033 | 0 | 0.035 |
d3 | 0.091 | 0.033 | 1 | 0 | 0.045 |
d4 | 0.079 | 0 | 0 | 1 | 0 |
d5 | 0.049 | 0.035 | 0.045 | 0 | 1 |
Table 3. Identification of variables for causal loops.
Variable identification results | Value of connectedness | |
---|---|---|
Death | Floods | 0.7500 |
Death | People | 0.5455 |
Death | Aftermath | 0.7059 |
Death | Rains | 0.6316 |
Toll | Floods | 0.7059 |
Toll | Aftermath | 0.7059 |
Capital | Floods | 0.5217 |
Capital | People | 0.5333 |
Risen | Seen | 0.5714 |
Risen | Died | 0.6667 |
Parts | Floods | 0.6000 |
Parts | People | 0.6000 |
Country | People | 0.9091 |
Continue | Seen | 0.5714 |
Rain | Floods | 0.6316 |
Began | Seen | 0.6667 |
Year | Floods | 0.5333 |
Year | People | 0.6667 |
Eve | Floods | 0.5333 |