International Journal of Fuzzy Logic and Intelligent Systems 2021; 21(4): 409-422
Published online December 25, 2021
https://doi.org/10.5391/IJFIS.2021.21.4.409
© The Korean Institute of Intelligent Systems
Hasan J. Alyamani1, Shakeel Ahmad2, Asif Hassan Syed2, Sheikh Muhammad Saqib3, and Yasser D. Al-Otaibi1
1Department of Information Systems, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia
2Department of Computer Science, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia
3Institute of Computing and Information Technology, Gomal University, Dera Ismail Khan, Pakistan
Correspondence to: Sheikh Muhammad Saqib (saqibsheikh4@gu.edu.pk)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Due to the large amount of Arabic text produced daily, there is a need to analyze these texts. A comprehensive literature review revealed several issues related to Arabic text summarization, keyword extraction, and sentiment analysis. These issues arise from several factors, such as the structure and morphology of Arabic text, a lack of machine-readable Arabic dictionaries, insufficient tools for managing Arabic text, the absence of standard datasets, inherently cursive scripts, and isolated characters; thus, there is a need to render Arabic text in forms that can be easily read by machine learning and deep learning algorithms as well as by existing analysis tools. To achieve this, Arabic texts must be converted into English texts. This paper proposes a lexicon, the AEC-Lexicon, for use by researchers working with Arabic; it is based on the Arabic case system and converts Arabic text into English text. Experimental results based on latent semantic indexing (LSI) show that texts generated by the proposed work exhibit a significant improvement over existing work (Arabic texts converted to English) in terms of readability and comprehension as well as relevance to the original Arabic text.
Keywords: Arabic text, Machine learning algorithms, Lexicon, Automatic generated summary
This section describes the problems identified in the literature and how the proposed work overcomes them. There are six major dialect groups in Saudi Arabia. The main dialects are the Hejazi and Nejdi dialects, with six million and eight million speakers, respectively [1]. According to the 2017 Arab Social Media Report [2], there are 2.6 million active Twitter users in Saudi Arabia. Learning Arabic makes one stand out, as very few people from the West speak Arabic, and command of the Arabic language is seen as clever and sophisticated [3]; accordingly, a considerable amount of research has been conducted on Arabic texts. Arabic summarization is a highly difficult task because the Arabic language is complex with respect to its morphology and structure [4]. Because standard Arabic morphological analysis tools are not available [5], some researchers have utilized common Arabic stemming as a workaround [6].
Existing resources for Arabic texts are limited, and the accuracy of existing methods is low owing to Arabic-specific issues (e.g., limited resources, Arabic's morphological complexity, differing dialects) and general linguistic issues (e.g., implicit sentiment, sarcasm, polarity fuzziness, polarity strength, review quality, spam, domain dependence) [7]. Stemming Arabic words is a complex issue in the field of Arabic text classification [8]. The studies in [9] and [10] investigated the issues and challenges in identifying sentiment in informal Arabic in the context of Twitter and YouTube content, which is unstructured in nature. Limited research has been conducted on opinion mining using Arabic Twitter [11] because the majority of natural language processing (NLP) tools for Arabic have been developed for Modern Standard Arabic (MSA) rather than informal Arabic [12]. By contrast, machine learning algorithms and tools are widely available for English, so these issues do not arise for machine-readable English texts.
Thus, there is a strong need to analyze the bulk of Arabic text being created daily, but it is difficult to manage because of the following issues involved in handling Arabic text:
- The complexity of the Arabic language related to structure and morphology [4].
- Due to the inherent complexity of the Arabic language, very little research has been conducted combining Arabic text summarization and text entailment to produce extracts [13].
- Lack of existing Arabic morphological analysis tools [5].
- A limited number of documents with keywords are available online [14].
- There is a lack of techniques for scanning and matching of Arabic texts to identify key phrases [15].
- No gold standard summaries for Arabic and a lack of machine-readable Arabic dictionaries [16].
- Little current research on the automatic classification of text documents, owing to Arabic stemming [8] and to Arabic-specific and general linguistic issues [7]; moreover, existing tools target MSA and ignore informal Arabic [11].
The most obvious characteristics of the Arabic language are that Arabic scripts are inherently cursive and that isolated characters are written out, as shown in Figure 1 [17]; thus, there is a need to convert Arabic text into English text for further processing.
The primary purpose of the proposed methodology is to address all of the issues discussed. For example, there should be a common and consistent form of Arabic text that resolves the issues relating to machine-readability, the generation of standard summaries, and morphology, and that can be read by different tools. In this study, Arabic texts were converted into English texts by preparing a lexicon for conversion. The coordinates of documents produced by the existing approach and by the proposed approach were also compared using the latent semantic indexing method, and it was found that the documents produced by the proposed method were highly similar to the Arabic documents. The key contribution of this work is the proposed lexicon, which is based on several generated rules:
- Proposing Simple-Rules (based on the absence of case systems).
- Proposing FKD-Rules based on Fathah, Kasrah, and Dammah.
- Proposing T-Rules based on Tanveen.
- Proposing a user-friendly interface for converting Arabic texts into English texts.
A study [18] investigated Internet-based Arabic and Islamic content containing relevant information from prophetic narration texts using an artificial intelligence (AI) approach. The authors proposed a semantically driven approach to analyzing Arabic discourse following the segmented discourse representation theory (SDRT) framework. It has also been found that discourse analysis can be used to produce indicative summaries of Arabic documents [13]. The authors of [19] proposed a new model for summarization based on clusters resulting from document clustering and the extraction of keyphrases. The authors of [20] proposed a supervised approach based on AdaBoost to produce Arabic summary extraction; that study utilized a set of statistical features including sentence position, the number of keywords in a sentence, overlap with the title words, and sentence length. The approach proposed in [21] produces automatic summaries of multiple large-scale Arabic documents using a genetic algorithm and the MapReduce parallel programming model. Few studies have been conducted on an approach that ensures scalability, speed, and accuracy of summary generation [13].
Arabic text summarization is still a developing area because of the inherent complexity of the Arabic language and the lack of a standardized evaluation process, as there are no gold standard summaries for Arabic and no machine-readable Arabic dictionaries [16]. One model presents a query-based Arabic text summarization system using Arabic WordNet and an extracted knowledge base [22]. As reported in [23], a weighted directed graph represents each document, with nodes corresponding to sentences and edge weights representing similarities between the sentences. The authors of [24] introduced a new method for Arabic text summarization based on graph theory; however, summarization systems for Arabic are still not as mature or as reliable as those developed for other languages such as English. In [25], the authors addressed the problem of developing an Arabic text summarization system (LCEAS) that produces extracts without redundancy.
Many text classification (TC) studies have been conducted and tested on the English, French, German, Spanish, Chinese, Greek, and Japanese languages [26]. Opinion mining is the process of automatically identifying opinions expressed in Arabic texts on certain subjects [27]; however, there is little current research on the automatic classification of Arabic text documents because of varying spellings of certain words, various ways of writing certain character combinations, and short (diacritic) and long vowels. In addition, most Arabic words contain affixes [28]. Mesleh [29] utilized three machine-learning algorithms, support vector machine (SVM), k-nearest neighbor (KNN), and naïve Bayes (NB), to classify Arabic data collected from Arabic newspapers. The studies [30] and [31] used TC and analyses of Arabic texts to automatically assign texts to predefined categories based on their linguistic features. The NB, KNN, multinomial logistic regression (MLR), and maximum weight (MW) classifiers were used in [32], with the NB classifier implemented as the master and the others as slaves. All tests were performed after preprocessing the Arabic text documents, and the results showed that the master-slave technique yields a significant improvement in the accuracy of text document classification compared with other techniques [32]. The authors of [7] also used three machine-learning classification techniques, NB, SVM, and decision trees, to improve their sentiment analyzer. Stemming and document-embedding techniques were used to investigate the impact of preprocessing methods on the performance of three machine-learning algorithms used for Arabic text categorization: NB, discriminative multinomial naive Bayes (DMNB) text, and C4.5 [33]. The results of the sentiment analysis and subjectivity analysis demonstrated that while the use of lexeme or lemma data is useful, individualized solutions are needed for each task and genre [10].
This section provides information about the case system of Arabic, which corresponds to vowels in English. The case endings in Arabic are small patterns (Harakaat) appended to the ends of words to indicate their grammatical functions in a sentence. Case endings are usually not written (with one exception) outside of the Qur'an and children's books; however, newscasters pronounce them, and to speak formal Arabic well, it is important to be familiar with the case system [3,34].
In the proposed work, three rules were created to populate the lexicon: the Simple-Rule, FKD-Rule, and T-Rule. The FKD-Rule is based on Fathah, Kasrah, and Dammah, while the T-Rule is based on Tanveen. The following Arabic case markers are matched with vowels and some other sounds of Arabic. In Arabic, the Fathah is a small diagonal line placed above a letter and maps to the letter a in English. The Kasrah is a small diagonal line placed below a letter and represents a short i sound, mapping to the letter i, while the Dammah is a small curl-like mark placed above a letter and maps to the letter u.
The Arabic to English convertor lexicon (AEC-Lexicon) was developed from different sets of pairs and consists of three categories of pairs. Some heuristics from Arabic sounds were used to develop this lexicon. The first set was generated using only simple, similar-sounding characters from Arabic and English, as given in the transliteration system [35]. The second set was generated from similar-sounding characters in Arabic and English using Fathah, Kasrah, and Dammah. The third set was generated from similar-sounding characters in Arabic and English using Tanveen.
The Simple-Rule consists of simple characters without a case marker. This rule consists of tuples containing pairs denoted ASC:ESC, where ASC denotes the Arabic sound character(s) and ESC the relevant English sound character (e.g., S). After processing, 20 pairs of Arabic to English sounds were produced, as shown in Table 1.
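To make the rule concrete, the sketch below renders the Simple-Rule as a lookup table in Python (the implementation language named later in this section). The five pairs shown are common Arabic-English transliterations chosen purely for illustration; the actual 20 pairs are those of Table 1.

```python
# Illustrative Simple-Rule lookup table. The five ASC:ESC pairs below are
# standard transliterations used only as examples; the paper's full set of
# 20 pairs is listed in Table 1.
SIMPLE_RULE = {
    "\u0628": "b",  # ب
    "\u0633": "s",  # س
    "\u0644": "l",  # ل
    "\u0645": "m",  # م
    "\u0646": "n",  # ن
}

def apply_simple_rule(text: str) -> str:
    """Replace each bare Arabic consonant with its English sound character."""
    return "".join(SIMPLE_RULE.get(ch, ch) for ch in text)
```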
The basic case markers in Arabic that produce sound are A-FKD = {FATHAH, KASRAH, DAMMAH}, with English replacements E-rFKD = {A, I, U}, where "A" is the alternate of FATHAH, "I" is the alternate of KASRAH, and "U" is the alternate of DAMMAH.
Pair-FKD(A:E) is a pair consisting of ASC.A-FKD and ESC.E-rFKD and represents a tuple; for example, for the Arabic character ب with the English character B, the tuples become (بَ : Ba), (بِ : Bi), and (بُ : Bu), as shown in Figure 2.
Table 2 contains all pairs generated using the FKD-Rule.
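A minimal sketch of how the FKD-Rule pairs of Table 2 can be generated: each Simple-Rule consonant pair is crossed with the three diacritics and their English vowel alternates. The diacritic code points are the standard Unicode marks; the consonant table is the illustrative one above, not the paper's full lexicon.

```python
# FKD-Rule pair generation: cross every ASC:ESC consonant pair with
# Fathah/Kasrah/Dammah and their English vowel alternates a/i/u.
A_FKD = {
    "\u064E": "a",  # Fathah (small diagonal line above the letter)
    "\u0650": "i",  # Kasrah (small diagonal line below the letter)
    "\u064F": "u",  # Dammah (small curl above the letter)
}

def fkd_pairs(simple_rule: dict) -> dict:
    """Build Pair-FKD tuples from a Simple-Rule table."""
    return {asc + mark: esc + vowel
            for asc, esc in simple_rule.items()
            for mark, vowel in A_FKD.items()}

# Example: fkd_pairs({"ب": "b"}) -> {"بَ": "ba", "بِ": "bi", "بُ": "bu"}
```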
The “T” in T-Rule refers to Tanveen, and this rule is also based on the FKD-Rule. The Tanveen marks used in Arabic are A-T = {Fathatan, Kasratan, Dammatan}, with English replacements E-rT = {AN, IN, UN}, where "AN" is the replacement for Fathatan, "IN" for Kasratan, and "UN" for Dammatan.
Pair-T(A:E) is a pair of ASC.A-T and ESC.E-rT representing a tuple; for example, for the Arabic character ب with the English character B, the tuples become (بً : Ban), (بٍ : Bin), and (بٌ : Bun), as shown in Figure 3.
All pairs generated using the T-Rule are shown in Table 3. In Table 3, the rows with black text have unique sounds, while the remaining colors maintain the same sounds from Arabic to English.
These rules were implemented in Python because it is an open-source language that is easy to understand, learn, and use, and it has highly useful libraries for data manipulation and analysis. Compared with other modern languages, Python-based solutions are remarkable in terms of agility and productivity. Companies of all sizes and in all regions, from the largest investment banks to the smallest social/mobile web app startups, use Python to operate their businesses and manage their data [36].
After generating the three rules, they were applied to Arabic sentences. Either the FKD-Rule or T-Rule can be applied first, and then the Simple-Rule may be applied, because applying the Simple-Rule first when converting Arabic texts into English may result in a less readable form. The conversion process is shown in Figure 4.
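Continuing the illustrative sketches above, the following hedged sketch shows one way to realize this ordering: the two-character Tanveen and FKD sequences are consumed first, so that the Simple-Rule never strips a consonant out from under its diacritic.

```python
# Conversion order of Figure 4 (sketch only): T-Rule, then FKD-Rule, then
# the Simple-Rule. Reuses SIMPLE_RULE, A_FKD, fkd_pairs, and
# apply_simple_rule from the sketches above.
A_T = {
    "\u064B": "an",  # Fathatan
    "\u064D": "in",  # Kasratan
    "\u064C": "un",  # Dammatan
}

def convert(text: str) -> str:
    t_pairs = {asc + mark: esc + sound
               for asc, esc in SIMPLE_RULE.items()
               for mark, sound in A_T.items()}
    out = text
    # Two-character sequences first, so their consonants are not converted
    # prematurely by the Simple-Rule.
    for table in (t_pairs, fkd_pairs(SIMPLE_RULE)):
        for asc, esc in table.items():
            out = out.replace(asc, esc)
    return apply_simple_rule(out)

print(convert("\u0628\u064E\u0644\u064C"))  # بَلٌ -> "balun"
```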
Quranic Arabic texts from "Surah Ikhlas" (chapter 112, verses 1–4) and "Surah An-Nass" (chapter 114, verses 1–6) were converted, as shown in Table 4. These results were generated using only Fathah, Kasrah, Dammah, and Tanveen.
It is worth mentioning that "no generally accepted database for Arabic text recognition is freely available for researchers" [37]. Hence, different researchers investigating Arabic text recognition use different data, and therefore the recognition rates of the different techniques may not be comparable.
After the conversion of Arabic texts to English texts, the outputs of the existing and proposed approaches were compared manually, and the proposed approach yielded the most readable form. These results are presented in Table 6 for the sample data. In the second column, the red-colored words from the existing approach are less readable, whereas in the third column, the corresponding words in green produced by the proposed approach can be read easily.
The converted documents were also checked using an online speaking tool.
The latent semantic indexing approach was applied to compare the accuracy of the existing and proposed approaches. For this purpose, the vector coordinates of the Arabic text documents, the documents of the existing approach, and the documents of the proposed approach were examined, and the coordinates of the existing and proposed documents were compared with those of the Arabic text. A query was also required to determine the coordinates. The documents of all approaches remained in the same order, and Document-6 was taken as the query, as shown in Table 7. In addition, the similarity distance of each document from the query was determined.
The main objective was to identify the coordinates of each document and query. Singular value decomposition (SVD) can determine the points, or coordinates, of a document and query. SVD factorizes an input matrix into three matrices, U, S, and V, which are used for further processing. As input, SVD requires a matrix of integer counts built from the text documents; this feature matrix is obtained by calculating the frequency of each word, so a feature matrix is first created from all the documents before the SVD is computed. The supporting matrices U, S, and V are then calculated using the NumPy library. The coordinates of all the documents were determined from S, and these coordinates were merged with the query to obtain the query coordinates. Finally, a cosine similarity function was applied to these coordinates to identify the documents closest to the query [38].
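The self-contained NumPy sketch below reproduces this procedure on three hypothetical toy documents (the study used its own 10 documents, with Document-6 as the query). The fold-in formula used for the query coordinates is a standard LSI choice and is an assumption here, not a detail confirmed by the paper.

```python
import numpy as np

# Toy stand-ins for the converted documents; all strings are illustrative.
docs = [
    "allahumma azhib gaydha qalbee",
    "rabbi ishrahlee sadree wa yassirlee amree",
    "allahumma innee a3uzubika min alham wa alhuzn",
]
query = "allahumma innee a3uzubika min alham wa alhuzn"

# Term-frequency feature matrix (terms x documents), built before the SVD.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# SVD: A = U @ diag(s) @ Vt. Keeping two singular values gives each
# document a 2-D coordinate (x, y): column j of the truncated Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_xy = Vt[:2, :].T  # shape (n_docs, 2)

# Fold the query into the same space: q_xy = q^T @ U_2 @ S_2^{-1}
# (a standard LSI fold-in, assumed here).
q = np.array([query.split().count(w) for w in vocab], dtype=float)
q_xy = q @ U[:, :2] @ np.diag(1.0 / s[:2])

# Rank documents by cosine similarity between their coordinates and the query.
for j, d in enumerate(doc_xy, start=1):
    sim = d @ q_xy / (np.linalg.norm(d) * np.linalg.norm(q_xy))
    print(f"D{j}: {sim:+.4f}")
```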
From S, the coordinates of all documents corresponding to the Arabic texts (column 1 in Table 5), the texts converted using the existing approach (column 1 in Table 6), and the texts converted using the proposed approach (column 2 in Table 6) were determined, as shown in Table 8. These coordinates were merged with the relevant queries from Table 7 to find the query coordinates, which are shown in Table 9. The data in Table 8 represent 10 documents each from the Arabic, proposed, and existing texts, and the accuracies are shown in Figure 5.
LSI, which was proposed in [39,40], is an efficient information retrieval algorithm [41]. In LSI, cosine similarity is measured between the coordinates of each document and those of the query. LSI thus uses the cosine similarity measure to rank the documents with respect to the query and to find the points (coordinates) closest to it.
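For reference, the standard form of the cosine measure used for this ranking, with document coordinates $d_j = (x_j, y_j)$ and query coordinates $q = (x_q, y_q)$, can be written as:

$$\operatorname{sim}(d_j, q) = \cos\theta = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert} = \frac{x_j x_q + y_j y_q}{\sqrt{x_j^2 + y_j^2}\,\sqrt{x_q^2 + y_q^2}}$$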
Table 10 lists the similarity distances of the sample of 10 documents from their relevant queries for the Arabic, proposed, and existing texts.
As shown in Table 10, the similarity values for the proposed work are the most similar to those for the Arabic text, whereas the values for the existing approach differ considerably. Figure 6 illustrates the comparison of the existing and proposed methods with the Arabic text; it clearly shows that the line for the proposed method is much closer to the line for the Arabic text than the line for the existing method is.
In Table 10, a value of 1 means a document is a 100% match to its query, a value of 0.9 a 90% match, and a value of 0.5 a 50% match. For the Arabic text and the proposed approach, D7, D8, and D10 are close to their queries, whereas for the existing approach, D1, D2, D3, D6, and D9 are close to their queries, as shown in Table 11. This indicates that the text converted using the proposed method behaves like the original Arabic text, unlike the text converted using the existing approach.
It was concluded that using the proposed work, all Arabic texts such as Arabic stop words, dictionaries, online reviews, tweets, lexicons, Quranic text, and hadith text can be easily converted into English texts, allowing for several types of analyses that require English text as input. Researchers can also use this lexicon to convert Arabic datasets into English texts.
Although Arabic text can be converted into readable English using these case markers, readability could be further improved by using other markers, such as Maddah, Dagger Alif, and Shaddah. Future work will focus on converting English texts back to Arabic by inverting the pairs ASC:ESC to ESC:ASC, ASC.A-FKD:ESC.E-rFKD to ESC.E-rFKD:ASC.A-FKD, and ASC.A-T:ESC.E-rT to ESC.E-rT:ASC.A-T, with the inclusion of the remaining case markers.
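As a hypothetical sketch of that planned inversion (assuming the English sides of the pairs are unique), each table can simply be flipped:

```python
# Hypothetical future-work sketch: invert an ASC:ESC table to ESC:ASC for
# English-to-Arabic conversion; assumes the ESC values are unique.
def invert(pairs: dict) -> dict:
    return {esc: asc for asc, esc in pairs.items()}
```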
No potential conflict of interest relevant to this article was reported.
Table 5. Sample sentences from Dataset-1.
Serial No | Arabic text | English converted text (Existing approach) |
---|---|---|
1 | | “Allahumma innee a’uzubika min ham ayhzununee, wa min fikr yuqliqunee, wa 3ilm yut3ibunee, wa shakhS yahmilu khubsan-lee”
2 | | “Allahuma inni ashku ilayka du’fa quwati wa qilata heelati wa hawaani ala annaas. yaa arham araahimin, anta rab wa anta rabi”
3 | | “Allahumma inni a’udhu bika min zawali ni’matika, wa tahawwuli’afiyatika, wa fuja’ati niqmatika, wa jami’i sakhatika”
4 | | “Allahumma azhib Gaydha Qalbee”
5 | | “Rabbi IshraHlee Sadree, wa yassirlee amree, waHlul Uqdatan min lisaanee, yafqahuu qawlee”
6 | | “Allahumma innee a3uzubika min alham wa alhuzn wa al3ajz wa alkusl wa albukhl wa aljubn wa galbah aldayn wa Galbah alrijaal”
Table 7. Query documents from all datasets.
Query from dataset | Document No from dataset | Text of selected document |
---|---|---|
Query (Arabic documents) | Document-6 | |
Query (Existing approach) | Document-6 | “Allahumma innee a3uzubika min alham wa alhuzn wa al3ajz wa alkusl wa albukhl wa aljubn wa galbah aldayn wa Galbah alrijaal” |
Query (Proposed approach) | Document-6 |
Table 8. Coordinates of all documents in all datasets.
Doc-No | Arabic X | Arabic Y | Proposed X | Proposed Y | Existing X | Existing Y |
---|---|---|---|---|---|---|
1 | −1.38872E-16 | 6.4016E-17 | 8.4641E-16 | 9.0951E-19 | −0.31422 | 0.219757 |
2 | −0.985261873 | 0.044420577 | 0.98526187 | 0.04442058 | −0.28588 | 0.207485 |
3 | −4.17806E-16 | −1.08385E-16 | 9.645E-17 | 1.0961E-18 | −0.29549 | 0.170587 |
4 | −8.27432E-17 | 2.33523E-16 | 8.2149E-17 | −9.72E-17 | −0.02333 | −0.02504 |
5 | 3.1507E-16 | 7.12019E-17 | −1.099E-15 | −1.881E-17 | −0.10085 | 0.080281 |
6 | 2.06014E-16 | 4.21434E-17 | −6.323E-16 | 1.8952E-19 | −0.64589 | 0.343129 |
7 | −0.029725105 | −0.956431375 | 0.02972511 | −0.9564314 | −0.1016 | 0.033962 |
8 | −0.016770068 | −0.27385777 | 0.01677007 | −0.2738578 | −0.18761 | 0.036649 |
9 | −0.167423851 | −0.060957415 | 0.16742385 | −0.0609574 | −0.11867 | 0.07706 |
10 | −0.007967295 | −0.067468161 | 0.0079673 | −0.0674682 | −0.49471 | −0.86389 |
Table 9. Query coordinates from all documents.
Coordinates | Arabic query | Proposed query | Existing query |
---|---|---|---|
X,Y | −0.022388, −0.756423 | 0.033842, −0.7614 | −0.65167, 0.3559 |
Table 10. Distance of each document in each dataset from its relevant query.
Documents | Arabic | Proposed | Existing |
---|---|---|---|
D1 | −0.391582646215 | 0.0433259292082 | 0.993919277857 |
D2 | −0.0154648277095 | −0.000640494192478 | 0.991839130391 |
D3 | 0.279630504392 | 0.0330440114234 | 0.999721682081 |
D4 | −0.932286912287 | 0.79168449571 | 0.247472435013 |
D5 | −0.249189850556 | −0.0273075914125 | 0.985180590511 |
D6 | −0.22931184731 | −0.0446988765309 | 0.999932316216 |
D7 | 0.999998904667 | 0.999910958682 | 0.984310426495 |
D8 | 0.999501678847 | 0.999859788963 | 0.953227104676 |
D9 | 0.36976960257 | 0.383502674056 | 0.997116046374 |
D10 | 0.996134343355 | 0.997327099051 | 0.0200963618144 |
Table 11. Closest documents to relevant query.
Approaches | Close to query w.r.t points |
---|---|
Arabic text documents | D7(0.999998904667), D8(0.999501678847), D10(0.996134343355) |
Proposed text documents | D7(0.999910958682), D8(0.999859788963), D10(0.997327099051) |
Existing text documents | D1(0.993919277857), D2(0.991839130391), D3(0.999721682081), D6(0.999932316216) |
E-mail: hjalyamani@kau.edu.sa
E-mail: sarahamd@kau.edu.sa
E-mail: shassan1@kau.edu.sa
E-mail: saqibsheikh4@gu.edu.pk
E-mail: yalotaibi@kau.edu.sa
Figure 1. A sample of Arabic writing [17].
Figure 2. Conversion of Arabic character to English character using A-FKD.
Figure 3. Conversion of Arabic character to English character using A-T.
Figure 4. Arabic to English complete sentence conversion.
Figure 5. Comparison of existing and proposed approaches with Arabic text based on coordinates.
Figure 6. Comparison of proposed work and existing work based on similarity distance.
Table 1. All ASC:ESC pairs.
Table 2. All ASC.A-FKD and ESC.E-rFKD pairs.
Table 3. All ASC.A-T and ESC.E-rT pairs.
Table 4. Conversion of “Surah Ikhlas” and “Sura An-Nass”.
Table 6. Text comparison of existing and proposed approaches.