
Original Article


International Journal of Fuzzy Logic and Intelligent Systems 2021; 21(4): 409-422

Published online December 25, 2021

https://doi.org/10.5391/IJFIS.2021.21.4.409

© The Korean Institute of Intelligent Systems

LSI Authentication-Based Arabic to English Text Converter

Hasan J. Alyamani1, Shakeel Ahmad2, Asif Hassan Syed2, Sheikh Muhammad Saqib3 , and Yasser D. Al-Otaibi1

1Department of Information Systems, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia
2Department of Computer Science, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia
3Institute of Computing and Information Technology, Gomal University, Dera Ismail Khan, Pakistan

Correspondence to :
Sheikh Muhammad Saqib (saqibsheikh4@gu.edu.pk)

Received: July 7, 2021; Revised: September 9, 2021; Accepted: October 5, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Due to the large amount of Arabic text produced daily, there is a need to analyze these texts. A comprehensive literature review revealed several issues related to Arabic text summarization, keyword extraction, and sentiment analysis. These issues arise from several factors, such as the structure and morphology of Arabic text, a lack of machine-readable Arabic dictionaries, insufficient tools for managing Arabic text, the absence of standard datasets, inherently cursive scripts, and isolated characters; thus, there is a need to produce Arabic text in forms that can be easily read by machine learning and deep learning algorithms and by existing analysis tools. To achieve this, the Arabic texts must be converted into English texts. This paper proposes a lexicon, the AEC-Lexicon, for use by all researchers working on Arabic; it is based on the Arabic case system and converts Arabic text into English text. Based on experimental results using latent semantic indexing (LSI), texts generated by the proposed work exhibited a significant improvement over existing work (Arabic text converted to English) in terms of readability and comprehension as well as relevance to the original Arabic text.

Keywords: Arabic text, Machine learning algorithms, Lexicon, Automatic generated summary

1. Introduction

This section describes the problems in the literature and how the proposed work overcomes them. There are six major dialectal groups in Saudi Arabia. The main dialects in Saudi Arabia are the Hejazi dialect and the Nejdi dialect, with six million and eight million speakers, respectively [1]. According to the Arab Social Media Report in 2017 [2], there are 2.6 million active Twitter users in Saudi Arabia. Learning Arabic makes one stand out, as very few people from the West speak Arabic, and having command of the language is regarded as clever and sophisticated [3]; accordingly, a considerable amount of research has been done on Arabic texts. Arabic summarization is a highly difficult task because the Arabic language is complex with respect to its morphology and structure [4]. Although standard Arabic morphological analysis tools are not available [5], some researchers have used existing common Arabic stemming as a workaround [6].

Existing resources for Arabic texts are limited, and the accuracy of existing methods is low owing to Arabic-specific issues (e.g., limited resources, Arabic’s morphological complexity, differing dialects) and general linguistic issues (e.g., implicit sentiment, sarcasm, polarity fuzziness, polarity strength, review quality, spam, domain dependence) [7]. Stemming Arabic words is a complex issue in the field of Arabic text classification [8]. The studies in [9] and [10] investigated the issues and challenges of identifying sentiment in informal Arabic in the context of Twitter and YouTube content, which is unstructured in nature. Limited research has been conducted on opinion mining using Arabic Twitter [11] because the majority of natural language processing (NLP) tools for Arabic have been developed for Modern Standard Arabic (MSA) rather than informal Arabic [12]. In contrast, machine learning tools and resources are widely available for English, so these issues do not arise once the text is in a machine-readable English form.

Thus, there is a strong need to analyze the bulk of Arabic text being created daily, but it is difficult to manage because of the following issues involved in handling Arabic text:

  • - The complexity of the Arabic language related to structure and morphology [4].

  • - Due to the inherent complexity of the Arabic language, very little research has been conducted combining Arabic text summarization and text entailment to produce extracts [13].

  • - Lack of existing Arabic morphological analysis tools [5].

  • - A limited number of documents with keywords are available online [14].

  • - There is a lack of techniques for scanning and matching of Arabic texts to identify key phrases [15].

  • - No gold standard summaries for Arabic and a lack of machine-readable Arabic dictionaries [16].

  • - Little current research on the automatic classification of Arabic text documents owing to Arabic stemming issues [8] and Arabic-specific and general linguistic issues [7]; moreover, existing tools target MSA and ignore informal Arabic [11].

The most obvious characteristics of the Arabic language are that Arabic scripts are inherently cursive and that isolated characters are written out, as shown in Figure 1 [17]; thus, there is a need to convert Arabic text into English text for further processing.

The primary purpose of the proposed methodology is to address all of the issues discussed. For example, there should be a common and consistent representation of Arabic text that resolves the issues relating to machine readability, the generation of standard summaries, and morphology, and that can be read by different tools. In this study, Arabic texts were converted into English texts by preparing a lexicon for conversion. The coordinates of documents produced by the existing approach were also compared with those of documents produced by the proposed approach using latent semantic indexing, and the documents produced by the proposed method were found to be highly similar to the Arabic documents. The key contribution of this work is the proposed lexicon, which is based on several generated rules:

  • - Proposing Simple-Rules (based on the absence of case systems).

  • - Proposing FKD-Rules based on Fathah, Kasrah, and Dammah.

  • - Proposing T-Rules based on Tanween.

  • - Proposing a user-friendly interface for converting Arabic texts into English texts.

2. Related Work

Using an artificial intelligence (AI) approach, a study [18] investigated Arabic and Islamic content on the Internet, extracting relevant information from prophetic narration texts. The authors proposed a semantically driven approach to analyzing Arabic discourse following the segmented discourse representation theory (SDRT) framework. It has also been found that discourse analysis can be used to produce indicative summaries of Arabic documents [13]. The authors of [19] proposed a summarization model based on clusters resulting from document clustering and keyphrase extraction. The authors of [20] proposed a supervised approach based on AdaBoost to produce extractive Arabic summaries; a set of statistical features was used, including sentence position, the number of keywords in a sentence, overlap with the title words, and sentence length. The approach in [21] produces automatic summaries of multiple large-scale Arabic documents using a genetic algorithm and the MapReduce parallel programming model. Few studies have been conducted on approaches that ensure the scalability, speed, and accuracy of summary generation [13].

Arabic text summarization is still an emerging area because of the inherent complexity of the Arabic language and the lack of a standardized evaluation process; there are no gold-standard summaries or machine-readable dictionaries for Arabic [16]. One model presents a query-based Arabic text summarization system using Arabic WordNet and an extracted knowledge base [22]. As reported in [23], a weighted directed graph represents each document, with nodes corresponding to sentences and edge weights representing similarities between sentences. The authors of [24] introduced a new method for Arabic text summarization based on graph theory; however, summarization systems for Arabic are still not as mature or as reliable as those developed for other languages such as English. In [25], the authors addressed the problem of developing an Arabic text summarization system (LCEAS) that produces extracts without redundancy.

Many text classification (TC) studies have been conducted and tested on the English, French, German, Spanish, Chinese, Greek, and Japanese languages [26]. Opinion mining is the process of automatically identifying opinions expressed in Arabic texts on certain subjects [27]; however, there is little current research on the automatic classification of Arabic text documents because of varying spellings of certain words, various ways to write certain character combinations, and short (diacritic) and long vowels. In addition, most Arabic words contain affixes [28]. Mesleh [29] utilized three machine-learning algorithms, support vector machine (SVM), k-nearest neighbor (KNN), and naïve Bayes (NB), to classify Arabic data collected from Arabic newspapers. The studies [30] and [31] used TC and analyses of Arabic texts to automatically assign texts to predefined categories based on their linguistic features. The NB, KNN, multinomial logistic regression (MLR), and maximum weight (MW) classifiers were used in [32], with the NB classifier implemented as the master and the others as slaves. All tests were performed after preprocessing the Arabic text documents, and the results showed that the master-slave technique yields a significant improvement in classification accuracy compared to other techniques [32]. The authors of [7] also used three machine-learning classification techniques, NB, SVM, and decision trees, to improve their sentiment analyzer. Stemming and document-embedding techniques were used to investigate the impact of preprocessing methods on the performance of three machine-learning algorithms, NB, discriminative multinomial naive Bayes (DMNB) text, and C4.5, all of which were used for Arabic text categorization [33]. The results of the sentiment and subjectivity analyses demonstrated that while the use of lexeme or lemma data is useful, individualized solutions are also needed for each task and genre [10].

3. Case System

This section provides information about the case system of Arabic, which corresponds to vowels in English. The case endings in Arabic are small marks (Harakaat) appended to the ends of words to indicate their grammatical functions in a sentence. Case endings are usually not written (with one exception) outside of the Qur’an/Bible and children’s books; however, newscasters pronounce them, and to speak formal Arabic well, it is important to be familiar with the case system [3,34].

In the proposed work, three rules were created to update the lexicon: the Simple-Rule, FKD-Rule, and T-Rule. The FKD-Rule is based on Fathah, Kasrah, and Dammah, while the T-Rule is based on Tanween. The following Arabic case systems ( ) are matched with vowels and some other Arabic sounds. In Arabic, Fathah is a small diagonal line placed above a letter ( ) and maps to the letter a in English. Kasrah ( ) is a small diagonal line placed below a letter ( ) and represents a short i in English. Dammah is a small curl placed above a letter ( ) and is represented by u or o in English. Tanween corresponds to the three vowel marks that may be doubled at the end of a word, indicating that the vowel is followed by the consonant n; from left to right, the signs ( ) are transliterated as an, in, and un.

4. AEC-Lexicon

The Arabic to English converter lexicon (AEC-Lexicon) was developed from different sets of pairs and consists of three categories of pairs. Some heuristics based on Arabic sounds were used to develop this lexicon. The first set was generated using only simple similar-sounding characters from Arabic and English, as given in the transliteration system [35]. The second set was generated from similar-sounding characters in Arabic and English using Fathah, Kasrah, and Dammah. The third set was generated from similar-sounding characters in Arabic and English using Tanween.

4.1 Simple-Rule

The Simple-Rule consists of simple characters without a case system. This rule consists of a tuple containing a pair denoted ASC:ESC, where ASC means Arabic sound character(s) (i.e., ) and ESC means relevant English sound character (i.e., S). After processing, 20 pairs of Arabic to English sounds were produced, as shown in Table 1.
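To make the pair structure concrete, the sketch below holds a few Simple-Rule pairs as a Python dictionary and applies them by direct character substitution. The three Arabic characters and the helper name apply_simple_rule are our own illustrative assumptions; the authoritative 20 pairs are those listed in Table 1.

```python
# Minimal sketch of the Simple-Rule, assuming the pairs are kept as a dict.
# The three entries below are hypothetical examples, not the published Table 1.
SIMPLE_PAIRS = {
    "\u0628": "b",  # ARABIC LETTER BEH  -> b (illustrative pair)
    "\u0633": "s",  # ARABIC LETTER SEEN -> s (illustrative pair)
    "\u0645": "m",  # ARABIC LETTER MEEM -> m (illustrative pair)
}

def apply_simple_rule(text: str) -> str:
    """Replace each Arabic sound character (ASC) with its English counterpart (ESC)."""
    return "".join(SIMPLE_PAIRS.get(ch, ch) for ch in text)
```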

4.2 FKD-Rule

The basic case systems used in Arabic that produce sound are A-FKD = {FATHAH, KASRAH, DAMMAH} and E-rFKD = {A, I, U}, where “A” replaces FATHAH, “I” replaces KASRAH, and “U” replaces DAMMAH. As shown in Eqs. (1) and (2), the dot symbol ‘.’ connects each Arabic character in ASC with FATHAH, KASRAH, and DAMMAH and each English character in ESC with A, I, and U. The two sides are then paired with a colon, as shown in Eq. (3). All pairs were created according to the relation ASC.A-FKD:ESC.E-rFKD using the following equations:


$$\text{A-FKD} = \{\text{FATHAH},\ \text{KASRAH},\ \text{DAMMAH}\}, \tag{1}$$

$$\text{E-rFKD} = \{A,\ I,\ U\} \quad (\text{replacement FKD}), \tag{2}$$

$$\text{Pair-FKD}(A{:}E) = \bigcup_{i=1}^{m} \bigcup_{j=1}^{3} \Big( (\text{ASC}_{i}).(\text{A-FKD})_{j} : (\text{ESC}_{i}).(\text{E-rFKD})_{j} \Big). \tag{3}$$

The equation Pair-FKD(A:E) is a pair consisting of ASC.A-FKD and ESC.E-rFKD and represents a tuple, i.e., for Arabic character ‘ ’, the tuple with the English character B becomes ( ), as shown in Figure 2.

Table 2 contains all pairs generated using the FKD-Rule.
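As a sketch of how the Pair-FKD tuples of Eq. (3) can be generated programmatically, the snippet below takes a hypothetical base ASC:ESC mapping and forms the Cartesian product with the three FKD markers; the variable names and the two example base pairs are our assumptions, not the authors' code. The T-Rule pairs of Section 4.3 can be built the same way by swapping in the Tanween marks and the replacements AN, IN, and UN.

```python
from itertools import product

# Hypothetical base ASC:ESC pairs (the authoritative ones are in Table 1).
BASE_PAIRS = {"\u0628": "B", "\u0633": "S"}

# FKD markers Fathah, Kasrah, Dammah and their English replacements A, I, U.
A_FKD = ["\u064E", "\u0650", "\u064F"]   # combining Fathah, Kasrah, Dammah
E_RFKD = ["A", "I", "U"]

def build_pair_fkd(base_pairs: dict) -> dict:
    """Return {ASC.A-FKD: ESC.E-rFKD} for every base pair and every FKD marker."""
    pairs = {}
    for (asc, esc), (a_mark, e_mark) in product(base_pairs.items(),
                                                zip(A_FKD, E_RFKD)):
        pairs[asc + a_mark] = esc + e_mark
    return pairs

print(build_pair_fkd(BASE_PAIRS))  # six pairs, e.g. Beh + Fathah -> "BA"
```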

4.3 T-Rule

The “T” in T-Rule refers to Tanween ( ), and this rule also builds on the FKD-Rule. The Tanween marks used in Arabic are A-T = { } and E-rT = {AN, IN, UN}, where “AN” is a replacement for , “IN” is a replacement for , and “UN” is a replacement for , as shown in Eqs. (4) and (5). The dot symbol ‘.’ connects each Arabic character in ASC with a Tanween mark and each English character in ESC with AN, IN, and UN. The two sides are then paired with a colon, as shown in Eq. (6). All pairs were created according to the relation ASC.A-T:ESC.E-rT.


$$\text{A-T} = \{\text{the three Tanween marks}\}, \tag{4}$$

$$\text{E-rT} = \{AN,\ IN,\ UN\}, \tag{5}$$

$$\text{Pair-T}(A{:}E) = \bigcup_{i=1}^{m} \bigcup_{j=1}^{3} \Big( (\text{ASC}_{i}).(\text{A-T})_{j} : (\text{ESC}_{i}).(\text{E-rT})_{j} \Big). \tag{6}$$

The equation Pair-T(A:E) is a pair of ASC.A-T and ESC.E-rT representing a tuple, i.e., for Arabic character ‘ ’, the tuple with the English character B becomes ( ) as shown in Figure 3.

All pairs generated using the T-Rule are shown in Table 3. In Table 3, the rows with black text have unique sounds, while the remaining colors maintain the same sounds from Arabic to English.

These rules were implemented in Python because it is an open-source language that is easy to understand, learn, and use, and it has highly useful libraries for data manipulation and analysis. Compared with other modern languages, Python-based solutions are remarkable in terms of agility and productivity, and many companies in all regions, from the largest investment banks to the smallest social/mobile web app startups, have used Python to operate their businesses and manage their data [36].

After generating the three rules, they were applied to Arabic sentences. Either the FKD-Rule or T-Rule can be applied first, and then the Simple-Rule may be applied, because applying the Simple-Rule first when converting Arabic texts into English may result in a less readable form. The conversion process is shown in Figure 4.
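The ordering constraint can be captured by substituting the longer, diacritic-bearing keys before the bare Simple-Rule keys, so that a base character is never stripped of its case marker prematurely. The sketch below assumes the three rule tables are plain dictionaries like those built above; it illustrates the ordering of Figure 4 and is not the authors' implementation.

```python
def convert(text: str, t_pairs: dict, fkd_pairs: dict, simple_pairs: dict) -> str:
    """Apply T-Rule and FKD-Rule substitutions before the Simple-Rule ones."""
    for table in (t_pairs, fkd_pairs, simple_pairs):
        # Within each table, substitute longer keys first as an extra safeguard.
        for asc in sorted(table, key=len, reverse=True):
            text = text.replace(asc, table[asc])
    return text
```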

Quranic Arabic texts from “Surah Ikhlas” (chapter 112, verses 1–4) and “Surah An-Nass” (chapter 114, verses 1–6) were converted, as shown in Table 4. These results were generated using only Fathah, Kasrah, Dammah, and Tanween.

5. Results

It is worth mentioning that “no generally accepted database for Arabic text recognition is freely available for researchers” [37]. Hence, different researchers investigating Arabic text recognition use different data, and the recognition rates of different techniques may therefore not be comparable. Dataset-1 was prepared from the website https://www.sujood.co, which hosts publicly available collections of prophetic traditions (hadith) containing the texts required by the proposed approach. The approach was tested on Dataset-1, which contained three forms of each hadith: the first form is the text in Arabic, the second is the English-converted text, and the third is the English translation. Only the first and second forms are discussed here, because the primary aim of the proposed method is to convert Arabic texts into English texts; the second form is compared with the output of the proposed approach. Sample listings from the dataset are presented in Table 5.

After the conversion of Arabic texts to English texts, the existing and proposed approaches were compared, and the proposed approach yielded the most readable form. These conversions are presented in Table 6 for the sample data. In the second column, the red-colored words from the existing approach are less readable, whereas in the third column, the corresponding words in green produced by the proposed approach can be read easily.

5.1 Audio Result

The converted documents were also checked using an online speaking tool (www.fromtexttospeech.com), and the sounds were similar to the original Arabic sounds, to the extent allowed by the developed rules.

5.2 Statistical Measures

The latent semantic indexing (LSI) approach was applied to assess the accuracy of the existing and proposed approaches. For this purpose, the vector coordinates of the documents of the Arabic texts, the documents of the existing approach, and the documents of the proposed approach were examined, and the coordinates of the existing and proposed documents were compared with those of the Arabic text. A query was also required to determine the coordinates. The documents of all approaches were kept in the same order, and Document-6 was taken as the query, as shown in Table 7. In addition, the similarity distance of each document from its relevant query was determined.

5.2.1 Vector coordinates of documents

The main objective was to identify the coordinates of each document and of the query. Singular value decomposition (SVD) can determine these points or coordinates: it decomposes a matrix into three matrices, S, V, and U, which are used for further processing. SVD requires as input a matrix of integers whose rows and columns are derived from the text documents; this feature matrix is obtained by calculating the frequency of each word and is created from all the documents before the SVD is computed. Subsequently, the matrices S, V, and U are calculated using the NumPy library. The coordinates of all the documents were determined from S, and these coordinates were merged with the query to obtain the query coordinates. Finally, a cosine similarity function was applied to these coordinates to identify the documents closest to the query [38].
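A minimal NumPy sketch of this pipeline is given below: a word-count feature matrix is built from the documents, np.linalg.svd decomposes it, and each document is projected onto the first two latent dimensions. The two-dimensional truncation, the toy documents, and the function names are our assumptions; the paper reports only the resulting coordinates (Table 8).

```python
import numpy as np
from collections import Counter

def term_document_matrix(docs):
    """Rows = vocabulary terms, columns = documents, entries = word frequencies."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [Counter(d.split()) for d in docs]
    return vocab, np.array([[c[w] for c in counts] for w in vocab], dtype=float)

def document_coordinates(docs, k=2):
    """Project each document onto the first k latent dimensions via SVD."""
    _, A = term_document_matrix(docs)
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    # Scaling the document columns of Vt by the singular values gives (X, Y) coordinates.
    return (np.diag(S[:k]) @ Vt[:k, :]).T   # shape: (number of documents, k)

docs = ["allahumma innee a3uzubika",          # toy documents standing in for Dataset-1
        "rabbi ishrahlee sadree",
        "allahumma azhib gaydha qalbee"]
print(document_coordinates(docs))
```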

From S, the coordinates of all documents corresponding to the Arabic texts (column-1 in Table 5), the texts converted using the existing approach (column-1 in Table 6), and the texts converted using the proposed approach (column-2 in Table 6) were determined, as shown in Table 8. These coordinates were merged with the relevant queries to obtain the query coordinates, which are shown in Table 9. The data in Table 8 represent 10 documents each from the Arabic, proposed, and existing texts, and a comparison based on these coordinates is shown in Figure 5.

5.2.2 Similarity distance of documents from the query

LSI, proposed in [39,40], is an efficient information retrieval algorithm [41]. In LSI, the cosine similarity between the coordinates (X, Y) of a document vector and the coordinates (X, Y) of a query vector indicates how closely the document matches the query (e.g., a value of 1 means the document matches the query 100%, a value of 0.5 means 50%, and a value of 0.9 means 90%). The coordinates of each document and of the query are obtained exactly as described in Section 5.2.1: a word-frequency feature matrix is built from all the documents, SVD is computed with NumPy to obtain S, V, and U, the document coordinates are determined from S, and the query coordinates are obtained by merging the query with these coordinates. Finally, the cosine similarity function is applied to these coordinates to find the documents that best match the query.

LSI thus uses the cosine similarity measure to rank the documents with respect to the query, comparing the coordinates (X, Y) of each document with the query coordinates (X, Y) using Eq. (7). LSI also determines the query coordinates for the Arabic text, the Arabic-to-English text converted using the proposed technique, and the Arabic-to-English text converted using the existing technique; these query coordinates are shown in Table 9.

$$\text{Similarity}(A,B) = \frac{\sum_{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}}\ \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}. \tag{7}$$
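Eq. (7) is the standard cosine similarity. A minimal implementation is shown below; as a sanity check it is applied to the coordinates reported for Document-7 and the query of the proposed approach (Tables 8 and 9), and it reproduces the corresponding value in Table 10.

```python
import numpy as np

def similarity(a, b) -> float:
    """Cosine similarity between two coordinate vectors, as in Eq. (7)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Coordinates taken from Table 8 (Document-7, proposed) and Table 9 (proposed query).
d7_proposed = (0.02972511, -0.9564314)
query_proposed = (0.033842, -0.7614)
print(similarity(d7_proposed, query_proposed))  # ~0.99991, matching D7 in Table 10
```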

Table 10 depicts the similarity distance of a sample of 10 Arabic documents that were queried, including queries for the proposed and existing methods.

As shown in Table 10, the values for the proposed work are most similar to those for the Arabic text, whereas the values for the existing approach differ considerably. Figure 6 compares the existing and proposed methods with the Arabic text; the line for the proposed method is clearly much closer to the line for the Arabic text than the line for the existing method is.

In Table 10, a document matches the query 100% if the value is 1, 50% for a value of 0.5, and 90% for a value of 0.9. For the Arabic text and the proposed approach (green), D7, D8, and D10 are closest to their queries, whereas for the existing approach (gray), D1, D2, D3, D6, and D9 are closest to their queries, as shown in Table 11. This indicates that the text converted into English by the proposed method behaves like the original Arabic text, which is not the case for the existing English-converted text.

It was concluded that using the proposed work, all Arabic texts such as Arabic stop words, dictionaries, online reviews, tweets, lexicons, Quranic text, and hadith text can be easily converted into English texts, allowing for several types of analyses that require English text as input. Researchers can also use this lexicon to convert Arabic datasets into English texts.

Although Arabic text can be converted into readable English using these case systems, readability could be further improved by incorporating other case systems, such as Maddah / /, Dagger Alif / /, and Shaddah / /. Future work will focus on converting English texts back into Arabic texts by inverting the pairs ASC:ESC to ESC:ASC, ASC.A-FKD:ESC.E-rFKD to ESC.E-rFKD:ASC.A-FKD, and ASC.A-T:ESC.E-rT to ESC.E-rT:ASC.A-T, with the inclusion of the remaining case systems.

Fig. 1.

A sample of Arabic writing [17].


Fig. 2.

Conversion of Arabic character to English character using A-FKD.


Fig. 3.

Conversion of Arabic character to English character using A-T.


Fig. 4.

Arabic to English complete sentence conversion.


Fig. 5.

Comparison of existing and proposed approaches with Arabic text based on coordinates.


Fig. 6.

Comparison of proposed work and existing work based on similarity distance.



Table 1. All ASC:ESC pairs.



Table 2. All ASC.A-FKD and ESC.E-rFKD pairs.



Table 3. All ASC.A-T and ESC.E-rT pairs.



Table 4. Conversion of “Surah Ikhlas” and “Sura An-Nass”.



Table 5. Sample sentences from Dataset-1.

Serial No | Arabic text | English converted text (Existing approach)
1 | | “Allahumma innee a’uzubika min ham ayhzununee, wa min fikr yuqliqunee, wa 3ilm yut3ibunee, wa shakhS yahmilu khubsan-lee”
2 | | “Allahuma inni ashku ilayka du’fa quwati wa qilata heelati wa hawaani ala annaas. yaa arham araahimin, anta rab wa anta rabi”
3 | | “Allahumma inni a’udhu bika min zawali ni’matika, wa tahawwuli’afiyatika, wa fuja’ati niqmatika, wa jami’i sakhatika”
4 | | “Allahumma azhib Gaydha Qalbee”
5 | | “Rabbi IshraHlee Sadree, wa yassirlee amree, waHlul Uqdatan min lisaanee, yafqahuu qawlee”
6 | | “Allahumma innee a3uzubika min alham wa alhuzn wa al3ajz wa alkusl wa albukhl wa aljubn wa galbah aldayn wa Galbah alrijaal”


Table 6. Text comparison of existing and proposed approaches.



Table 7. Queries document from all datasets.

Query from dataset | Document No from dataset | Text of selected document
Query (Arabic documents) | Document-6 |
Query (Existing approach) | Document-6 | “Allahumma innee a3uzubika min alham wa alhuzn wa al3ajz wa alkusl wa albukhl wa aljubn wa galbah aldayn wa Galbah alrijaal”
Query (Proposed approach) | Document-6 |


Table 8. Coordinates of all documents all datasets.

Doc-No | Arabic X | Arabic Y | Proposed X | Proposed Y | Existing X | Existing Y
1 | −1.38872E-16 | 6.4016E-17 | 8.4641E-16 | 9.0951E-19 | −0.31422 | 0.219757
2 | −0.985261873 | 0.044420577 | 0.98526187 | 0.04442058 | −0.28588 | 0.207485
3 | −4.17806E-16 | −1.08385E-16 | 9.645E-17 | 1.0961E-18 | −0.29549 | 0.170587
4 | −8.27432E-17 | 2.33523E-16 | 8.2149E-17 | −9.72E-17 | −0.02333 | −0.02504
5 | 3.1507E-16 | 7.12019E-17 | −1.099E-15 | −1.881E-17 | −0.10085 | 0.080281
6 | 2.06014E-16 | 4.21434E-17 | −6.323E-16 | 1.8952E-19 | −0.64589 | 0.343129
7 | −0.029725105 | −0.956431375 | 0.02972511 | −0.9564314 | −0.1016 | 0.033962
8 | −0.016770068 | −0.27385777 | 0.01677007 | −0.2738578 | −0.18761 | 0.036649
9 | −0.167423851 | −0.060957415 | 0.16742385 | −0.0609574 | −0.11867 | 0.07706
10 | −0.007967295 | −0.067468161 | 0.0079673 | −0.0674682 | −0.49471 | −0.86389


Table 9. Query coordinates from all documents.

Coordinates | Arabic query | Proposed query | Existing query
X, Y | −0.022388, −0.756423 | 0.033842, −0.7614 | −0.65167, 0.3559


Table 10. Distance of all documents of datasets from their relevant query.

Documents | Arabic | Proposed | Existing
D1 | −0.391582646215 | 0.0433259292082 | 0.993919277857
D2 | −0.0154648277095 | −0.000640494192478 | 0.991839130391
D3 | 0.279630504392 | 0.0330440114234 | 0.999721682081
D4 | −0.932286912287 | 0.79168449571 | 0.247472435013
D5 | −0.249189850556 | −0.0273075914125 | 0.985180590511
D6 | −0.22931184731 | −0.0446988765309 | 0.999932316216
D7 | 0.999998904667 | 0.999910958682 | 0.984310426495
D8 | 0.999501678847 | 0.999859788963 | 0.953227104676
D9 | 0.36976960257 | 0.383502674056 | 0.997116046374
D10 | 0.996134343355 | 0.997327099051 | 0.0200963618144


Table 11. Closest documents to relevant query.

Approaches | Close to query w.r.t. points
Arabic text documents | D7 (0.999998904667), D8 (0.999501678847), D10 (0.996134343355)
Proposed text documents | D7 (0.999910958682), D8 (0.999859788963), D10 (0.997327099051)
Existing text documents | D1 (0.993919277857), D2 (0.991839130391), D3 (0.999721682081), D6 (0.999932316216)

  1. Aldayel, HK, and Azmi, AM (2016). Arabic tweets sentiment analysis: a hybrid scheme. Journal of Information Science. 42, 782-797. https://doi.org/10.1177%2F0165551515610513
  2. Salem, F (2017). Social media and the internet of things towards data-driven policymaking in the Arab world: potential, limits and concerns,” MBR School of Government. The Arab Social Media Report no 7. Dubai, UAE
  3. Arabic Language blog. (2011) . Arabic Diacritics (Al-Tashkeel). Available: https://blogs.transparent.com/arabic/2-arabic-diacritics-al-tashkeel
  4. Rose, S, Engel, D, Cramer, N, and Cowley, W (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory. Chichester, UK: Wiely, pp. 1-20 https://doi.org/10.1002/9780470689646.ch1
  5. Saad, MK 2010. The impact of text preprocessing and term weighting on Arabic text classification. Ph.D. dissertation. The Islamic University of Gaza. Palestine.
  6. Conchillo, A, Cercaci, L, Ansorena, D, Rodriguez-Estrada, MT, Lercker, G, and Astiasaran, I (2005). Levels of phytosterol oxides in enriched and nonenriched spreads: application of a thin-layer chromatography-gas chromatography methodology. Journal of Agricultural and Food Chemistry. 53, 7844-7850. https://doi.org/10.1021/jf050539m
  7. Hamouda, AEDA, and El-taher, FEZ (2013). Sentiment analyzer for Arabic comments system. International Journal of Advanced Computer Science and Applications. 4, 99-103. https://doi.org/10.14569/IJACSA.2013.040317
  8. Elhassan, R, and Ahmed, M (2015). Arabic text classification review. International Journal of Computer Science and Software Engineering. 4, 1-5.
  9. AlOtaibi, S, and Khan, MB (2017). Sentiment analysis challenges of informal Arabic. International Journal of Advanced Computer Science and Applications. 8, 278-284. https://doi.org/10.14569/ijacsa.2017.080237
  10. Abdul-Mageed, M, Diab, M, and Kubler, S (2014). SAMAR: Subjectivity and sentiment analysis for Arabic social media. Computer Speech & Language. 28, 20-37. https://doi.org/10.1016/j.csl.2013.03.001
  11. Alwakid, G, Osman, T, and Hughes-Roberts, T (2017). Challenges in sentiment analysis for Arabic social networks. Procedia Computer Science. 117, 89-100. https://doi.org/10.1016/j.procs.2017.10.097
  12. Alowisheq, A, Alhumoud, S, Altwairesh, N, and Albuhairi, T (2016). Arabic sentiment analysis resources: a survey. Social Computing and Social Media. Cham, Switzerland: Springer, pp. 267-278 https://doi.org/10.1007/978-3-319-39910-2_25
  13. Lagrini, S, Redjimi, M, and Azizi, N (2017). Automatic Arabic text summarization approaches. International Journal of Computer Applications. 164, 31-37. https://doi.org/10.5120/ijca2017913628
  14. El-Fishawy, N, Hamouda, A, Attiya, GM, and Atef, M (2014). Arabic summarization in twitter social network. Ain Shams Engineering Journal. 5, 411-420. https://doi.org/10.1016/j.asej.2013.11.002
  15. Najadat, HM, Hmeidi, II, Al-Kabi, MN, and Issa, MMB (2016). Automatic keyphrase extractor from Arabic documents. International Journal of Advanced Computer Science and Applications. 7, 192-199. https://doi.org/10.14569/ijacsa.2016.070226
  16. Al Qassem, LM, Wang, D, Al Mahmoud, Z, Barada, H, Al-Rubaie, A, and Almoosa, NI (2017). Automatic Arabic summarization: a survey of methodologies and systems. Procedia Computer Science. 117, 10-18. https://doi.org/10.1016/j.procs.2017.10.088
  17. Khorsheed, MS, and Al-Omari, H . Recognizing cursive Arabic text: Using statistical features and interconnected mono-HMMs., Proceedings of 2011 4th International Congress on Image and Signal Processing, 2011, Shanghai, China, Array, pp.1540-1543. https://doi.org/10.1109/CISP.2011.6100511
  18. Atwell, E, Brierley, C, Dukes, K, Sawalha, M, and Sharaf, AB . An artificial intelligence approach to Arabic and Islamic content on the Internet., Proceedings of NITS 3rd National Information Technology Symposium, 2011, Riyadh, Saudi Arabia, Array, pp.1-8. https://doi.org/10.13140/2.1.2425.9528
  19. Fejer, HN, and Omar, N (2015). Automatic multi-document Arabic text summarization using clustering and keyphrase extraction. Journal of Artificial Intelligence. 8, 1-9. https://doi.org/10.3923/jai.2015.1.9
  20. Belkebir, R, and Guessoum, A (2015). A supervised approach to Arabic text summarization using Adaboost. New Contributions in Information Systems and Technologies. Cham, Switzerland: Springer, pp. 227-236 https://doi.org/10.1007/978-3-319-16486-1_23
  21. Baraka, RS, and Al Breem, SN . Automatic Arabic text summarization for large scale multiple documents using genetic algorithm and MapReduce., Proceedings of 2017 Palestinian International Conference on Information and Communication Technology (PICICT), 2017, Gaza, Palestine, Array, pp.40-45. https://doi.org/10.1109/PICICT.2017.32
  22. Imam, I, Nounou, N, Hamouda, A, and Khalek, HAA (2013). An ontology-based summarization system for Arabic documents (OSSAD). International Journal of Computer Applications. 74, 38-43. https://doi.org/10.5120/12980-0237
  23. Al-Taani, AT, and Al-Omour, MM . An extractive graph-based Arabic text summarization approach., Proceedings of the International Arab Conference on Information Technology (ACIT), 2014, Nizwa, Oman.
  24. Alami, N, Meknassi, M, Ouatik, SA, and Ennahnahi, N . Arabic text summarization based on graph theory., Proceedings of 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 2015, Marrakech, Morocco, Array, pp.1-8. https://doi.org/10.1109/AICCSA.2015.7507254
  25. Al-Khawaldeh, FT, and Samawi, VW (2015). Lexical cohesion and entailment based segmentation for Arabic text summarization (lCEAS). World of Computer Science & Information Technology Journal. 5, 51-60.
  26. Pang, B, and Lee, L (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval. 2, 1-135. https://doi.org/10.1561/1500000011
  27. Badaro, G, Baly, R, Hajj, H, Habash, N, and El-Hajj, W . A large scale Arabic sentiment lexicon for Arabic opinion mining., Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, Doha, Qatar, pp.165-173.
  28. Fodil, L, Sayoud, H, and Ouamour, S . Theme classification of Arabic text: a statistical approach., Proceedings of the 11th International Conference on Terminology and Knowledge Engineering, 2014, Berlin, Germany.
  29. Mesleh, AMA (2007). Chi square feature extraction based SVMs Arabic language text categorization system. Journal of Computer Science. 3, 430-435. https://doi.org/10.3844/jcssp.2007.430.435
  30. Al-Harbi, S, Almuhareb, A, Al-Thubaity, A, Khorsheed, MS, and Al-Rajeh, A . Automatic Arabic text classification., Proceedings of the 9th International Conference on Textual Data Statistical Analysis (JADT), 2008, Lyon, France.
  31. Al-Kabi, MN, Gigieh, AH, Alsmadi, IM, Wahsheh, HA, and Haidar, MM (2014). Opinion mining and analysis for Arabic language. International Journal of Advanced Computer Science and Applications. 5, 181-195. https://doi.org/10.14569/ijacsa.2014.050528
  32. Abutiheen, ZA, Aliwy, AH, and Aljanabi, KB (2018). Arabic text classification using master-slaves technique. Journal of Physics: Conference Series. 1032. article no 012052
  33. Alshammari, R (2018). Arabic text categorization using machine learning approaches. International Journal of Advanced Computer Science and Applications. 9, 226-30. https://doi.org/10.14569/ijacsa.2018.090332
  34. Arabic learning resources: Arabic media vocabulary. Available: https://arabic.desert-sky.net/g_cases.html
  35. QAMUS. (2002) . Arabic transliteration. Available: http://www.qamus.org/transliteration.htm
  36. (). Why Python for Big Data? | Continuum.
  37. Al-Muhtaseb, HA, Mahmoud, SA, and Qahwaji, RS . Statistical analysis of Arabic text to support optical Arabic text recognition., Proceedings of the International Symposium on Computer and Arabic Language & Exhibition (ISCAL), 2007, Riyadh, Saudi Arabia, pp.1-16.
  38. Saqib, SM, Mahmood, K, and Naeem, T (2016). Comparison of LSI algorithms without and with pre-processing: using text document based search. ACCENTS Transactions on Information Security. 1, 44-51.
  39. Deerwester, S, Dumais, ST, Furnas, GW, Landauer, TK, and Harshman, R (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science. 41, 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  40. Blei, DM, Ng, AY, and Jordan, MI (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research. 3, 993-1022.
  41. Phadnis, N, and Gadge, J (2014). Framework for document retrieval using latent semantic indexing. International Journal of Computer Applications. 94, 37-41. https://doi.org/10.5120/16414-6065

Hasan J. Alyamani is the Chairman of the Information Systems Department, King Abdulaziz University, Saudi Arabia. He received his B.Sc. in Computer Science from Umm Al-Qura University, Saudi Arabia, in 2006, his M.S. in Computer Science from The University of Waikato, New Zealand, in 2012, and his Ph.D. in Computer Science from Macquarie University, Australia, in 2019. He has produced many publications in journals of international repute and has presented papers at international conferences.

E-mail: hjalyamani@kau.edu.sa

Shakeel Ahmad received his B.Sc. with distinction from Gomal University, Pakistan (1986) and his M.Sc. in Computer Science from Quaid-e-Azam University, Pakistan (1990). He received his Ph.D. degree in computer science in January 2008 and completed a one-year post-doctoral study at Universiti Sains Malaysia (USM) in 2010. He started his career as a lecturer in 1990 and served for 11 years at the Institute of Computing and Information Technology (ICIT), Gomal University, Pakistan. He then served as Assistant Professor, Associate Professor, Professor, and Director of ICIT, Gomal University, Pakistan, during 2001–2014. He is currently serving as a professor in the Faculty of Computing and Information Technology Rabigh branch, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia. Dr. Shakeel has an outstanding teaching career with a proficient research background, reflecting more than 31 years of teaching and research experience in machine learning, deep learning, sentiment analysis and text mining, and performance modeling. He has produced many publications in journals of international repute and has presented papers at international conferences.

E-mail: sarahamd@kau.edu.sa

Asif Hassan Syed was born in Jamshedpur, India, in 1982. He received his M.Sc. degree in Biotechnology from Shri Ramachandra Medical College and Deemed University, Chennai, India, in 2005, and his Ph.D. degree from the Indian Institute of Technology Roorkee, India. He completed his post-doctoral work under the guidance of Professor Seyed E. Hasnain at the University of Hyderabad from January 2012 to August 2012. In 2012, he started his career as an Assistant Professor with the Faculty of Computing and Information Technology at Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia, where he is currently serving as an Associate Professor. He is an excellent teacher and talented researcher with more than eight years of teaching and research experience in bioinformatics, chemoinformatics, genomics and proteomics, and machine learning. He has produced many publications in journals of international repute and has presented articles at international conferences. His current research interests are genomics and proteomics, medical informatics, and machine learning. He is a member of the International Association of Engineers (IAENG) and of the following societies: (a) the IAENG Society of Bioinformatics, (b) the IAENG Society of Computer Science, and (c) the IAENG Society of Data Mining. He received the Deanship of Scientific Research Paper Award from King Abdulaziz University in 2012 for the paper "A Qualitative and Quantitative Assay to Study DNA/Drug Interaction Based on Sequence Selective Inhibition of Restriction Endonucleases." He availed a DBT Post-Doctoral Research Associateship in Biotechnology and Life Sciences in 2011 (under Prof. Seyed E. Hasnain), funded by DBT-IISc Bangalore, and received the Louis Pasteur Memorial Award for the best outgoing student of 2003 from the Department of Microbiology, The New College, Chennai, Madras University, India.

E-mail: shassan1@kau.edu.sa

Sheikh Muhammad Saqib received his M.Sc. in Computer Science from Gomal University, Pakistan (2002), his M.S. degree in computer science in 2011, and his Ph.D. degree in computer science in 2020. He started his career as a lecturer in 2012 at the Institute of Computing and Information Technology (ICIT), Gomal University, Pakistan, where he is currently serving as an Assistant Professor. His research interests include machine learning, deep learning, sentiment analysis, and opinion mining.

E-mail: saqibsheikh4@gu.edu.pk

Yasser D. Al-Otaibi received his Bachelor of Information Technology (with a concentration in Information Systems) in 2010 and his Master of Information Systems from Griffith University, Australia, in 2012. He also received a Graduate Diploma of Research Studies in 2013 and a Ph.D. degree in Information Systems in 2018 from Griffith University, Australia. He started his career as a lecturer in 2016 at the College of Computing and Information Technology (Department of Information Systems), King Abdulaziz University, Jeddah. He is currently serving as an Assistant Professor in the Faculty of Computing and Information Technology Rabigh branch, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia. Dr. Yasser has an outstanding teaching career with a proficient research background, reflecting more than 5 years of teaching and research experience in wireless sensor networks, IoT, machine learning, and deep learning. He received the Griffith Award for Academic Excellence 2011–2012 (for studies in the Master of Information Systems) and the Griffith Award for Academic Excellence 2012–2013 (for studies in the Graduate Diploma of Research Studies).

E-mail: yalotaibi@kau.edu.sa

Article

Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2021; 21(4): 409-422

Published online December 25, 2021 https://doi.org/10.5391/IJFIS.2021.21.4.409

Copyright © The Korean Institute of Intelligent Systems.

LSI Authentication-Based Arabic to English Text Converter

Hasan J. Alyamani1, Shakeel Ahmad2, Asif Hassan Syed2, Sheikh Muhammad Saqib3 , and Yasser D. Al-Otaibi1

1Department of Information Systems, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia
2Department of Computer Science, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah, Saudi Arabia
3Institute of Computing and Information Technology, Gomal University, Dera Ismail Khan, Pakistan

Correspondence to:Sheikh Muhammad Saqib (saqibsheikh4@gu.edu.pk)

Received: July 7, 2021; Revised: September 9, 2021; Accepted: October 5, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Due to the large amount of Arabic text produced on a daily basis, there is a need to analyze these texts. Following a comprehensive literature review, there become clear several issues related to Arabic text summaries, keyword extraction, and sentiment analyses. These issues occur owing to several factors, such as the structure and morphology of Arabic text, a lack of machine-readable Arabic dictionaries, insufficient tools to manage Arabic text, no standard datasets, inherently cursive scripts, and isolated characters; thus, there is a need to create Arabic text in forms that can be easily read by machine learning, deep learning algorithms, and existing analysis tools. To achieve this, the Arabic texts must be converted into English texts. This paper proposes a lexicon called the AEC-Lexicon for use by all researchers working in Arabic, which is based on the Arabic case system and converts Arabic text into English text. Based on the experimental results of latent semantic indexing (LSI), it was found that texts generated from the proposed work exhibited a significant improvement over existing work (converted Arabic to English texts), considering reading and understanding as well as the relevance to the original Arabic text.

Keywords: Arabic text, Machine learning algorithms, Lexicon, Automatic generated summary

1. Introduction

This section describes the problems in the literature and how the proposed work will overcome these problems. There are six major dialectical groups in Saudi Arabia. The main dialects in Saudi Arabia are the Hejazi dialect and the Nejdi dialect, with six million and eight million speakers, respectively [1]. According to the Arab Social Media Report in 2017 [2], there are 2.6 million active Twitter users from Saudi Arabia. Learning Arabic makes one stand out, as there are very few people from the West who speak Arabic, and having command of the Arabic language seems clever and sophisticated [3] because a considerable amount of research has been done on Arabic texts. Arabic summarization is a highly difficult task because the Arabic language is complex with respect to its morphology and structure [4]. Although standard Arabic morphological analysis tools are not available [5], some researchers have utilized existing common Arabic stemming as a workaround [6].

Existing resources for Arabic texts are limited, and the accuracy of existing methods is low owing to Arabic specific (e.g., limited resources, Arabic’s morphological complexity, differing dialects) and general linguistic issues (e.g., implicit sentiment, sarcasm, polarity fuzziness, polarity strength, review quality, spam, domain dependence) [7]. Stemming Arabic words is a complex issue in the field of Arabic text classification [8]. As reported in [9] and [10], the issues and challenges in identifying sentiments in informal Arabic language in the context of Twitter and YouTube Arabic content were investigated, which is unstructured in nature. Limited research has been conducted on opinion mining using Arabic Twitter [11] because the majority of the natural language processing (NLP) tools for the Arabic language have been developed for Modern Standard Arabic (MSA) rather than informal Arabic [12]. All machine learning algorithms are available for the English language, so there the above issue does not apply to machine-readable texts.

Thus, there is a strong need to analyze the bulk of Arabic text being created daily, but it is difficult to manage because of the following issues involved in handling Arabic text:

  • - The complexity of the Arabic language related to structure and morphology [4].

  • - Due to the inherent complexity of the Arabic language, very little research has been conducted combining Arabic text summarization and text entailment to produce extracts [13].

  • - Lack of existing Arabic morphological analysis tools [5].

  • - A limited number of documents with keywords are available online [14].

  • - There is a lack of techniques for scanning and matching of Arabic texts to identify key phrases [15].

  • - No gold standard summaries for Arabic and a lack of machine-readable Arabic dictionaries [16].

  • - Little current research on the automatic classification of text documents due to stemming Arabic [8], Arabic-specific, and general linguistic issues [7]; and existing tools are used for MSA and ignore informal Arabic [11].

The most obvious characteristics of the Arabic language are that Arabic scripts are inherently cursive and that isolated characters are written out, as shown in Figure 1 [17]; thus, there is a need to convert Arabic text into English text for further processing.

The primary purpose of the proposed methodology is to address all of the issues discussed. For example, there should be a common and consistent Arabic text that can easily solve the issues relating to machine-readability, the generation of standard summary, and morphology, which can be read by different tools. In this study, Arabic texts were converted into English texts by preparing a lexicon for conversion. The coordinates of existing documents were also compared with the proposed documents using the latent segmenting indexing method, and it was found that the documents produced by the proposed method were highly similar to the Arabic documents. The key contribution of this work is the proposed lexicon, which is based on several generated rules:

  • - Proposing Simple-Rules (based on the absence of case systems).

  • - Proposing FKD-Rules based on Fattah, Kasrah, and Dammah.

  • - Proposing T-Rules based on Tanveen

  • - Proposing a user-friendly interface for converting Arabic texts into English texts.

2. Related Work

Using the Internet, a study [18] was conducted investigating Arabic and Islamic content containing relevant information from the prophetic narration texts using an artificial intelligence (AI) approach. The authors proposed a semantically driven approach to analyze Arabic discourse following the segmented discourse representation theory (SDRT) framework. It has also been found that discourse analysis can be used to produce indicative summaries of Arabic documents [13]. The authors of [19] proposed a new model for summarization based on clusters resulting from the document clustering extraction of keyphrases. The authors of [20] proposed an approach called Adaboost based on a supervised method to produce Arabic summary extraction. For this study, a set of statistical features were utilized including sentence position, the number of keywords in a sentence, overlap with word title, and sentence length. The proposed approach produces automatic text summarization of multiple large-scale Arabic documents using a genetic algorithm and the MapReduce parallel programming model [21]. Few studies have been conducted on an approach that ensures scalability, speed, and accuracy of summary generation [13].

Arabic text summarization is still a new approach because of the inherent complexity of the Arabic language as well as the lack of a standardized evaluation process, because there are no gold standard summaries for Arabic machine-readable dictionaries [16]. There is a model that presents an Arabic text summarization system (query-based) using Arabic WordNet and an extracted knowledge base [22]. As reported in [23], a weighted directed graph represents each document, with nodes corresponding to sentences and edge weights representing similarities between the sentences. The authors of [24] introduced a new method for Arabic text summarization based on graph theory; however, summarization systems for Arabic are still not as mature or as reliable as those developed for other languages such as English. In [25], the authors addressed the problem of developing an Arabic text summarization system (LCEAS) that produces extracts without redundancy.

Many text classification (TC) research studies have been conducted and tested with on the English, French, German, Spanish, Chinese, Greek, and Japanese languages [26]. Opinion mining is the process of automatically identifying opinions expressed in Arabic texts on certain subjects [27]; however, there is little current research on the automatic classification of text documents in Arabic because of varying spellings of certain words, various ways to write certain combination characters, and short (diacritics) and long vowels. In addition, most Arabic words contain affixes [28]. Mesleh [29] utilized three machine-learning algorithms–support vector machine (SVM), k-nearest neighbor (KNN), and naïve Bayes (NB)–to classify Arabic data collected from Arabic newspapers. The studies [30] and [31] used TC and analyses of Arabic texts to automatically assign texts to predefined categories based on their linguistic features. The NB, KNN, multinomial logistic regression (MLR), and maximum weight (MW) classifiers were used, with the NB classifier implemented as the master and the others as slaves [32]. All tests were performed after preprocessing Arabic text documents. The results showed that the master-slave technique yields a significant improvement in the accuracy of text document classification compared to other techniques [32]. The authors of [7] also used three machine-learning classification techniques, NB, SVM, and decision trees, to improve the sentiment analyzer. Stemmer and document embedding techniques were used for text classification to investigate the impact of the preprocessing methods on the performance of three machine-learning algorithms: NB, discriminative nulti-nominal naive Bayes (DMNB) text, and C4.5, all of which were used for Arabic text categorization [33]. The results of the sentiment analysis and subjectivity analysis demonstrated that while the use of lexeme or lemma data is useful, there is also a need for individualized solutions for each task and genre [10].

3. Case System

This section provides information about the case system of Arabic, which corresponds to vowels in English. The case endings in Arabic are little patterns ( Harakaat) which are appended to the ends of words to indicate the their grammatical functions in a sentence. Case endings are usually not written (with one exception) outside of the Qur’an/Bible and children’s books; however, newscasters pronounce them, and to speak formal Arabic well, it is important to be familiar with the case system [3,34]:

In the proposed work, three rules were created to update the lexicon: the Simple-Rule, FKD-Rule, and T-Rule. The FKD-Rule is based on Fathah, Kasrah, and Dammah, while the T-Rule is based on Tanveen. The following Arabic case systems ( ) will be matched with vowels and some other sounds of Arabic. In Arabic, Fathah is a small diagonal line placed above a letter ( ) and maps to the letter a in English. The Kasrah ( ) is a small diagonal line placed below a letter ( ) and represents a short i in English. The Dammah is a small curl placed above a letter ( ) represented by a u or o in English. Tanween corresponds to the three vowel accents that may be doubled at the end of a word, indicating that the vowel is followed by the consonant n. The signs from left to right indicate ( ) (transliterated: an in un).

4. AEC-Lexicon

The Arabic to English convertor lexicon (AEC-Lexicon) was developed from different sets of pairs and consists of a set of three categories. Some heuristics from Arabic sounds were used to develop this lexicon. The first set was generated using only simple similar sounding characters from Arabic and English, as mentioned in the transliteration system [35]. The second set was generated from similar sounding characters in Arabic and English using Fathah, Kasrah, and Dammah. The third set was generated from similar sounding characters in Arabic and English using Tanveen.

4.1 Simple-Rule

The Simple-Rule consists of simple characters without a case system. This rule consists of a tuple containing a pair denoted ASC:ESC, where ASC means Arabic sound character(s) (i.e., ) and ESC means relevant English sound character (i.e., S). After processing, 20 pairs of Arabic to English sounds were produced, as shown in Table 1.

4.2 FKD-Rule

The basic case systems used in Arabic which produce sound are A-FKD {FATHAH, KASRAH, and DUMMA} and E-rFKD{A, I, and U}, where “A” is an alternate of FATHAH, “I” is an alternate of KASRAH, and “U” is an alternate of DUMMA. As shown in Eqs. (1) and (2), the dot symbol ‘.’ connects all Arabic characters in ASC with FATHAH, KASRAH, and DUMMA and all English characters in ESC with A, I, and U. Then, they are paired with full colons, as shown in Eq. (3). All pairs were created based on their relation to ASC.A-FKD:ESC.E-rFKD using the following equations:


E-rFKD={A,I,U}(replacement FKD),Pair-FKD (A:E)=i=1m(j=1m(ASC)i).(A-FKD)j:(ESCi).(E-rFKD)j).

The equation Pair-FKD(A:E) is a pair consisting of ASC.A-FKD and ESC.E-rFKD and represents a tuple, i.e., for Arabic character ‘ ’, the tuple with the English character B becomes ( ), as shown in Figure 2.

Table 2 contains all pairs generated using the FKD-Rule.

4.3 T-Rule

The “T” in T-Rule refers to Tanveen ( ) and is also based on the FKD-Rule. Tanveen used in Arabic are A-T { } and ErT {AN, IN, UN}, where “AN” is a replacement for , “IN” is a replacement for , and “UN” is a replacement for , as shown in Eqs. (4) and (5). The dot symbol ‘.’ connects all Arabic characters in ASC with and all English characters ESC with AN, IN, and UN. They are then paired with semicolons, as shown in Eq. (6). All pairs were created based on relations to ASC.A-T:ESC.E-rT.


$$\text{A-T} = \{\text{FATHATAN},\ \text{KASRATAN},\ \text{DAMMATAN}\},\tag{4}$$
$$\text{E-rT} = \{\text{AN},\ \text{IN},\ \text{UN}\}\ \text{(replacement T)},\tag{5}$$
$$\text{Pair-T}(A{:}E) = \bigcup_{i=1}^{m}\bigcup_{j=1}^{3} (\text{ASC}_i).(\text{A-T})_j : (\text{ESC}_i).(\text{E-rT})_j.\tag{6}$$

Pair-T(A:E) defines tuples consisting of ASC.A-T and ESC.E-rT; for example, for the Arabic character corresponding to the English character B, the resulting tuples are shown in Figure 3.

All pairs generated using the T-Rule are shown in Table 3. In Table 3, the rows in black text indicate unique sounds, while the colored rows keep the same sounds from Arabic to English.
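The T-Rule pairs can be generated in the same way; the sketch below follows Eq. (6) under the same assumptions as the FKD sketch above.

```python
# Sketch of Eq. (6): concatenate each Arabic sound character with the three
# tanween signs and the matching English character with AN/IN/UN.
FATHATAN, KASRATAN, DAMMATAN = "\u064B", "\u064D", "\u064C"
A_T = [FATHATAN, KASRATAN, DAMMATAN]
E_rT = ["an", "in", "un"]

def build_t_pairs(simple_rule):
    """Return {ASC + tanween: ESC + ending} for every ASC:ESC pair."""
    return {asc + tanween: esc + ending
            for asc, esc in simple_rule.items()
            for tanween, ending in zip(A_T, E_rT)}

# For the single pair beh:b this yields "ban", "bin", and "bun",
# mirroring the example of Figure 3.
print(build_t_pairs({"\u0628": "b"}))
```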

These rules were implemented in Python because it is an open-source language that is easy to understand, learn, and use, and it provides highly useful libraries for data manipulation and analysis. Compared with other modern languages, Python-based solutions are remarkable in terms of agility and productivity, and organizations ranging from the largest investment banks to the smallest social/mobile web app startups use Python to operate their businesses and manage their data [36].

After the three rules were generated, they were applied to Arabic sentences. Either the FKD-Rule or the T-Rule can be applied first, followed by the Simple-Rule; applying the Simple-Rule first may leave the converted English text in a less readable form. The conversion process is shown in Figure 4.
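A hedged sketch of this ordering is given below. The three rule dictionaries are passed in as parameters and are assumed to be built as in the earlier sketches; the exact matching strategy of the published implementation is not reproduced here.

```python
# Sketch of the conversion order of Figure 4: try the two-character T-Rule and
# FKD-Rule keys (character + diacritic) first, then fall back to the
# one-character Simple-Rule; anything else (spaces, punctuation) is kept as is.
def convert(arabic_text, pair_t, pair_fkd, simple_rule):
    out, i = [], 0
    while i < len(arabic_text):
        two = arabic_text[i:i + 2]
        if two in pair_t:                       # Tanween-marked character
            out.append(pair_t[two]); i += 2
        elif two in pair_fkd:                   # Fathah/Kasrah/Dammah-marked character
            out.append(pair_fkd[two]); i += 2
        elif arabic_text[i] in simple_rule:     # bare character
            out.append(simple_rule[arabic_text[i]]); i += 1
        else:
            out.append(arabic_text[i]); i += 1  # keep unknown characters unchanged
    return "".join(out)
```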

Quranic Arabic texts from “Surah Ikhlas” (chapter 112, verses 1–4) and “Surah An-Nass” (chapter 114, verses 1–6) were converted, as shown in Table 4. These results were generated using only Fathah, Kasrah, Dammah, and Tanween.

5. Results

It is worth mentioning that “no generally accepted database for Arabic text recognition is freely available for researchers” [37]. Hence, different researchers investigating Arabic text recognition use different data, and the recognition rates of different techniques may not be directly comparable. Dataset-1 was prepared from the website https://www.sujood.co, which hosts publicly available collections of prophetic traditions (hadith) containing the kinds of text required by the proposed approach. The proposed approach was tested on Dataset-1, in which each hadith appears in three forms: the original Arabic text, the English-converted text, and the English translation. Only the first two forms are discussed here because the primary aim of the proposed method is to convert Arabic text into English text; the second form is compared with the output of the proposed approach. Sample listings from the dataset are presented in Table 5.

After the conversion of the Arabic texts to English, the existing and proposed approaches were compared, and the proposed approach yielded the most readable form. The comparison for the sample data is presented in Table 6. In the second column, the red-colored words produced by the existing approach are less readable, whereas the corresponding green-colored words in the third column, produced by the proposed approach, can be read easily.

5.1 Audio Result

The converted documents were also checked using an online text-to-speech tool (www.fromtexttospeech.com), and the resulting sounds were similar to the original Arabic sounds to the extent allowed by the developed rules.

5.2 Statistical Measures

The latent semantic indexing (LSI) approach was applied to determine the accuracy of the existing and proposed approaches. For this purpose, the vector coordinates of the original Arabic documents, the documents converted using the existing approach, and the documents converted using the proposed approach were examined, and the coordinates of the existing and proposed documents were compared with those of the Arabic text. A query was also required to determine the coordinates. The documents of all approaches were kept in the same order, and Document-6 was taken as the query, as shown in Table 7. In addition, the similarity distance of each document from this query was determined.

5.2.1 Vector coordinates of documents

The main objective was to identify the coordinates of each document and of the query. Singular value decomposition (SVD) can determine these coordinates. SVD decomposes a matrix into three matrices, U, S, and V, which are used for further processing. The required input is a feature matrix whose rows and columns contain integers; it is built from the text documents by counting the frequency of each word. Thus, a feature matrix is first created from all documents, and then the SVD is computed; the matrices U, S, and V are calculated using the NumPy library. The coordinates of all documents were determined from S and combined with the query to obtain the query coordinates. Finally, a cosine similarity function was applied to these coordinates to identify the documents closest to the query [38].
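The sketch below shows one common way to realize this procedure with NumPy; the tokenization and the exact coordinate convention are assumptions, since the paper does not list its preprocessing code. Calling lsi_coordinates(documents, documents[5]) reproduces the setup in which Document-6 serves as the query.

```python
# Assumed LSI sketch: build a term-document frequency matrix, take a rank-2 SVD,
# and read off 2-D (X, Y) coordinates for every document and for the query.
import numpy as np

def lsi_coordinates(documents, query, k=2):
    vocab = sorted({w for d in documents for w in d.split()})
    index = {w: r for r, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for w in doc.split():
            A[index[w], j] += 1                          # word frequency
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    doc_coords = (np.diag(S[:k]) @ Vt[:k]).T             # one (X, Y) row per document
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in index:
            q[index[w]] += 1
    query_coords = q @ U[:, :k] @ np.linalg.inv(np.diag(S[:k]))  # fold the query in
    return doc_coords, query_coords
```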

From S, the coordinates of all documents were determined for the Arabic texts (column 1 in Table 5), the texts converted using the existing approach (column 1 in Table 6), and the texts converted using the proposed approach (column 2 in Table 6); these coordinates are shown in Table 8. They were then combined with the corresponding queries from Table 7 to obtain the query coordinates shown in Table 9. Table 8 lists the coordinates of 10 documents each from the Arabic, proposed, and existing texts, and they are compared in Figure 5.

5.2.2 Similarity distance of documents from the query

LSI, proposed in [39,40], is an efficient information retrieval algorithm [41]. In LSI, the cosine similarity between the coordinates (X, Y) of a document vector and the coordinates (X, Y) of a query vector indicates how closely the document matches the query: a value of 1 corresponds to a 100% match, 0.9 to a 90% match, and 0.5 to a 50% match. The coordinates of each document and of the query are obtained exactly as described in Section 5.2.1: a feature matrix of word frequencies is built from all documents, SVD is computed with NumPy to obtain U, S, and V, the document coordinates are determined from the decomposition, and the query is projected into the same space to obtain its coordinates. Cosine similarity is then applied to these coordinates to find the documents that best match the query.

Thus, LSI uses the cosine similarity measure of Eq. (7) to rank the documents with respect to the query by comparing the coordinates (X, Y) of each document with the query coordinates (X, Y). LSI also determines the query coordinates for the original Arabic text, for the Arabic-to-English text converted using the proposed technique, and for the Arabic-to-English text converted using the existing technique; these query coordinates are shown in Table 9.

$$\text{Similarity}(A,B) = \frac{\sum_{i=1}^{n} A_i \, B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\ \sqrt{\sum_{i=1}^{n} B_i^{2}}}.\tag{7}$$
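Eq. (7) can be written directly as a small function. As a check against the reported numbers, applying it to the D7 Arabic coordinates in Table 8 and the Arabic query coordinates in Table 9 gives a value of approximately 1, consistent with the D7 entry in Table 10.

```python
# Eq. (7): cosine similarity between the (X, Y) coordinates of a document and
# of a query, as used to produce the similarity distances in Table 10.
import numpy as np

def similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# D7 (Arabic) coordinates from Table 8 against the Arabic query from Table 9.
print(similarity((-0.029725105, -0.956431375), (-0.022388, -0.756423)))
```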

Table 10 lists the similarity distance between each of the 10 sample documents and its corresponding query for the Arabic, proposed, and existing texts.

As shown in Table 10, the values in the proposed-work column are the most similar to those of the Arabic text, whereas the values in the existing-approach column differ considerably. Figure 6 compares the existing and proposed methods with the Arabic text; the line for the proposed method lies much closer to the line for the Arabic text than the line for the existing method does.

In Table 10, a document matches the query 100% if the value is 1, 90% for a value of 0.9, and 50% for a value of 0.5. For the Arabic text and the proposed approach (green), D7, D8, and D10 are closest to their queries, whereas for the existing approach (gray), D1, D2, D3, D6, and D9 are closest to their queries, as summarized in Table 11. This demonstrates that the English text produced by the proposed method behaves like the original Arabic text, unlike the English text produced by the existing approach.

6. Conclusion

It was concluded that, using the proposed work, all kinds of Arabic text, such as Arabic stop words, dictionaries, online reviews, tweets, lexicons, Quranic text, and hadith text, can easily be converted into English text, enabling the many types of analyses that require English text as input. Researchers can also use this lexicon to convert Arabic datasets into English text.

Although Arabic text can be converted into readable English using these case markers, the readability can be further improved by handling additional markers such as Maddah, Dagger Alif, and Shaddah. Future work will focus on converting English text back into Arabic text by inverting the pairs ASC:ESC to ESC:ASC, ASC.A-FKD:ESC.E-rFKD to ESC.E-rFKD:ASC.A-FKD, and ASC.A-T:ESC.E-rT to ESC.E-rT:ASC.A-T, with the inclusion of the remaining case markers.

Figure 1. A sample of Arabic writing [17].

Figure 2. Conversion of Arabic character to English character using A-FKD.

Figure 3. Conversion of Arabic character to English character using A-T.

Figure 4. Arabic to English complete sentence conversion.

Figure 5. Comparison of existing and proposed approaches with Arabic text based on coordinates.

Figure 6. Comparison of proposed work and existing work based on similarity distance.

Table 1. All ASC:ESC pairs.

Table 2. All ASC.A-FKD and ESC.E-rFKD pairs.

Table 3. All ASC.A-T and ESC.E-rT pairs.

Table 4. Conversion of “Surah Ikhlas” and “Surah An-Nass”.

Table 5. Sample sentences from Dataset-1.

Serial No | Arabic text | English converted text (existing approach)
1 | | “Allahumma innee a’uzubika min ham ayhzununee, wa min fikr yuqliqunee, wa 3ilm yut3ibunee, wa shakhS yahmilu khubsan-lee”
2 | | “Allahuma inni ashku ilayka du’fa quwati wa qilata heelati wa hawaani ala annaas. yaa arham araahimin, anta rab wa anta rabi”
3 | | “Allahumma inni a’udhu bika min zawali ni’matika, wa tahawwuli’afiyatika, wa fuja’ati niqmatika, wa jami’i sakhatika”
4 | | “Allahumma azhib Gaydha Qalbee”
5 | | “Rabbi IshraHlee Sadree, wa yassirlee amree, waHlul Uqdatan min lisaanee, yafqahuu qawlee”
6 | | “Allahumma innee a3uzubika min alham wa alhuzn wa al3ajz wa alkusl wa albukhl wa aljubn wa galbah aldayn wa Galbah alrijaal”

Table 6. Text comparison of existing and proposed approaches.


Table 7. Query document from all datasets.

Query from dataset | Document No from dataset | Text of selected document
Query (Arabic documents) | Document-6 |
Query (Existing approach) | Document-6 | “Allahumma innee a3uzubika min alham wa alhuzn wa al3ajz wa alkusl wa albukhl wa aljubn wa galbah aldayn wa Galbah alrijaal”
Query (Proposed approach) | Document-6 |

Table 8. Coordinates of all documents from all datasets.

Doc-No | Arabic X | Arabic Y | Proposed X | Proposed Y | Existing X | Existing Y
1 | −1.38872E-16 | 6.4016E-17 | 8.4641E-16 | 9.0951E-19 | −0.31422 | 0.219757
2 | −0.985261873 | 0.044420577 | 0.98526187 | 0.04442058 | −0.28588 | 0.207485
3 | −4.17806E-16 | −1.08385E-16 | 9.645E-17 | 1.0961E-18 | −0.29549 | 0.170587
4 | −8.27432E-17 | 2.33523E-16 | 8.2149E-17 | −9.72E-17 | −0.02333 | −0.02504
5 | 3.1507E-16 | 7.12019E-17 | −1.099E-15 | −1.881E-17 | −0.10085 | 0.080281
6 | 2.06014E-16 | 4.21434E-17 | −6.323E-16 | 1.8952E-19 | −0.64589 | 0.343129
7 | −0.029725105 | −0.956431375 | 0.02972511 | −0.9564314 | −0.1016 | 0.033962
8 | −0.016770068 | −0.27385777 | 0.01677007 | −0.2738578 | −0.18761 | 0.036649
9 | −0.167423851 | −0.060957415 | 0.16742385 | −0.0609574 | −0.11867 | 0.07706
10 | −0.007967295 | −0.067468161 | 0.0079673 | −0.0674682 | −0.49471 | −0.86389

Table 9. Query coordinates from all documents.

Coordinates | Arabic query | Proposed query | Existing query
X, Y | −0.022388, −0.756423 | 0.033842, −0.7614 | −0.65167, 0.3559

Table 10. Distance of all documents of the datasets from their relevant query.

Documents | Arabic | Proposed | Existing
D1 | −0.391582646215 | 0.0433259292082 | 0.993919277857
D2 | −0.0154648277095 | −0.000640494192478 | 0.991839130391
D3 | 0.279630504392 | 0.0330440114234 | 0.999721682081
D4 | −0.932286912287 | 0.79168449571 | 0.247472435013
D5 | −0.249189850556 | −0.0273075914125 | 0.985180590511
D6 | −0.22931184731 | −0.0446988765309 | 0.999932316216
D7 | 0.999998904667 | 0.999910958682 | 0.984310426495
D8 | 0.999501678847 | 0.999859788963 | 0.953227104676
D9 | 0.36976960257 | 0.383502674056 | 0.997116046374
D10 | 0.996134343355 | 0.997327099051 | 0.0200963618144

Table 11. Closest documents to relevant query.

Approaches | Close to query w.r.t. points
Arabic text documents | D7 (0.999998904667), D8 (0.999501678847), D10 (0.996134343355)
Proposed text documents | D7 (0.999910958682), D8 (0.999859788963), D10 (0.997327099051)
Existing text documents | D1 (0.993919277857), D2 (0.991839130391), D3 (0.999721682081), D6 (0.999932316216)

References

  1. Aldayel, HK, and Azmi, AM (2016). Arabic tweets sentiment analysis: a hybrid scheme. Journal of Information Science. 42, 782-797. https://doi.org/10.1177%2F0165551515610513
  2. Salem, F (2017). Social media and the internet of things towards data-driven policymaking in the Arab world: potential, limits and concerns,” MBR School of Government. The Arab Social Media Report no 7. Dubai, UAE
  3. Arabic Language blog. (2011) . Arabic Diacritics (Al-Tashkeel). Available: https://blogs.transparent.com/arabic/2-arabic-diacritics-al-tashkeel
  4. Rose, S, Engel, D, Cramer, N, and Cowley, W (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory. Chichester, UK: Wiely, pp. 1-20 https://doi.org/10.1002/9780470689646.ch1
  5. Saad, MK 2010. The impact of text preprocessing and term weighting on Arabic text classification. Ph.D. dissertation. The Islamic University of Gaza. Palestine.
  6. Conchillo, A, Cercaci, L, Ansorena, D, Rodriguez-Estrada, MT, Lercker, G, and Astiasaran, I (2005). Levels of phytosterol oxides in enriched and nonenriched spreads: application of a thin-layer chromatography-gas chromatography methodology. Journal of Agricultural and Food Chemistry. 53, 7844-7850. https://doi.org/10.1021/jf050539m
  7. Hamouda, AEDA, and El-taher, FEZ (2013). Sentiment analyzer for Arabic comments system. International Journal of Advanced Computer Science and Applications. 4, 99-103. https://doi.org/10.14569/IJACSA.2013.040317
  8. Elhassan, R, and Ahmed, M (2015). Arabic text classification review. International Journal of Computer Science and Software Engineering. 4, 1-5.
  9. AlOtaibi, S, and Khan, MB (2017). Sentiment analysis challenges of informal Arabic. International Journal of Advanced Computer Science and Applications. 8, 278-284. https://doi.org/10.14569/ijacsa.2017.080237
  10. Abdul-Mageed, M, Diab, M, and Kubler, S (2014). SAMAR: Subjectivity and sentiment analysis for Arabic social media. Computer Speech & Language. 28, 20-37. https://doi.org/10.1016/j.csl.2013.03.001
  11. Alwakid, G, Osman, T, and Hughes-Roberts, T (2017). Challenges in sentiment analysis for Arabic social networks. Procedia Computer Science. 117, 89-100. https://doi.org/10.1016/j.procs.2017.10.097
  12. Alowisheq, A, Alhumoud, S, Altwairesh, N, and Albuhairi, T (2016). Arabic sentiment analysis resources: a survey. Social Computing and Social Media. Cham, Switzerland: Springer, pp. 267-278 https://doi.org/10.1007/978-3-319-39910-2_25
  13. Lagrini, S, Redjimi, M, and Azizi, N (2017). Automatic Arabic text summarization approaches. International Journal of Computer Applications. 164, 31-37. https://doi.org/10.5120/ijca2017913628
  14. El-Fishawy, N, Hamouda, A, Attiya, GM, and Atef, M (2014). Arabic summarization in twitter social network. Ain Shams Engineering Journal. 5, 411-420. https://doi.org/10.1016/j.asej.2013.11.002
  15. Najadat, HM, Hmeidi, II, Al-Kabi, MN, and Issa, MMB (2016). Automatic keyphrase extractor from Arabic documents. International Journal of Advanced Computer Science and Applications. 7, 192-199. https://doi.org/10.14569/ijacsa.2016.070226
  16. Al Qassem, LM, Wang, D, Al Mahmoud, Z, Barada, H, Al-Rubaie, A, and Almoosa, NI (2017). Automatic Arabic summarization: a survey of methodologies and systems. Procedia Computer Science. 117, 10-18. https://doi.org/10.1016/j.procs.2017.10.088
  17. Khorsheed, MS, and Al-Omari, H . Recognizing cursive Arabic text: Using statistical features and interconnected mono-HMMs., Proceedings of 2011 4th International Congress on Image and Signal Processing, 2011, Shanghai, China, Array, pp.1540-1543. https://doi.org/10.1109/CISP.2011.6100511
  18. Atwell, E, Brierley, C, Dukes, K, Sawalha, M, and Sharaf, AB . An artificial intelligence approach to Arabic and Islamic content on the Internet., Proceedings of NITS 3rd National Information Technology Symposium, 2011, Riyadh, Saudi Arabia, Array, pp.1-8. https://doi.org/10.13140/2.1.2425.9528
  19. Fejer, HN, and Omar, N (2015). Automatic multi-document Arabic text summarization using clustering and keyphrase extraction. Journal of Artificial Intelligence. 8, 1-9. https://doi.org/10.3923/jai.2015.1.9
  20. Belkebir, R, and Guessoum, A (2015). A supervised approach to Arabic text summarization using Adaboost. New Contributions in Information Systems and Technologies. Cham, Switzerland: Springer, pp. 227-236 https://doi.org/10.1007/978-3-319-16486-1_23
  21. Baraka, RS, and Al Breem, SN . Automatic Arabic text summarization for large scale multiple documents using genetic algorithm and MapReduce., Proceedings of 2017 Palestinian International Conference on Information and Communication Technology (PICICT), 2017, Gaza, Palestine, Array, pp.40-45. https://doi.org/10.1109/PICICT.2017.32
  22. Imam, I, Nounou, N, Hamouda, A, and Khalek, HAA (2013). An ontology-based summarization system for Arabic documents (OSSAD). International Journal of Computer Applications. 74, 38-43. https://doi.org/10.5120/12980-0237
  23. Al-Taani, AT, and Al-Omour, MM . An extractive graph-based Arabic text summarization approach., Proceedings of the International Arab Conference on Information Technology (ACIT), 2014, Nizwa, Oman.
  24. Alami, N, Meknassi, M, Ouatik, SA, and Ennahnahi, N . Arabic text summarization based on graph theory., Proceedings of 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 2015, Marrakech, Morocco, Array, pp.1-8. https://doi.org/10.1109/AICCSA.2015.7507254
  25. Al-Khawaldeh, FT, and Samawi, VW (2015). Lexical cohesion and entailment based segmentation for Arabic text summarization (lCEAS). World of Computer Science & Information Technology Journal. 5, 51-60.
  26. Pang, B, and Lee, L (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval. 2, 1-135. https://doi.org/10.1561/1500000011
  27. Badaro, G, Baly, R, Hajj, H, Habash, N, and El-Hajj, W . A large scale Arabic sentiment lexicon for Arabic opinion mining., Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, Doha, Qatar, pp.165-173.
  28. Fodil, L, Sayoud, H, and Ouamour, S . Theme classification of Arabic text: a statistical approach., Proceedings of the 11th International Conference on Terminology and Knowledge Engineering, 2014, Berlin, Germany.
  29. Mesleh, AMA (2007). Chi square feature extraction based SVMs Arabic language text categorization system. Journal of Computer Science. 3, 430-435. https://doi.org/10.3844/jcssp.2007.430.435
  30. Al-Harbi, S, Almuhareb, A, Al-Thubaity, A, Khorsheed, MS, and Al-Rajeh, A . Automatic Arabic text classification., Proceedings of the 9th International Conference on Textual Data Statistical Analysis (JADT), 2008, Lyon, France.
  31. Al-Kabi, MN, Gigieh, AH, Alsmadi, IM, Wahsheh, HA, and Haidar, MM (2014). Opinion mining and analysis for Arabic language. International Journal of Advanced Computer Science and Applications. 5, 181-195. https://doi.org/10.14569/ijacsa.2014.050528
  32. Abutiheen, ZA, Aliwy, AH, and Aljanabi, KB (2018). Arabic text classification using master-slaves technique. Journal of Physics: Conference Series. 1032. article no 012052
  33. Alshammari, R (2018). Arabic text categorization using machine learning approaches. International Journal of Advanced Computer Science and Applications. 9, 226-30. https://doi.org/10.14569/ijacsa.2018.090332
  34. Arabic learning resources: Arabic media vocabulary. Available: https://arabic.desert-sky.net/g_cases.html
  35. QAMUS. (2002) . Arabic transliteration. Available: http://www.qamus.org/transliteration.htm
  36. Why Python for Big Data? | Continuum.
  37. Al-Muhtaseb, HA, Mahmoud, SA, and Qahwaji, RS . Statistical analysis of Arabic text to support optical Arabic text recognition., Proceedings of the International Symposium on Computer and Arabic Language & Exhibition (ISCAL), 2007, Riyadh, Saudi Arabia, pp.1-16.
  38. Saqib, SM, Mahmood, K, and Naeem, T (2016). Comparison of LSI algorithms without and with pre-processing: using text document based search. ACCENTS Transactions on Information Security. 1, 44-51.
  39. Deerwester, S, Dumais, ST, Furnas, GW, Landauer, TK, and Harshman, R (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science. 41, 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  40. Blei, DM, Ng, AY, and Jordan, MI (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research. 3, 993-1022.
  41. Phadnis, N, and Gadge, J (2014). Framework for document retrieval using latent semantic indexing. International Journal of Computer Applications. 94, 37-41. https://doi.org/10.5120/16414-6065