The rapid expansion of data published on the web has made large-scale similarity measurement an important research problem in computer science, and several methods have been developed to address it. In this paper, we propose the first mathematical model to find the similarity value between generalized trapezoidal fuzzy numbers (GTFNs). The model employs a fuzzy inference system to determine effective weights, which are associated with different similarity methods that can handle large-scale data. This model will allow us to develop intelligent systems. A comparative study based on 21 sets of GTFNs has been carried out to demonstrate the difference between our approach and existing methods. This study shows that our model gives more reasonable results than the existing methods.
The concept of fuzzy logic, proposed in 1965 by Zadeh [1], has been used to manage a kind of probability. To employ this concept, the generalized trapezoidal fuzzy numbers (GTFNs) are the most popular in practice. In the literature, several similarity methods between GTFNs have been introduced (e.g., [2–10]). However, these existing similarity measures have many weaknesses: in many situations, they cannot appropriately find the similarity between two GTFNs. In the present study, a novel mathematical model for a fuzzy-number similarity method between GTFNs has been created, based on the weights associated with each similarity measure. This model uses the cosine coefficient and the Jaccard Index. Additionally, we describe and prove three properties of the proposed model. A comparative study has been carried out, based on 21 sets of GTFNs, to show that the model can overcome the limitations of the existing measures.
The remainder of this article is structured as follows. Section 2 presents a summary of the fundamental notions of existing methods. Section 3 introduces the novel approach, presenting the similarity methods employed and finding the weights associated with each similarity method. Several properties are stated and proven. Section 4 compares the approach with the existing similarity methods. We give our conclusions in Section 5.
The notion of a generalized fuzzy number (GFN) T̃ is presented by Chen [11, 12] as follows: T̃ = {t_{1}, t_{2}, t_{3}, t_{4}, w_{T̃}}, where the t_{k} (k = 1, 2, 3, 4) denote real numbers and w_{T̃} is a weight with 0 < w_{T̃} ≤ 1. When w_{T̃} = 1, T̃ is a normal trapezoidal fuzzy number (NTFN). When t_{2} = t_{3}, T̃ is called a generalized triangular fuzzy number. When t_{1} = t_{2} and t_{3} = t_{4}, T̃ is called a crisp interval. When w_{T̃} = 1 and t_{1} = t_{k} (where k = 1, 2, 3, 4), T̃ is a real number.
Several similarity methods between the GTFNs T̃ and H̃ = {h_{1}, h_{2}, h_{3}, h_{4}, w_{H̃}} have been introduced:
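The special cases listed above can be sketched as a small data type. This is only an illustration: the class name `GFN`, the method `kinds`, the ordering check t_{1} ≤ t_{2} ≤ t_{3} ≤ t_{4}, and the reading of the t_{2} = t_{3} case as the triangular one are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GFN:
    """Generalized fuzzy number (t1, t2, t3, t4; w). Hypothetical helper type."""
    t1: float
    t2: float
    t3: float
    t4: float
    w: float = 1.0

    def __post_init__(self):
        # Assumption: the four points are ordered, as is usual for trapezoids.
        if not (self.t1 <= self.t2 <= self.t3 <= self.t4):
            raise ValueError("expected t1 <= t2 <= t3 <= t4")
        if not (0.0 < self.w <= 1.0):
            raise ValueError("expected 0 < w <= 1")

    def kinds(self):
        """Return the special cases from the definition that apply."""
        labels = []
        if self.w == 1.0:
            labels.append("normal")
        if self.t2 == self.t3:
            labels.append("triangular")
        if self.t1 == self.t2 and self.t3 == self.t4:
            labels.append("crisp interval")
        if self.w == 1.0 and self.t1 == self.t2 == self.t3 == self.t4:
            labels.append("real number")
        return labels
```

For instance, `GFN(0.1, 0.3, 0.3, 0.5, 0.8).kinds()` reports only the triangular case, because the weight is below 1 and the two outer points differ from the inner ones.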
Chen [2] proposed a method for measuring the similarity of GFNs, based on the geometric distance. The author used this method to address problems of fuzzy risk analysis and subjective mental workload assessment.
Hsieh and Chen [3] introduced an approach for measuring the similarity of GFNs, based on the average integration representation distance.
where d(T̃, H̃) = |P(T) − P(H)|,
Lee [4] presented a measure for the similarity of GFNs, based on the L_{P} norm. The author used this measure to aggregate individual opinions into a group consensus in group decision making.
where S_{T̃} = t_{4} − t_{1}, S_{H̃} = h_{4} − h_{1}.
Wei and Chen [7] proposed an approach for measuring the similarity of GFNs, based on the concepts of geometric distance, the perimeter and the height of GFNs. The authors used this approach to address problems of fuzzy risk analysis.
Recently, Xu et al. [8] modified the similarity method S_{CC} proposed by Chen and Chen [5], introducing the new similarity S_{X}. The authors used this method to address problems of fuzzy risk analysis.
Chen [9] introduced a method for measuring the similarity of GFNs, based on the geometric mean operator. The author used this approach to address fuzzy-number problems in information retrieval.
We now present a novel mathematical model, MCESTA (Mohamedou Cheikh Elghotob Cheikh Saad bouh Cheikh Tourad Abass), Abdelmounaim Abdali, a large-scale similarity method between two GTFNs, T̃ and H̃. It is a kind of hybrid of the similarity measures S_{k} and
Our model will be as follows:
where S_{k} is a similarity method between T̃ and H̃,
The descriptions of the similarity methods employed, the cosine coefficient and the Jaccard Index [14], are given below. This choice of similarity measures will be validated in Section 4. We have n = 2, S_{1} = cos and S_{2} = Jaccard and we choose m_{1} = 1 and m_{2} = 2, which implies, in (
for each k we provide a corresponding m_{k}.
The cosine coefficient measures the similarity between two vectors by comparing their directions. cos(T̃, H̃) is calculated as follows:
where x_{k} = t_{k} for k = 1, 2, 3, 4, x_{5} = w_{T̃}, x_{6} = t_{2} − t_{1}, x_{7} = t_{3} − t_{2}, x_{8} = t_{4} − t_{3}, and y_{k} = h_{k} for k = 1, 2, 3, 4, y_{5} = w_{H̃}, y_{6} = h_{2} − h_{1}, y_{7} = h_{3} − h_{2}, y_{8} = h_{4} − h_{3}.
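As a minimal sketch, assuming that cos(T̃, H̃) is the standard cosine of the eight-component vectors x and y built from the components listed here, the computation could look as follows. The function names `gtfn_vector` and `cosine` are illustrative and do not come from the paper.

```python
import math


def gtfn_vector(t, w):
    """Build the 8-component vector (t1..t4, w, and the three side lengths)."""
    t1, t2, t3, t4 = t
    return [t1, t2, t3, t4, w, t2 - t1, t3 - t2, t4 - t3]


def cosine(x, y):
    """Standard cosine of the angle between two real vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Example inputs chosen for illustration only.
x = gtfn_vector((0.1, 0.2, 0.3, 0.4), 1.0)
y = gtfn_vector((0.2, 0.3, 0.4, 0.5), 0.8)
similarity = cosine(x, y)
```

Because the cosine depends only on the direction of the two vectors, the result lies in [−1, 1] and does not change when both vectors are multiplied by the same positive factor.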
The Jaccard Index, Jaccard(T̃, H̃) [10], measures the similarity as the size of the intersection divided by the size of the union.
Jaccard_{1}(T̃, H̃) is calculated as follows: x_{k} = t_{k} − m for k = 1, 2, 3, 4, x_{5} = w_{T̃}, x_{6} = t_{2} − t_{1}, x_{7} = t_{3} − t_{2}, x_{8} = t_{4} − t_{3}, and y_{k} = h_{k} − m for k = 1, 2, 3, 4, y_{5} = w_{H̃}, y_{6} = h_{2} − h_{1}, y_{7} = h_{3} − h_{2}, y_{8} = h_{4} − h_{3}, where m = min(t_{1}, h_{1}).
Jaccard_{2}(T̃, H̃) is calculated as follows:
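The shifted components of Jaccard_{1} can be sketched in code. The formula that combines them is not reproduced in this text, so the example assumes the usual real-valued (Tanimoto) form of the Jaccard Index, Σ x_{k} y_{k} / (Σ x_{k}² + Σ y_{k}² − Σ x_{k} y_{k}); that choice and the names `jaccard1_vectors` and `tanimoto` are assumptions made for illustration.

```python
def jaccard1_vectors(t, w_t, h, w_h):
    """Build x and y for Jaccard_1: positions shifted by m = min(t1, h1)."""
    m = min(t[0], h[0])
    x = [tk - m for tk in t] + [w_t, t[1] - t[0], t[2] - t[1], t[3] - t[2]]
    y = [hk - m for hk in h] + [w_h, h[1] - h[0], h[2] - h[1], h[3] - h[2]]
    return x, y


def tanimoto(x, y):
    """Real-valued Jaccard (Tanimoto) coefficient; assumed combining formula."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denom == 0.0:
        raise ValueError("coefficient is undefined for two zero vectors")
    return dot / denom


# Example inputs chosen for illustration only.
x, y = jaccard1_vectors((0.1, 0.2, 0.3, 0.4), 1.0, (0.2, 0.3, 0.4, 0.5), 1.0)
similarity = tanimoto(x, y)
```

Subtracting m moves the leftmost of the two fuzzy numbers to the origin, so the four position components reflect the relative placement of T̃ and H̃ rather than their absolute location.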
Our choice is based on a fuzzy inference system (FIS); such systems are well established and have been used in many different fields [15, 16]. To calculate the weights mentioned in
Table 1 presents the FIS fuzzy rules for MCESTA, a set of semantic declarations that define how the Mamdani FIS makes its decisions, mapping the input states (Cosine, Jaccard_{1} and Jaccard_{2}) to the outputs (C, J_{1} and J_{2}). This table uses the values of the functions presented in Figure 2.
Figure 3 shows the FIS diagram for MCESTA, which describes the procedure for developing the association relation (from a given input value to an output value) based on the concepts of fuzzy logic. This figure shows the most important values for the weights C, J_{1} and J_{2}.
Figures 4 to 9 show the FIS rule surface diagrams for the weights (C, J_{1} and J_{2}) in MCESTA.
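The rule base of Table 1 has a regular structure: every combination of the three input levels appears once, each output takes the level of its own input, and every rule has weight 1. A short sketch that enumerates the rules in the same order as the table is given here; the function name `mcesta_rules` and the dictionary layout are illustrative choices, and the sketch only lists the rules, it does not perform Mamdani inference.

```python
from itertools import product

LEVELS = ("L", "M", "H")  # Low, Medium, High


def mcesta_rules():
    """Enumerate the 27 rules in the order used by Table 1."""
    rules = []
    # Cos varies slowest, then Jaccard1, then Jaccard2, as in the table.
    for cos, j1, j2 in product(LEVELS, repeat=3):
        rules.append({
            "if": {"Jaccard1": j1, "Jaccard2": j2, "Cos": cos},
            "then": {"J1": j1, "J2": j2, "C": cos},
            "weight": 1,
        })
    return rules


for rule in mcesta_rules()[:3]:
    print(rule)
```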
This property has been used by Hwang and Yang [10] for validating their proposed similarity method. We have that
for T̃_{1} = α × T̃ = (α × t_{1}, α × t_{2}, α × t_{3}, α × t_{4}, α × w_{T̃}) and H̃_{1} = α × H̃ = (α × h_{1}, α × h_{2}, α × h_{3}, α × h_{4}, α × w_{H̃}), where α > 0. Ideally, a good similarity measure should be scale-free, which is a very significant property at large scale. From this result, we deduce that a very large data set Set_{1} = (T̃_{1}, H̃_{1}) can be analyzed through a smaller data set Set = (T̃, H̃): because of the scale relation T̃_{1} = α × T̃ and H̃_{1} = α × H̃, the interpretation and analysis of Set give a global idea of the interpretation of Set_{1}. This is important in the context of big data, where it can be difficult to process everything or the machines do not have sufficient capacity.
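The scale relation can be checked numerically for the cosine component. The sketch assumes the standard cosine of the eight-component vectors (positions, weight and side lengths) and scales all five parameters by α, as in the relation stated here; the helper names and the example values are illustrative only.

```python
import math


def gtfn_vector(t, w):
    """8-component vector: t1..t4, w, and the three side lengths."""
    t1, t2, t3, t4 = t
    return [t1, t2, t3, t4, w, t2 - t1, t3 - t2, t4 - t3]


def cosine(x, y):
    """Standard cosine between two non-zero real vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def scaled(t, w, alpha):
    """Scale all five parameters of a GTFN by alpha > 0."""
    return tuple(alpha * tk for tk in t), alpha * w


alpha = 0.5
T, wT = (0.1, 0.2, 0.3, 0.4), 1.0
H, wH = (0.2, 0.3, 0.4, 0.5), 0.8

base = cosine(gtfn_vector(T, wT), gtfn_vector(H, wH))
T1, wT1 = scaled(T, wT, alpha)
H1, wH1 = scaled(H, wH, alpha)
rescaled = cosine(gtfn_vector(T1, wT1), gtfn_vector(H1, wH1))

print(math.isclose(base, rescaled))
```

The two values agree up to floating-point rounding, since every component of both vectors is multiplied by the same factor α.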
To see that our model verifies the three properties, it is enough that every similarity (Cosine, Jaccard_{1} and Jaccard_{2}) used in this model satisfies these properties. For the sub-measures Jaccard_{1} and Jaccard_{2}, this is proved by Hwang and Yang [10]. It remains to treat Cosine.
The ⇒ direction is proved by observing that since T̃ = H̃, we have x_{k} = y_{k} for k = 1, 2, ..., 8. Therefore
Thus,
The ⇐ is proved as follows: Since
we have that:
(a)
(b)
we have:
so (a) − (b) = 0 ⇒ (x_{k} × y_{q}) × [(x_{k} × y_{q}) − (x_{q} × y_{k})] = 0. Since x_{k} ≠ 0 and y_{k} ≠ 0, (a) − (b) = 0 ⇒ (x_{k} × y_{q}) − (x_{q} × y_{k}) = 0 ⇒ ∃λ/∀λ ≥ 0,
Therefore, x_{k} = y_{k} for k = 1, 2, ..., 8, that is, T̃ = H̃.
This follows since
Suppose T̃_{1} = α × T̃ = (α × t_{1}, α × t_{2}, α × t_{3}, α × t_{4}, α × w_{T̃}) and H̃_{1} = α × H̃ = (α × h_{1}, α × h_{2}, α × h_{3}, α × h_{4}, α × w_{H̃}), where α > 0.
We have
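For the cosine term, the scale-free property can be written out as a short derivation. It assumes that cos(T̃, H̃) is the standard cosine of the eight-component vectors x and y defined for the cosine coefficient, and that α > 0 so that the norms do not vanish; under the scaling, every component becomes α x_{k} or α y_{k}, including the side lengths such as α t_{2} − α t_{1}.

```latex
\begin{align*}
\cos(\tilde{T}_1,\tilde{H}_1)
  &= \frac{\sum_{k=1}^{8} (\alpha x_k)(\alpha y_k)}
          {\sqrt{\sum_{k=1}^{8} (\alpha x_k)^2}\,\sqrt{\sum_{k=1}^{8} (\alpha y_k)^2}} \\
  &= \frac{\alpha^{2} \sum_{k=1}^{8} x_k y_k}
          {\alpha \sqrt{\sum_{k=1}^{8} x_k^{2}} \cdot \alpha \sqrt{\sum_{k=1}^{8} y_k^{2}}} \\
  &= \frac{\sum_{k=1}^{8} x_k y_k}
          {\sqrt{\sum_{k=1}^{8} x_k^{2}}\,\sqrt{\sum_{k=1}^{8} y_k^{2}}}
   = \cos(\tilde{T},\tilde{H}), \qquad \alpha > 0 .
\end{align*}
```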
Our approach to similarity and the existing methods S_{C} (
To validate our contribution and compare it with the existing methods, we made the calculations on the sets already used by Hwang and Yang [10] in 2014. There are 21 sets (Set_{1} to Set_{21}) of fuzzy numbers, shown in Figure 10 to Figure 30 respectively. The similarity values obtained by our approach and by the existing methods are given in Table 2. From the table, we can demonstrate and analyze the weaknesses and limitations of each of the existing similarity measures.
We can analyze Table 2 in terms of three types: incorrect results, scale-dependent results, and direction.
In Set_{1}, S_{HC}(Ã, Ã_{1}) = 1, so we judge the result to be incorrect.
For Set_{3} and Set_{4}, the pair in Set_{4} should be more similar than the pair in Set_{3}, but the similarity methods S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}) and S_{L}(Ã, Ã_{1}) produce the same result for both sets, so we judge the results to be incorrect.
In Set_{5}, Ã ≠ Ã_{1}, yet the similarity produced by S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}) and S_{L}(Ã, Ã_{1}) is 1, so the result is incorrect.
In Set_{6}, S_{L}(Ã, Ã_{1}) cannot evaluate the degree of similarity, so we judge the result to be incorrect.
For Set_{7}, the similarity produced by S_{L}(Ã, Ã_{1}) is 0, so the result is incorrect.
For Set_{8} and Set_{9}, the similarity of Set_{8} should be different from that of Set_{9}, but the similarity methods S_{C}(Ã, Ã_{1}) and S_{HC}(Ã, Ã_{1}) produce the same result for both sets, so the result is incorrect.
For Set_{9} and Set_{10}, the pair in Set_{9} should be more similar than the pair in Set_{10}. However, Table 2 shows that the similarities produced by the methods S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}), S_{L}(Ã, Ã_{1}), S_{CC}(Ã, Ã_{1}) and S_{X}(Ã, Ã_{1}) are incorrect.
In Set_{11}, Ã ≠ Ã_{1}, yet the similarity produced by S_{HC}(Ã, Ã_{1}) is 1, so we judge the result to be incorrect.
In Set_{14}, Ã ≠ Ã_{1}, yet the similarity produced by S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}) and S_{L}(Ã, Ã_{1}) is 1, so the result is incorrect.
For Set_{14} and Set_{15}, the pair in Set_{14} should be more similar than the pair in Set_{15}. However, Table 2 shows that the similarities produced by the methods S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}), S_{L}(Ã, Ã_{1}), S_{CC}(Ã, Ã_{1}), S_{X}(Ã, Ã_{1}) and S_{SJ}(Ã, Ã_{1}) are incorrect.
For the two different sets Set_{16} and Set_{17}, the methods S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}), S_{CC}(Ã, Ã_{1}) and S_{X}(Ã, Ã_{1}) produce incorrect results.
For Set_{18} and Set_{19}, the pair in Set_{18} should be more similar than the pair in Set_{19}, but the similarity methods S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}), S_{L}(Ã, Ã_{1}), S_{CC}(Ã, Ã_{1}), S_{WC}(Ã, Ã_{1}), S_{X}(Ã, Ã_{1}) and S_{SJ}(Ã, Ã_{1}) produce the same result for both sets, so the result is incorrect.
Set_{20} and Set_{21} are in a double-scale relation. A good similarity method should satisfy property 3 (scale-free), but the similarity methods S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}), S_{L}(Ã, Ã_{1}), S_{CC}(Ã, Ã_{1}), S_{WC}(Ã, Ã_{1}), S_{X}(Ã, Ã_{1}) and S_{SJ}(Ã, Ã_{1}) are scale-dependent (not scale-free), while our approach is scale-free.
The direction similarity is very important. This measurement is used in the big data framework when the data tends towards the infinite. However, the similarity methods S_{C}(Ã, Ã_{1}), S_{HC}(Ã, Ã_{1}), S_{L}(Ã, Ã_{1}), S_{CC}(Ã, Ã_{1}), S_{WC}(Ã, Ã_{1}), S_{X}(Ã, Ã_{1}) and S_{SJ}(Ã, Ã_{1}) do not use the direction, while our approach uses this measurement with a weight C = 0.508 (see Figure 3).
The analysis of Table 2 indicates that the three properties, together with the direction information provided by the cosine coefficient, make this similarity approach the best choice among the compared methods for large-scale use.
A novel mathematical model, MCESTA, for GTFNs has been presented (as well as the existing methods). This model is based on weights associated with each of several different similarity measures. We inferred these importance weights with a Mamdani-type FIS [17]. The model uses the cosine coefficient and the Jaccard Index. Three properties of the model are proved; one of them is advantageous for use with large-scale datasets. A comparative study has also been presented to explain how our novel approach can overcome the limitations and weaknesses of the existing methods. This approach will help us develop intelligent filtering for the pub-sub system [20, 21].
No potential conflict of interest relevant to this article was reported.
The Mamdani for MCESTA.
FIS value membership functions.
FIS-diagram for the MCESTA.
FIS-rule surface diagram for C by Cosine and Jaccard1 in MCESTA.
FIS-rule surface diagram for C by Cosine and Jaccard2 in MCESTA.
FIS-rule surface diagram for J1 by Jaccard1 and Cosine in MCESTA.
FIS-rule surface diagram for J1 by Jaccard1 and Jaccard2 in MCESTA.
FIS-rule surface diagram for J2 by Jaccard2 and Cosine in MCESTA.
FIS-rule surface diagram for J2 by Jaccard2 and Jaccard1 in MCESTA.
Set_{1}.
Set_{2}.
Set_{3}.
Set_{4}.
Set_{5}.
Set_{6}.
Set_{7}.
Set_{8}.
Set_{9}.
Set_{10}.
Set_{11}.
Set_{12}.
Set_{13}.
Set_{14}.
Set_{15}.
Set_{16}.
Set_{17}.
Set_{18}.
Set_{19}.
Set_{20}.
Set_{21}.
FIS fuzzy rule for MCESTA
IF (Jaccard_{1} is L) and (Jaccard_{2} is L) and (Cos is L) THEN (J_{1} is L) (J_{2} is L) (C is L) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is M) and (Cos is L) THEN (J_{1} is L) (J_{2} is M) (C is L) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is H) and (Cos is L) THEN (J_{1} is L) (J_{2} is H) (C is L) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is L) and (Cos is L) THEN (J_{1} is M) (J_{2} is L) (C is L) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is M) and (Cos is L) THEN (J_{1} is M) (J_{2} is M) (C is L) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is H) and (Cos is L) THEN (J_{1} is M) (J_{2} is H) (C is L) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is L) and (Cos is L) THEN (J_{1} is H) (J_{2} is L) (C is L) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is M) and (Cos is L) THEN (J_{1} is H) (J_{2} is M) (C is L) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is H) and (Cos is L) THEN (J_{1} is H) (J_{2} is H) (C is L) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is L) and (Cos is M) THEN (J_{1} is L) (J_{2} is L) (C is M) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is M) and (Cos is M) THEN (J_{1} is L) (J_{2} is M) (C is M) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is H) and (Cos is M) THEN (J_{1} is L) (J_{2} is H) (C is M) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is L) and (Cos is M) THEN (J_{1} is M) (J_{2} is L) (C is M) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is M) and (Cos is M) THEN (J_{1} is M) (J_{2} is M) (C is M) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is H) and (Cos is M) THEN (J_{1} is M) (J_{2} is H) (C is M) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is L) and (Cos is M) THEN (J_{1} is H) (J_{2} is L) (C is M) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is M) and (Cos is M) THEN (J_{1} is H) (J_{2} is M) (C is M) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is H) and (Cos is M) THEN (J_{1} is H) (J_{2} is H) (C is M) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is L) and (Cos is H) THEN (J_{1} is L) (J_{2} is L) (C is H) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is M) and (Cos is H) THEN (J_{1} is L) (J_{2} is M) (C is H) (1). |
IF (Jaccard_{1} is L) and (Jaccard_{2} is H) and (Cos is H) THEN (J_{1} is L) (J_{2} is H) (C is H) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is L) and (Cos is H) THEN (J_{1} is M) (J_{2} is L) (C is H) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is M) and (Cos is H) THEN (J_{1} is M) (J_{2} is M) (C is H) (1). |
IF (Jaccard_{1} is M) and (Jaccard_{2} is H) and (Cos is H) THEN (J_{1} is M) (J_{2} is H) (C is H) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is L) and (Cos is H) THEN (J_{1} is H) (J_{2} is L) (C is H) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is M) and (Cos is H) THEN (J_{1} is H) (J_{2} is M) (C is H) (1). |
IF (Jaccard_{1} is H) and (Jaccard_{2} is H) and (Cos is H) THEN (J_{1} is H) (J_{2} is H) (C is H) (1). |
L, Low; M, Medium; H, High.
Comparison
Method/Set | S | S | S | S | S | S | S | Our approach |
---|---|---|---|---|---|---|---|---|
1 | 0.9167 | 0.975 | 0.8357 | 0.95 | 0.9627 | 0.8356 | 0.9877 | |
2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
3 | 0.42 | 0.682 | 0.7136 | 0.5997 | 0.8475 | |||
4 | 0.49 | 0.7 | 0.7158 | 0.7 | 0.8551 | |||
5 | 0.8 | 0.8248 | 0.9652 | 0.8 | 0.9797 | |||
6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | |
7 | 0.9091 | 0.9 | 0.9 | 0.9 | 0.9053 | 0.9 | 0.9725 | |
8 | 0.5 | 0.54 | 0.8411 | 0.8631 | 0.5991 | 0.9482 | ||
9 | 0.9 | 0.9 | 0.9756 | |||||
10 | 0.7833 | 0.8974 | 0.9311 | |||||
11 | 0.75 | 0.9 | 0.72 | 0.8003 | 0.9127 | 0.72 | 0.9572 | |
12 | 0.8 | 0.9375 | 0.9 | 0.78 | 0.8309 | 0.8904 | 0.8959 | 0.9068 |
13 | 0.75 | 0.9091 | 0.9 | 0.81 | 0.9 | 0.9053 | 0.9 | 0.979 |
14 | 0.7209 | 0.9484 | ||||||
15 | 0.75 | 0.95 | 0.6215 | 0.9187 | ||||
16 | 0.4 | 0.6222 | 0.6971 | 0.8125 | ||||
17 | 0.25 | 0.7 | 0.7 | 0.828 | ||||
18 | 0.8551 | |||||||
19 | 0.7636 | |||||||
20 | 0.8551 | |||||||
21 | 0.8551 |
Bold text represents incorrect results and italicized text, scale-dependent results.
E-mail: cheikhtouradmohamedou@gmail.com