## Original Article


International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 339-349

Published online December 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.4.339

© The Korean Institute of Intelligent Systems

## Extended Siamese Convolutional Neural Networks for Discriminative Feature Learning

Sangyun Lee and Sungjun Hong

School of Information Technology, Sungkonghoe University, Seoul, Korea

Correspondence to :
Sungjun Hong (sjhong@skhu.ac.kr)

Received: November 9, 2021; Revised: June 30, 2022; Accepted: October 17, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Siamese convolutional neural networks (SCNNs) have been considered among the best deep learning architectures for visual object verification. However, these models have the drawback that each branch extracts features independently without considering the other branch, which sometimes leads to unsatisfactory performance. In this study, we propose a new architecture called an extended SCNN (ESCNN) that addresses this limitation by learning both independent and relative features for a pair of images. ESCNNs also have a feature augmentation architecture that exploits the multi-level features of the underlying SCNN. The results of feature visualization showed that the proposed ESCNN can encode relative and discriminative information for the two input images at multi-level scales. Finally, we applied an ESCNN model to a person verification problem, and the experimental results indicate that the ESCNN achieved an accuracy of 97.7%, outperforming an SCNN model with 91.4% accuracy. The results of ablation studies also showed that a small version of the ESCNN achieved an accuracy 5.6 percentage points higher than that of an SCNN model.

Keywords: Discriminative feature, Feature augmentation, Object verification, Siamese convolutional neural network

Given a pair of images, determining whether they show the same or different objects is an important problem in computer vision and image analysis [1]. This problem is called visual object verification, and it plays a key role in large-scale computer vision systems such as face verification [2] and object tracking [3]. Visual object verification differs from visual categorization and is especially challenging owing to issues such as changes in camera viewpoint, variations in illumination, and occlusion. For example, the visual appearance of a single pedestrian may vary owing to changes in perspective, pose, or lighting.

Over the past decade, handcrafted descriptors such as SIFT and SURF have been used in previous studies on visual object verification [4, 5]. Unfortunately, methods based on descriptors do not perform well for this problem owing to the lack of category-specific training [6]. Furthermore, the handcrafted descriptors are designed to be robust to small differences; thus, they are not well-suited for the visual verification problem, which requires even small differences between two images to be exploited as fully as possible. In contrast, convolutional neural networks (CNNs) have become popular for their ability to learn rich features and have demonstrated record-breaking performance in many computer vision tasks over the past few years, including semantic segmentation [7] and image classification [8]. CNNs have also been applied to the visual object verification problem using learned features in [9, 10].

The Siamese convolutional neural network (SCNN) proposed by LeCun et al. [2] has received considerable attention as a possible solution to the visual object verification problem [11–14]. Typical SCNNs learn not only the features but also a similarity metric for a given visual object verification problem. Their topology is shown in Figure 1. As shown in the figure, SCNNs have two parallel branches, one for each image, and the two branches share the same weights. Thus, if the two images provided to the SCNN are the same, the corresponding output feature maps should also be the same. At the end of the network, distance functions such as Euclidean distance are used to provide the similarity (more precisely, distance) metric for the two given images. The output map from each branch can be viewed as a feature that discriminates the given image from that of the opposite branch. Following the success of SCNNs, several variants have been proposed to learn similarity metrics, such as the Triplet [15] and Quadruplet [16] networks.

However, SCNNs and their variants have the weakness that the feature map computed for each branch depends only on the image applied to that branch and is completely independent of the image applied to the opposite branch. More specifically, let us consider the following two cases. In the first case, the two images, A and A′, applied to an SCNN originate from the same object. In the second case, the two images A and B come from different objects. Because the feature map of each branch only uses the image applied to that branch, the output feature map for branch A is the same in both cases. Triplet [15] and Quadruplet [16] models have been developed to learn discriminative features with more sophisticated loss functions, but they still have the same weakness. For example, Triplets have three branches for three images A, A′ and B and their output features are completely independent of the other images. The independence of the output maps generated for the two branches reduces the effectiveness of discriminative feature learning. To the best of our knowledge, no definitive solutions to this problem have been reported.

To solve this problem, we propose a new SCNN variant called extended SCNN (ESCNN). The proposed ESCNN learns not only independent features for each image in a pair but also their relative features to improve its ability to discriminate between the two images. To learn the two types of features, ESCNNs include a feature augmentation architecture designed to exploit the multilevel features of the underlying conventional SCNN. The augmented features provide discriminatory information for two images that offer additional visual object verification cues. Furthermore, inspired by recent successful applications of fully connected layers (FCLs) in similarity metric learning [9,11,12], a fully connected layer is also used in the ESCNN to decide whether the input pairs are from the same or different objects.

The two main contributions of this study are summarized as follows.

• A new SCNN variant called an extended SCNN (ESCNN) is proposed for visual object verification. Compared with an SCNN, the proposed ESCNN exploits relative information between branches and extracts new discriminative features to improve its performance in learning a similarity metric.

• The performance of ESCNNs for visual object verification is analyzed from two perspectives. First, the feature maps of each ESCNN layer are visualized, and we show that the proposed ESCNN can encode relative and discriminative information for the two input images. Second, the proposed ESCNN is applied to the person verification problem, and we demonstrate that the ESCNN significantly outperforms previous methods using handcrafted descriptors (e.g., SIFT, SURF) and a conventional SCNN. Further, the results show that ESCNNs can achieve better performance with fewer weights than an SCNN requires.

The remainder of this study is organized as follows. In Section 2, we briefly review some related works and present a description of the problem considered. Section 3 introduces the proposed ESCNN and its training strategy. Section 4 discusses the experimental setup, including the dataset used, other models used for comparison, and the implementation details. The results are then described, and the advantages of the ESCNN are compared with those of the other models. Finally, Section 5 presents our conclusions.

### 2.1 Related Work

A typical approach to visual object verification is to extract features from a pair of images and compute a similarity metric between these features [6]. To extract the features, handcrafted feature descriptors such as SIFT and SURF have been used for the verification problem [4, 5]. However, handcrafted features often require expensive human labor and rely on expert knowledge. Without category-specific training, these methods do not perform well in visual object verification [6]. Recently, a method was proposed to learn these features [17]. However, further processing is generally required to compute the similarity metric using methods such as support-vector machines or Euclidean distance.

Thus, deep learning has attracted increasing attention in the field of computer vision, and several deep learning methods have been proposed. Naturally, similarity metric learning methods using deep learning have also been proposed [9, 11, 18]. In particular, the authors of a recent study [18] demonstrated that the features learned by a CNN trained on ImageNet outperformed SIFT. These advances have encouraged the use of CNNs in various practical applications, including face verification [2].

The most popular CNN architecture for similarity metric learning is the SCNN model proposed by LeCun et al. [2]. As shown in Figure 1, SCNNs have two branches that share the same architecture and weight values. In general, the branch outputs are passed to a distance function (e.g., squared Euclidean distance) to compute the similarity metric for the input pair [2]. However, [9, 11, 12] used multiple linear FCLs as an alternative to the distance function. In fact, the results of [11] show that this learned metric performs better than the Euclidean distance function. Thus, we used FCLs in the proposed network for similarity metric learning.

In this paper, we propose a structural enhancement of a conventional SCNN. Several variants inspired by the success of SCNNs were proposed in [15, 16]. In [15], a Triplet network was proposed that consisted of three branches for more efficient training. In addition, a Quadruplet network consisting of four branches was proposed for the local descriptor problem [16]. Although SCNN variants have demonstrated outstanding performance in various tasks, they have the inherent structural limitation that they extract features for each image independently without knowledge of the other image in the pair. Therefore, to address this issue, we propose an SCNN variant that extracts new discriminative features by exploiting the relative differences in input pairs.

### 2.2 Problem Statement

Visual object verification is a special case of object recognition in which the object category is known (e.g., pedestrians or cars), and one must determine whether a given pair of images represent the same object. We consider the visual object verification problem as a classification problem that takes a pair of images as input and outputs a classification result for the pair of images, either positive if the two images depict the same object or negative if the images show different objects. In other words, we developed an SCNN variant to solve the classification problem for a given dataset D as follows.

$$D = \left\{ \left(x_1^{(n)}, x_2^{(n)}, y^{(n)}\right) \,\middle|\, n = 1, \ldots, N,\; y^{(n)} \in \{0, 1\} \right\}, \tag{1}$$

where $x_1^{(n)}$ and $x_2^{(n)}$ are the $n$th pair of images, $y^{(n)}$ is the class label representing whether the pair of images comes from the same object ($y^{(n)} = 1$) or not ($y^{(n)} = 0$), and $N$ is the number of image pairs. As discussed in the previous subsection, a typical SCNN has only two components, one for extracting discriminative features from each image in an input pair and the other for comparing these features and determining the class of the input pair. In this study, we propose an SCNN variant with a new extended structure to solve the classification problem and enhance performance. The proposed architecture and an associated training strategy are described in detail below.

### 3. Proposed Architecture

In this section, we describe the proposed deep neural network, called an extended Siamese convolutional neural network (ESCNN). The model is designed to learn discriminative features for pairs of images and provide cues to determine whether two given images depict the same object. The ESCNN architecture, as shown in Figure 2, consists of three parts, including Siamese, extension, and decision components. These are explained in the following subsections.

### 3.1 Siamese Part

To extract discriminative features for a pair of input images, we used a typical Siamese architecture with two branches. This network model processes the two inputs separately using the same convolution weights. In the proposed approach, the Siamese part consists of L(= 5) layers, as shown in Figure 2(a). Each layer in the figure consists of a convolution layer with a fixed kernel size of three and a stride of one, batch normalization, a leaky rectified linear unit (l-ReLU), and max pooling with a filter size and a stride of two. In the last two layers, a convolutional operation without padding is applied to reduce the size of the subsequent layers.

RGB images of size 128 × 64 × 3 are applied to the first convolution layer of the Siamese part. The resulting feature maps are passed through a max-pooling layer that reduces the width and height of the feature maps by half before they are fed into the next convolution layer. At the end of the Siamese part, each input image is represented by a feature map of size 5 × 1 × 100. Finally, these feature maps are vectorized before being fed into the decision module. Clearly, weight sharing between the two branches in the Siamese part enables pairs of images showing the same object to produce similar feature maps because the two images look similar.
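The spatial sizes quoted above can be checked with a short shape trace. The following is a minimal sketch under stated assumptions: the text does not give the padding of the first three layers or say whether the fifth layer is pooled, so the `LAYERS` configuration below (padding of one in layers 1 to 3, no padding in layers 4 and 5, no pooling in layer 5) is only one hypothetical setting that reproduces the 5 × 1 output. The exact configuration is given in Figure 2, which is not reproduced here.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output length of max pooling along one spatial axis."""
    return (size - window) // stride + 1


# (padding, use_pooling) for each of the five layers.
# ASSUMPTION: chosen only so that the trace ends at 5 x 1; not taken from the paper.
LAYERS = [(1, True), (1, True), (1, True), (0, True), (0, False)]


def trace_shapes(height=128, width=64):
    """Return the (height, width) of the feature map after each layer."""
    shapes = []
    for padding, use_pooling in LAYERS:
        height = conv_out(height, padding=padding)
        width = conv_out(width, padding=padding)
        if use_pooling:
            height, width = pool_out(height), pool_out(width)
        shapes.append((height, width))
    return shapes


for layer, shape in enumerate(trace_shapes(), start=1):
    print(f"layer {layer}: {shape[0]} x {shape[1]}")
```

Under these assumptions the trace runs 64 × 32, 32 × 16, 16 × 8, 7 × 3, and finally 5 × 1, which matches the stated output size once 100 channels are attached.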

### 3.2 Extension Part

The Siamese part outputs two feature vectors f5 and g5 for each pair of input images, as shown in Figure 2, and the two feature vectors include cues as to whether the two images show the same object. Unfortunately, however, the two feature vectors are computed independently of each other without considering the other image; thus, the cues provided by the two feature vectors do not suffice to determine whether the two images depict the same object. To solve this problem, we propose a feature-augmentation architecture. Let fℓ and gℓ represent the output feature maps of the ℓth layer (ℓ ∈ {1, 2, 3, 4, 5}) for the two branches of the Siamese part, as shown in Figure 2. As shown in Figure 2(b), the extra layers added below the Siamese part first compute the difference map hℓ = fℓ − gℓ between the feature maps fℓ and gℓ and then pass the augmented difference map through the subsequent layers until the resulting map is reduced to 5 × 1 × 100, which is the same size as the output of the Siamese part. These additional feature maps are also vectorized before being fed into the decision module. The extra layers are used throughout the Siamese part and provide multi-level spatial discriminative information about the two images, as shown in Figure 2(b).
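The motivation for the difference map can be illustrated with a toy example. This is a sketch only: `branch` is a hypothetical single-layer stand-in for a shared-weight Siamese branch (the real branches are convolutional), and the weights and three-element "images" are made-up values. It shows that the branch feature of image A is identical whether A is paired with A′ or with B, whereas the difference map depends on the pair.

```python
def branch(image, weights):
    """Hypothetical shared-weight branch: one linear layer followed by leaky ReLU."""
    features = []
    for row in weights:
        s = sum(w * x for w, x in zip(row, image))
        features.append(s if s > 0 else 0.1 * s)
    return features


def difference_map(f, g):
    """Elementwise difference h = f - g between the two branch outputs."""
    return [a - b for a, b in zip(f, g)]


# Made-up weights and inputs, used for illustration only.
WEIGHTS = [[1.0, -1.0, 0.5], [0.5, 0.5, -1.0]]
image_a = [1.0, 2.0, 3.0]
image_a_prime = [1.0, 2.0, 2.5]   # same object, slightly different appearance
image_b = [3.0, 0.0, 1.0]         # different object

f_in_positive_pair = branch(image_a, WEIGHTS)
f_in_negative_pair = branch(image_a, WEIGHTS)
h_positive = difference_map(f_in_positive_pair, branch(image_a_prime, WEIGHTS))
h_negative = difference_map(f_in_negative_pair, branch(image_b, WEIGHTS))

print("branch feature identical in both pairs:", f_in_positive_pair == f_in_negative_pair)
print("difference map, positive pair:", h_positive)
print("difference map, negative pair:", h_negative)
```

In this toy setting the difference map has a much larger magnitude for the negative pair, which is the kind of pair-dependent cue the extension part is designed to pass on to the decision part.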

Figure 3 shows a visualization of the feature responses obtained at each layer of the proposed network except for the 5th layer (ℓ = 5). Each feature response of size h×w×d is visualized by taking the average of the feature maps of size h×w×1 and upsampling to 128 × 64 × 1 using bilinear interpolation [19].

Figure 3(a) shows a positive pair and the corresponding feature responses, from which it may be observed that the activation pairs fℓ and gℓ appear similar because the pair is positive. The activations hℓ introduced by the extension are generally low, which means that there is little difference between the two given images. In contrast, a negative pair and the corresponding feature responses are shown in Figure 3(b). Because the two input images appear significantly different, the activations f1 and g1 also look different. However, as the features pass through the subsequent layers, the differences gradually disappear, and the features f4 and g4 in the last visualized layer show no significant differences for the upper body. In this case, the extension can remedy the problem by providing discriminative features for a pair of images. In Figure 3(b), the activation of the difference map h1 is high for the upper body, indicating that the parts of the two people are quite different. As the features passed through the subsequent layers from h1 to h14, stronger activations were observed in the feature maps. Furthermore, the extension exploits multi-level feature differences h1, h2, h3, and h4, which enables it to capture discriminative information for the two images at multiple scales. For example, the features in h1 focus on large differences, such as between the upper bodies, whereas the other features in h2, h3, and h4 represent small-scale differences, such as between the heads. It may also be observed that the difference features hℓ in Figure 3(b) are more strongly activated than those in Figure 3(a), implying that the extension can effectively extract discriminative information from the feature maps of the Siamese part.

### 3.3 Decision Part

The decision part is formed by adding two FCLs to the previous parts, as shown in Figure 2(c). This part determines whether the two given input images represent the same object using the feature vector fc1 from the Siamese and extension parts. The feature vector fc2 from the first FCL is passed through batch-normalization and l-ReLU nonlinear activation function, then fed into the second FCL and finally a softmax function, producing the output

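$$\hat{y} = \begin{pmatrix} \hat{y}_0 \\ \hat{y}_1 \end{pmatrix} = \begin{pmatrix} \dfrac{\exp(\tilde{y}_0)}{\sum_{j=0}^{1} \exp(\tilde{y}_j)} \\[2ex] \dfrac{\exp(\tilde{y}_1)}{\sum_{j=0}^{1} \exp(\tilde{y}_j)} \end{pmatrix}, \tag{2}$$

where $\tilde{y} = (\tilde{y}_0 \; \tilde{y}_1)^T$ is the output of the second FCL and $\hat{y} = (\hat{y}_0 \; \hat{y}_1)^T$ is the final output of the decision part. The outputs $\hat{y}_0$ and $\hat{y}_1$ represent the probabilities of two given images originating from the same or different objects, respectively.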
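As an illustration of the two-class softmax in the decision part, the following sketch maps a pair of logits to probabilities. The function name `softmax2` and the example logits are hypothetical, and subtracting the maximum logit is a common numerical-stability step that is not mentioned in the text; it does not change the result.

```python
import math


def softmax2(logits):
    """Softmax over the two outputs of the second fully connected layer."""
    shift = max(logits)  # stability trick (assumption, not from the paper)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: index 0 = "same object", index 1 = "different objects".
probabilities = softmax2([2.0, 0.0])
print(probabilities)
```

The two outputs always sum to one, so a decision can be made by comparing them or by thresholding either one.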

### 3.4 Network Training

First, a margin-based contrastive loss function was employed [6] to optimize the Siamese part of the proposed ESCNN. This loss function is intended to encourage positive pairs to move closer together while pushing negative pairs sufficiently far apart in the feature space. The contrastive loss function is defined as

$$L_{\mathrm{cont}}(x_1, x_2, y) = \frac{1}{2} y D^2 + \frac{1}{2} (1 - y) \left\{ \max(0, m - D) \right\}^2, \tag{3}$$

where $y$ is a binary label denoting whether the input pair is positive ($y = 1$) or negative ($y = 0$), $D = \| f_5 - g_5 \|_2$ is the Euclidean distance between the final feature vectors from the Siamese part, and $m > 0$ is the allowed margin for the negative pairs. Negative pairs whose distance is less than $m$ are penalized by (3).
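The contrastive loss can be written out directly from its definition. This is a minimal sketch on plain Python lists; the feature vectors and the margin value of 1.0 are hypothetical, since the margin used in the experiments is not stated in this section.

```python
import math


def euclidean_distance(f, g):
    """Euclidean distance D between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def contrastive_loss(f, g, y, margin=1.0):
    """Margin-based contrastive loss; y = 1 for a positive pair, 0 for a negative pair."""
    d = euclidean_distance(f, g)
    positive_term = 0.5 * y * d ** 2
    negative_term = 0.5 * (1 - y) * max(0.0, margin - d) ** 2
    return positive_term + negative_term


# Hypothetical feature vectors at distance 5.
f5, g5 = [0.0, 0.0], [3.0, 4.0]
print(contrastive_loss(f5, g5, y=1))              # positive pair: penalized for being far apart
print(contrastive_loss(f5, g5, y=0))              # negative pair beyond the margin: no penalty
print(contrastive_loss(f5, f5, y=0, margin=2.0))  # negative pair at distance 0: maximal penalty
```

Positive pairs are pulled together by the squared-distance term, while negative pairs contribute only when they fall inside the margin.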

A cross-entropy loss function is used to train the extension and decision parts, which is defined as

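$$L_{\mathrm{cross}}(y, \hat{y}) = -\left( y \log \hat{y}_1 + (1 - y) \log \hat{y}_2 \right), \tag{4}$$

where $\hat{y} = [\hat{y}_1 \; \hat{y}_2]^T$ denotes the predicted posterior probability (2) returned by the decision part. Here the two entries are indexed from 1, so that $\hat{y}_1$ and $\hat{y}_2$ are the probabilities of the same-object and different-object classes, respectively. While the contrastive loss (3) only affects the Siamese part, the cross-entropy loss (4) affects all parts, including the extension and decision parts. For end-to-end training, the optimization objective function is designed to include both the loss functions (3) and (4) as follows.

$$\min_{W} \frac{1}{N} \sum_{n=1}^{N} \left\{ \lambda L_{\mathrm{cont}}\left(x_1^{(n)}, x_2^{(n)}, y^{(n)}; W\right) + (1 - \lambda) L_{\mathrm{cross}}\left(y^{(n)}, \hat{y}\left(x_1^{(n)}, x_2^{(n)}\right); W\right) \right\}, \tag{5}$$

where $W$ are the trainable weights of the network, $N$ denotes the mini-batch size, and $\lambda$ is the balancing parameter between the two loss functions. The overall training strategy is illustrated in Figure 4.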
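The weighted combination of the two losses over a mini-batch can be sketched as follows. The per-sample loss values and the choice of `lam=0.5` are hypothetical (the value of λ is not given in this section), and the small `eps` inside the logarithm is an added numerical safeguard rather than part of the stated loss.

```python
import math


def cross_entropy(y, p_same, p_diff, eps=1e-12):
    """Cross-entropy for one pair; y = 1 means the pair shows the same object."""
    return -(y * math.log(p_same + eps) + (1 - y) * math.log(p_diff + eps))


def combined_objective(cont_losses, cross_losses, lam=0.5):
    """Mini-batch average of lam * L_cont + (1 - lam) * L_cross."""
    n = len(cont_losses)
    total = sum(lam * c + (1 - lam) * x for c, x in zip(cont_losses, cross_losses))
    return total / n


# Hypothetical per-sample losses for a mini-batch of two pairs.
cont = [1.0, 3.0]
cross = [2.0, 4.0]
print(combined_objective(cont, cross, lam=0.5))
```

Setting `lam` to 1 would train only with the contrastive term, which by construction reaches only the Siamese part, whereas any value below 1 lets the cross-entropy term update the extension and decision parts as well.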

### 4.1 iLIDS–VID Dataset

In this section, the proposed ESCNN is applied to the pedestrian verification problem defined by [20]. This is a challenging problem because different people may wear similar clothing, illumination and viewpoints can vary between frames, and pedestrians are likely to be occluded by other objects. The iLIDS–VID dataset was used [21, 22]. This dataset was developed for person re-identification and was created by observing pedestrians from two disjoint camera views in a public open space. It comprises 600 image sequences from 300 distinct individuals. Each image sequence has a different length, ranging from 23 to 192 image frames, with an average of 73. Examples from the iLIDS–VID dataset are shown in Figure 5.

In our experiments, 200 and 50 people were used for training and validation, respectively, and the remaining 50 people were used for testing. To generate a positive pair, two images of the same person were randomly selected from an image sequence. To generate a negative pair, two images of different people were randomly selected from two different image sequences. By repeating this procedure, we generated 20 positive and 20 negative pairs for each person, yielding 8000 pairs for training. In addition, we doubled the size of the training dataset by horizontally reflecting all image pairs. Validation and testing data were generated and augmented in the same manner.
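The pair-generation procedure can be sketched as below. This is an illustrative reconstruction, not the authors' code: `make_pairs`, the dictionary layout (one list of frames per person), sampling two distinct frames for a positive pair, and the fixed random seed are all assumptions. The toy data uses placeholder tuples in place of images.

```python
import random


def make_pairs(sequences, pairs_per_person=20, seed=0):
    """Build (image_1, image_2, label) triples; label 1 = same person, 0 = different."""
    rng = random.Random(seed)
    people = sorted(sequences)
    pairs = []
    for person in people:
        # Positive pairs: two frames of the same person.
        for _ in range(pairs_per_person):
            first, second = rng.sample(sequences[person], 2)
            pairs.append((first, second, 1))
        # Negative pairs: one frame of this person, one frame of another person.
        others = [p for p in people if p != person]
        for _ in range(pairs_per_person):
            other = rng.choice(others)
            pairs.append((rng.choice(sequences[person]), rng.choice(sequences[other]), 0))
    return pairs


# Toy stand-in for the training split: 200 people, 23 placeholder frames each.
toy_sequences = {pid: [(pid, k) for k in range(23)] for pid in range(200)}
pairs = make_pairs(toy_sequences)
print(len(pairs))  # 200 people x (20 positive + 20 negative) = 8000
```

Horizontal reflection of every pair then doubles the 8000 training pairs to 16,000.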

### 4.2 Experimental Results

In this subsection, we describe the results of an experimental evaluation in which the proposed ESCNN was applied to the pedestrian verification problem and its performance compared with that of previous methods, including the standard Siamese method [12] and two handcrafted descriptor methods [4, 5]. Through a comparison with the standard Siamese model, the structural effectiveness of the ESCNN can be demonstrated because the standard Siamese model can be considered a special case of the ESCNN without the extension. That is, the experiment was designed to verify the effectiveness of the extension part of the ESCNN, and it does not include a comparison with other works [16, 23] that focus on better loss functions. In descriptor-based methods [4, 5], the verification score was used to determine whether a pair of images represented the same object. In the experiment, the training dataset was augmented by horizontal reflection and divided into mini-batches of size 64. Both SCNN and ESCNN deep learning methods were implemented on a single NVIDIA GeForce GTX Titan X with an Intel Core i5-4670. The Adam optimizer [24] was used with a constant learning rate of 0.0001 to train the proposed ESCNN and other deep learning methods.

Figure 6 shows examples of the results produced by the methods compared. As shown in the first row of Figure 6(a), all the methods correctly identified positive pairs with clean backgrounds. However, the descriptor-based methods failed to cope with the more complicated pairs, as shown in the second and third rows of Figure 6(a), and it appears that the presence of key points from different backgrounds degraded the verification performance. The standard Siamese model succeeded in recognizing these examples but still had difficulty recognizing the most difficult examples, as shown in the last two rows of Figure 6(a). The ESCNN successfully recognized all examples despite a variety of background conditions.

Figure 6(b) shows the results of some negative examples. All competing methods gave correct answers for easy examples, such as that in the first row, where the two people had different hairstyles and wore different-looking clothes. The two pairs given in the second and third rows of Figure 6(b), however, were more difficult to distinguish because of their similar appearances; thus, the descriptor-based methods failed to cope with these harder negative examples. The standard Siamese model succeeded in distinguishing these two pairs but failed for the most difficult examples, such as those given in the last two rows of Figure 6(b). The ESCNN succeeded in distinguishing all five negative examples, including the last.

To provide a concrete comparison, the results of all methods compared are summarized in terms of the following five measures.

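• True positive rate (TPR) $= \dfrac{\text{number of true positives}}{\text{number of positive samples}}$

• True negative rate (TNR) $= \dfrac{\text{number of true negatives}}{\text{number of negative samples}}$

• Positive predictive value (PPV) $= \dfrac{\text{number of true positives}}{\text{number of positive predictions}}$

• F1 score $= \dfrac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}$

• Accuracy $= \dfrac{\text{number of true predictions}}{\text{total number of samples}}$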
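The five measures listed above can be computed from the four confusion-matrix counts, as in the following sketch. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def verification_metrics(tp, fp, tn, fn):
    """Compute TPR, TNR, PPV, F1 score, and accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # true positives / positive samples
    tnr = tn / (tn + fp)   # true negatives / negative samples
    ppv = tp / (tp + fp)   # true positives / positive predictions
    f1 = 2 * ppv * tpr / (ppv + tpr)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "F1": f1, "Accuracy": accuracy}


# Made-up counts: 100 positive and 100 negative pairs.
metrics = verification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A model with a high TPR but a low TNR, as in this made-up case, accepts many different-person pairs as matches, which is the failure mode the TNR column is meant to expose.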

The results of the quantitative comparison are listed in Table 1. Two different versions of the ESCNN were used for the comparison: that explained in Section 3 and that in which hℓ = |fℓ − gℓ| was used instead of hℓ = fℓ − gℓ, where ℓ ∈ {1, 2, 3, 4, 5} denotes the layer number and |•| denotes the elementwise absolute value. The reason for considering this variant, denoted as ESCNN-abs, was to test the hypothesis that negative values can be beneficial in providing subsequent layers with more discriminative information.
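The distinction between the two variants can be seen in a small example. The sketch below uses made-up feature values and hypothetical helper names; it only illustrates that the signed difference keeps track of which branch responded more strongly, whereas the absolute difference discards that information.

```python
def signed_difference(f, g):
    """Difference map used by the ESCNN: elementwise f - g."""
    return [a - b for a, b in zip(f, g)]


def absolute_difference(f, g):
    """Difference map used by ESCNN-abs: elementwise |f - g|."""
    return [abs(a - b) for a, b in zip(f, g)]


# Made-up feature values for the two branches.
f = [0.2, 0.9, 0.4]
g = [0.7, 0.1, 0.4]

print(signed_difference(f, g), signed_difference(g, f))      # sign flips when inputs are swapped
print(absolute_difference(f, g), absolute_difference(g, f))  # identical in both orders
```

Whether the extra sign information actually helps is an empirical question, which is what the comparison between ESCNN and ESCNN-abs in Table 1 addresses.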

From the results in Table 1, it may be noted that the classical descriptor-based matching methods exhibited a relatively low performance compared with the deep learning methods, which was not sufficient to address this challenging verification problem. The standard Siamese model is a deep learning method, but its performance was insufficient except in terms of TPR. This means that the standard Siamese model often mistook different people for the same person, which is a critical fault for many practical applications such as object tracking and surveillance systems. The two ESCNN variants achieved excellent verification performance, providing significant improvements over competing methods. In particular, the TNR measure showed the largest improvement among all measures. This means that the extension part of the ESCNN contributed significantly to learning discriminative features for negative input pairs. Of the two proposed versions, ESCNN performed better than ESCNN-abs in terms of all measures except TPR; thus, it can be concluded that hℓ = fℓ − gℓ was more effective than hℓ = |fℓ − gℓ| for learning discriminative features from input image pairs. The performance of the proposed network can also be visualized using a receiver operating characteristic (ROC) curve, as shown in Figure 7. The ROC curves showed that the ESCNN variants achieved large improvements in performance compared to competing methods.

The experimental results show that the extension allows the ESCNN to learn more sophisticated discriminative features that provide effective cues for accurate verification. However, it should be noted that the features of the extension part can only be calculated when both images are provided. This implies that the proposed ESCNN cannot be used for the feature embedding of a single image.

### 4.3 Tiny ESCNN Model

A direct comparison of the proposed ESCNN with the standard Siamese model might be unfair because the proposed ESCNN uses more weights than the standard Siamese model owing to the extension. Thus, to provide a fair comparison, a small ESCNN was constructed with a number of weights less than or equal to that of the standard Siamese model. This variant is denoted as ESCNN-tiny, and its configuration is specified by the bracketed numbers in Figure 2. ESCNN-tiny was trained in the same manner as ESCNN. Table 2 presents a comparison of the number of weights used by different networks. Table 3 summarizes the performance of ESCNN-tiny and compares it to that of the standard Siamese model.

As shown in the table, ESCNN-tiny used a smaller number of trainable weights but demonstrated better performance than the standard Siamese model. This clearly shows that the excellent performance of ESCNN was not simply the result of additional features, but also resulted from the structural superiority of the design. The ROC curve of ESCNN-tiny is also compared with those of the others in Figure 7.
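The weight budget behind this comparison can be checked by summing the per-type counts reported in Table 2. The sketch below simply re-adds those published figures; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Trainable-weight counts per type, copied from Table 2.
WEIGHT_COUNTS = {
    "Standard Siamese": {"conv": 144540, "conv_bias": 300, "batch_norm": 800,
                         "fc": 100200, "fc_bias": 102},
    "ESCNN": {"conv": 612540, "conv_bias": 1100, "batch_norm": 2400,
              "fc": 350200, "fc_bias": 102},
    "ESCNN-tiny": {"conv": 62820, "conv_bias": 350, "batch_norm": 900,
                   "fc": 105200, "fc_bias": 102},
}

totals = {name: sum(parts.values()) for name, parts in WEIGHT_COUNTS.items()}
ratio = totals["ESCNN-tiny"] / totals["Standard Siamese"]

for name, total in totals.items():
    print(f"{name}: {total:,} weights")
print(f"ESCNN-tiny / Standard Siamese = {ratio:.2f}")
```

The sums reproduce the totals in Table 2, and ESCNN-tiny uses roughly 69% of the weights of the standard Siamese model.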

In this study, we have proposed a new discriminative feature learning method called ESCNN to solve the visual object verification problem. By exploiting the differences between the images applied to the two branches, the proposed ESCNN learns independent, relative, and discriminative features for the image pairs. The physical properties of the discriminative features learned by the ESCNN were demonstrated by feature visualization, and the performance of the proposed ESCNN was compared with that of previous methods in terms of recall, specificity, precision, F1 score, and accuracy. The results showed that the ESCNN demonstrated significant improvements over the competing methods, including SCNN, and that this remained true even when a smaller number of weights was used.

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (grant number: NRF-2019R1I1A1A01059759).

Fig. 1.

Architecture of a conventional SCNN. The network is trained by contrastive loss in the training stage, whereas a distance function is used to compute the similarity metric in the testing stage.

Fig. 2.

The proposed ESCNN architecture, which consists of three parts: (a) Siamese, (b) extension, and (c) decision parts. The feature dimensions are denoted as h × w × d, and the bracketed numbers correspond to ESCNN-tiny, introduced in Section 4.3

Fig. 3.

Visualization of the features learned by the ESCNN: (a) positive and (b) negative samples.

Fig. 4.

Training strategy of the proposed network. The network is optimized by a combination of two loss functions: 1) contrastive loss for the Siamese part and 2) cross-entropy loss for all parts, including the extension and decision parts.

Fig. 5.

Examples from the iLIDS–VID dataset.

Fig. 6.

Some example results: (a) positive and (b) negative samples.

Fig. 7.

ROC curves for the methods under consideration.

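Table 1. Quantitative results.

| Method | TPR (recall) | TNR (specificity) | PPV (precision) | F1 score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| SIFT [4] | 0.886 | 0.890 | 0.889 | 0.888 | 0.888 |
| SURF [5] | 0.748 | 0.751 | 0.750 | 0.750 | 0.750 |
| Standard Siamese model [12] | 0.979 | 0.849 | 0.866 | 0.919 | 0.914 |
| ESCNN-abs | 0.994 | 0.905 | 0.913 | 0.952 | 0.949 |
| ESCNN | 0.983 | 0.971 | 0.971 | 0.977 | 0.977 |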

Table 2. Comparison of the number of trainable weights.

| Weight type | Standard Siamese model [12] | ESCNN & ESCNN-abs | ESCNN-tiny |
| --- | --- | --- | --- |
| Convolution | 144,540 | 612,540 | 62,820 |
| Convolution (bias) | 300 | 1,100 | 350 |
| Batch normalization | 800 | 2,400 | 900 |
| Fully connected layer | 100,200 | 350,200 | 105,200 |
| Fully connected layer (bias) | 102 | 102 | 102 |
| Total number of weights | 245,942 | 966,342 | 169,372 |

Table 3. Quantitative results for the ESCNN-tiny.

| Method | TPR (recall) | TNR (specificity) | PPV (precision) | F1 score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Standard Siamese model [12] | 0.979 | 0.849 | 0.866 | 0.919 | 0.914 |
| ESCNN-tiny | 0.985 | 0.955 | 0.957 | 0.971 | 0.970 |

1. Nowak, E, and Jurie, F (2007). Learning visual similarity measures for comparing never seen objects. 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8.
2. Chopra, S, Hadsell, R, and LeCun, Y (2005). Learning a similarity metric discriminatively, with application to face verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), pp. 539-546.
3. Jin, T, Morioka, K, and Hashimoto, H (2004). Appearance based object identification for mobile robot localization in intelligent space with distributed vision sensors. International Journal of Fuzzy Logic and Intelligent Systems. 4, 165-171.
4. Yamazaki, M, Li, D, Isshiki, T, and Kunieda, H (2015). SIFT-based algorithm for fingerprint authentication on smartphone. 2015 6th International Conference of Information and Communication Technology for Embedded Systems (IC-ICTES), pp. 1-5.
5. Chihaoui, T, Jlassi, H, Kachouri, R, Hamrouni, K, and Akil, M (2016). Personal verification system based on retina and SURF descriptors. 2016 13th International Multi-Conference on Systems, Signals & Devices (SSD), pp. 280-286.
6. Ferencz, A, Learned-Miller, E, and Malik, J (2005). Building a classification cascade for visual identification from one example. Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, pp. 286-293.
7. Choi, S-H, and Jung, SH (2020). Stable acquisition of fine-grained segments using batch normalization and focal loss with L1 regularization in U-Net structure. International Journal of Fuzzy Logic and Intelligent Systems. 20, 59-68.
8. Cho, S-M, and Choi, B-J (2020). CNN-based recognition algorithm for four classes of roads. International Journal of Fuzzy Logic and Intelligent Systems. 20, 114-118.
9. Han, X, Leung, T, Jia, Y, Sukthankar, R, and Berg, AC (2015). MatchNet: Unifying feature and metric learning for patch-based matching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3279-3286.
10. Sun, Y, Wang, X, and Tang, X (2014). Deep learning face representation from predicting 10,000 classes. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1891-1898.
11. Zagoruyko, S, and Komodakis, N (2015). Learning to compare image patches via convolutional neural networks. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4353-4361.
12. Koch, GR (2015). Siamese neural networks for one-shot image recognition.
13. Yang, X, Guo, H, Wang, N, Song, B, and Gao, X (2020). A novel symmetry driven Siamese network for THz concealed object verification. IEEE Transactions on Image Processing. 29, 5447-5456.
14. Azadani, MN, and Boukerche, A (2022). Siamese temporal convolutional networks for driver identification using driver steering behavior analysis. IEEE Transactions on Intelligent Transportation Systems, 1-12.
15. Hoffer, E, and Ailon, N (2015). Deep metric learning using triplet network. Similarity-Based Pattern Recognition, Feragen, A, Pelillo, M, and Loog, M, ed. Cham: Springer International Publishing, pp. 84-92.
16. Aguilera, CA, Sappa, AD, Aguilera, C, and Toledo, R (2017). Cross-spectral local descriptors via quadruplet network. Sensors. 17.
17. Simonyan, K, Vedaldi, A, and Zisserman, A (2014). Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 36, 1573-1585.
18. Fischer, P, Dosovitskiy, A, and Brox, T (2015). Descriptor matching with convolutional neural networks: a comparison to SIFT.
19. Keys, R (1981). Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing. 29, 1153-1160.
20. Li, Z, Chang, S, Liang, F, Huang, TS, Cao, L, and Smith, JR (2013). Learning locally-adaptive decision functions for person verification. 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3610-3617.
21. Wang, T, Gong, S, Zhu, X, and Wang, S (2016). Person re-identification by discriminative selection in video ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence. 38, 2501-2514.
22. Wang, T, Gong, S, Zhu, X, and Wang, S (2014). Person re-identification by video ranking. Computer Vision - ECCV 2014: Springer International Publishing, pp. 688-703.
23. Li, W, Zhao, R, Xiao, T, and Wang, X (2014). DeepReID: Deep filter pairing neural network for person re-identification. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 152-159.
24. Kingma, DP, and Ba, J (2017). Adam: A method for stochastic optimization.

Sangyun Lee received the B.S. and Ph.D. degrees in Electrical and Electronic Engineering from Yonsei University, Seoul, Korea, in 2011 and 2018, respectively. From 2018 to 2021, he was a Senior Researcher in Samsung Electronics Co., Ltd. Since 2021, he has been with the faculty of the School of Information Technology, Sungkonghoe University, Seoul, Korea, where he is currently an Assistant Professor. His current research interests include artificial intelligence, computer vision, and their various applications.

Sungjun Hong received the B.S. degree in Electrical and Electronic Engineering and Computer Science and the Ph.D. degree in Electrical and Electronic Engineering from Yonsei University, Seoul, Korea, in 2005 and 2012, respectively. Upon his graduation, he worked with LG Electronics, a connected car industry, as a Senior Researcher, from 2012 to 2013. He worked as a Lead Software Engineer with The Pinkfong Company, from 2013 to 2016. He was a Postdoctoral Researcher and a Research Professor with the School of Electrical and Electronic Engineering, Yonsei University, from 2016 to 2020, prior to his current appointment. He is currently an Assistant Professor with the School of Information Technology, Sungkonghoe University, Seoul, Korea. His research interests include machine learning, deep learning, computer vision, and their various applications. He received the IET Computer Vision Premium Award from the Institution of Engineering and Technology (IET), U.K., in 2015.

### Article

#### Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 339-349

Published online December 25, 2022 https://doi.org/10.5391/IJFIS.2022.22.4.339

## Extended Siamese Convolutional Neural Networks for Discriminative Feature Learning

Sangyun Lee and Sungjun Hong

School of Information Technology, Sungkonghoe University, Seoul, Korea

Correspondence to: Sungjun Hong (sjhong@skhu.ac.kr)

Received: November 9, 2021; Revised: June 30, 2022; Accepted: October 17, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

### Abstract

Siamese convolutional neural networks (SCNNs) have been considered among the best deep learning architectures for visual object verification. However, these models have the drawback that each branch extracts features independently without considering the other branch, which sometimes leads to unsatisfactory performance. In this study, we propose a new architecture called an extended SCNN (ESCNN) that addresses this limitation by learning both independent and relative features for a pair of images. ESCNNs also have a feature augmentation architecture that exploits the multi-level features of the underlying SCNN. The results of feature visualization showed that the proposed ESCNN can encode relative and discriminative information for the two input images at multi-level scales. Finally, we applied an ESCNN model to a person verification problem, and the experimental results indicate that the ESCNN achieved an accuracy of 97.7%, which outperformed an SCNN model with 91.4% accuracy. The results of ablation studies also showed that a small version of the ESCNN performed 5.6% better than an SCNN model.

Keywords: Discriminative feature, Feature augmentation, Object verification, Siamese convolutional neural network

### 1. Introduction

Given a pair of images, determining whether they show the same or different objects is an important problem in computer vision and image analysis [1]. This problem is called visual object verification, and it plays a key role in large-scale computer vision systems such as face verification [2] and object tracking [3]. Visual object verification differs from visual categorization and is especially challenging owing to issues such as changes in camera viewpoint, variations in illumination, and occlusion. For example, the visual appearance of a single pedestrian may vary owing to changes in perspective, pose, or lighting.

Over the past decade, handcrafted descriptors such as SIFT and SURF have been used in previous studies on visual object verification [4, 5]. Unfortunately, methods based on descriptors do not perform well for this problem owing to the lack of category-specific training [6]. Furthermore, the handcrafted descriptors are designed to be robust to small differences; thus, they are not well-suited for the visual verification problem, which requires even small differences between two images to be exploited as fully as possible. In contrast, convolutional neural networks (CNNs) have become popular for their ability to learn rich features and have demonstrated record-breaking performance in many computer vision tasks over the past few years, including semantic segmentation [7] and image classification [8]. CNNs have also been applied to the visual object verification problem using learned features in [9, 10].

The Siamese convolutional neural network (SCNN) proposed by LeCun et al. [2] has received considerable attention as a possible solution to the visual object verification problem [11-14]. Typical SCNNs learn not only the features but also a similarity metric for a given visual object verification problem. Their topology is shown in Figure 1. As shown in the figure, SCNNs have two parallel branches, one for each image, and the two branches share the same weights. Thus, if the two images provided to the SCNN are the same, the corresponding output feature maps should also be the same. At the end of the network, a distance function such as the Euclidean distance is used to provide the similarity (more precisely, distance) metric for the two given images. The output map from each branch can be viewed as a feature that discriminates the given image from that of the opposite branch. Following the success of SCNNs, several variants have been proposed to learn similarity metrics, such as the Triplet [15] and Quadruplet [16] networks.

However, SCNNs and their variants have the weakness that the feature map computed for each branch depends only on the image applied to that branch and is completely independent of the image applied to the opposite branch. More specifically, let us consider the following two cases. In the first case, the two images, A and A′, applied to an SCNN originate from the same object. In the second case, the two images A and B come from different objects. Because the feature map of each branch only uses the image applied to that branch, the output feature map for branch A is the same in both cases. Triplet [15] and Quadruplet [16] models have been developed to learn discriminative features with more sophisticated loss functions, but they still have the same weakness. For example, Triplets have three branches for three images A, A′ and B and their output features are completely independent of the other images. The independence of the output maps generated for the two branches reduces the effectiveness of discriminative feature learning. To the best of our knowledge, no definitive solutions to this problem have been reported.

To solve this problem, we propose a new SCNN variant called extended SCNN (ESCNN). The proposed ESCNN learns not only independent features for each image in a pair but also their relative features to improve its ability to discriminate between the two images. To learn the two types of features, ESCNNs include a feature augmentation architecture designed to exploit the multilevel features of the underlying conventional SCNN. The augmented features provide discriminatory information for two images that offer additional visual object verification cues. Furthermore, inspired by recent successful applications of fully connected layers (FCLs) in similarity metric learning [9,11,12], a fully connected layer is also used in the ESCNN to decide whether the input pairs are from the same or different objects.

The two main contributions of this study are summarized as follows.

• A new SCNN variant called an extended SCNN (ESCNN) is proposed for visual object verification. Compared with an SCNN, the proposed ESCNN exploits relative information between branches and extracts new discriminative features to improve its performance in learning a similarity metric.

• The performance of ESCNNs for visual object verification is analyzed from two perspectives. First, the feature maps of each ESCNN layer are visualized, and we show that the proposed ESCNN can encode relative and discriminative information for the two input images. Second, the proposed ESCNN is applied to the person verification problem, and we demonstrate that the ESCNN significantly outperforms previous methods using handcrafted descriptors (e.g., SIFT, SURF) and a conventional SCNN. Further, the results show that ESCNNs can achieve better performance with fewer weights than an SCNN requires.

The remainder of this study is organized as follows. In Section 2, we briefly review some related works and present a description of the problem considered. Section 3 introduces the proposed ESCNN and its training strategy. Section 4 discusses the experimental setup, including the dataset used, other models used for comparison, and the implementation details. The results are then described, and the advantages of the ESCNN are compared with those of the other models. Finally, Section 5 presents our conclusions.

### 2.1 Related Work

A typical approach to visual object verification is to extract features from a pair of images and compute a similarity metric between these features [6]. To extract the features, handcrafted feature descriptors, such as SIFT and SURF, have been used for the verification problem [4, 5]. However, handcrafted features often require expensive human labor and rely on expert knowledge. Without category-specific training, these methods do not perform well in visual object verification [6]. Recently, a method was proposed to learn these features [17]. However, further processing is generally required to compute the similarity metric using methods such as support-vector machines or Euclidean distance.

Thus, deep learning has attracted increasing attention in the field of computer vision, and several deep learning methods have been proposed. Naturally, similarity metric learning methods using deep learning have also been proposed [9, 11, 18]. In particular, the authors of a recent study [18] demonstrated that the features learned by a CNN trained on ImageNet outperformed SIFT. These advances have encouraged the use of CNNs in various practical applications, including face verification [2].

The most popular CNN architecture for similarity metric learning is the SCNN model proposed by LeCun et al. [2]. As shown in Figure 1, SCNNs have two branches that share the same architecture and weight values. In general, the branch outputs are passed to a distance function (e.g., squared Euclidean distance) to compute the similarity metric for the input pair [2]. However, [9, 11, 12] used multiple linear FCLs as an alternative to the distance function. In fact, the results of [11] show that this learned metric performs better than the Euclidean distance function. Thus, we used FCLs in the proposed network for similarity metric learning.

In this paper, we propose a structural enhancement of a conventional SCNN. Several variants inspired by the success of SCNNs were proposed in [15, 16]. In [15], a Triplet network was proposed that consisted of three branches for more efficient training. In addition, a Quadruplet network consisting of four branches was proposed for the local descriptor problem [16]. Although SCNN variants have demonstrated outstanding performance in various tasks, they have the inherent structural limitation that they extract features for each image independently without knowledge of the other image in the pair. Therefore, to address this issue, we propose an SCNN variant that extracts new discriminative features by exploiting the relative differences in input pairs.

### 2.2 Problem Statement

Visual object verification is a special case of object recognition in which the object category is known (e.g., pedestrians or cars), and one must determine whether a given pair of images represent the same object. We consider the visual object verification problem as a classification problem that takes a pair of images as input and outputs a classification result for the pair of images, either positive if the two images depict the same object or negative if the images show different objects. In other words, we developed an SCNN variant to solve the classification problem for a given dataset D as follows.

$D = \{(x_1^{(n)}, x_2^{(n)}, y^{(n)}) \mid n = 1, \cdots, N,\ y^{(n)} \in \{0, 1\}\}, \qquad (1)$

where $x_1^{(n)}$ and $x_2^{(n)}$ are the $n$th pair of images, $y^{(n)}$ is the class label representing whether the pair of images comes from the same object ($y^{(n)} = 1$) or not ($y^{(n)} = 0$), and $N$ is the number of image pairs. As discussed in the previous subsection, a typical SCNN has only two components, one for extracting discriminative features from each image in an input pair and the other for comparing these features and determining the class of the input pair. In this study, we propose an SCNN variant with a new extended structure to solve the classification problem and enhance performance. The proposed architecture and an associated training strategy are described in detail below.

### 3. Proposed Architecture

In this section, we describe the proposed deep neural network, called an extended Siamese convolutional neural network (ESCNN). The model is designed to learn discriminative features for pairs of images and provide cues to determine whether two given images depict the same object. The ESCNN architecture, as shown in Figure 2, consists of three parts: the Siamese, extension, and decision parts. These are explained in the following subsections.

### 3.1 Siamese Part

To extract discriminative features for a pair of input images, we used a typical Siamese architecture with two branches. This network model processes the two inputs separately using the same convolution weights. In the proposed approach, the Siamese part consists of L(= 5) layers, as shown in Figure 2(a). Each layer in the figure consists of a convolution layer with a fixed kernel size of three and a stride of one, batch normalization, a leaky rectified linear unit (l-ReLU), and max pooling with a filter size and a stride of two. In the last two layers, a convolutional operation without padding is applied to reduce the size of the subsequent layers.

RGB images of size 128 × 64 × 3 are applied to the first convolution layer of the Siamese part. The resulting feature maps are passed through a max-pooling layer that reduces the width and height of the feature maps by half before they are fed into the next convolution layer. At the end of the Siamese part, each input image is represented by a feature map of size 5 × 1 × 100. Finally, these feature maps are vectorized before being fed into the decision module. Clearly, weight sharing between the two branches in the Siamese part enables pairs of images showing the same object to produce similar feature maps because the two images look similar.
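The spatial sizes described above can be traced with a few lines of arithmetic. The following is a minimal sketch rather than the authors' implementation: it assumes a padding of 1 in the first three layers, no padding in the last two, and no pooling after the fifth layer. These settings are one combination that reproduces the stated 5 × 1 output from a 128 × 64 input; the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output length of max pooling along one spatial axis."""
    return (size - kernel) // stride + 1


def siamese_branch_shapes(height=128, width=64):
    """Trace the spatial size after each of the five layers of one branch."""
    # Assumed per-layer settings (not all stated in the text):
    # layers 1-3 pad by 1, layers 4-5 use no padding, layer 5 has no pooling.
    layers = [(1, True), (1, True), (1, True), (0, True), (0, False)]
    shapes = []
    h, w = height, width
    for padding, use_pool in layers:
        h, w = conv_out(h, padding=padding), conv_out(w, padding=padding)
        if use_pool:
            h, w = pool_out(h), pool_out(w)
        shapes.append((h, w))
    return shapes


shapes = siamese_branch_shapes()
print(shapes)  # [(64, 32), (32, 16), (16, 8), (7, 3), (5, 1)]
```

With 100 channels in the last layer, the final 5 × 1 map vectorizes to 500 values per branch.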

### 3.2 Extension Part

The Siamese part outputs two feature vectors $f_5$ and $g_5$ for each pair of input images, as shown in Figure 2, and the two feature vectors include cues as to whether the two images show the same object. Unfortunately, however, the two feature vectors are computed independently of each other without considering the other image; thus, the cues provided by the two feature vectors do not suffice to determine whether the two images depict the same object. To solve this problem, we propose a feature-augmentation architecture. Let $f_\ell$ and $g_\ell$ represent the output feature maps of the $\ell$th layer ($\ell \in \{1, 2, 3, 4, 5\}$) for the two branches of the Siamese part, as shown in Figure 2. As shown in Figure 2(b), the extra layers added below the Siamese part first compute the difference map $h_\ell = f_\ell - g_\ell$ between the feature maps $f_\ell$ and $g_\ell$ and then pass the augmented difference map through the subsequent layers until the resulting map is reduced to 5 × 1 × 100, which is the same size as the output of the Siamese part. These additional feature maps are also vectorized before being fed into the decision module. The extra layers are used throughout the Siamese part and provide multi-level spatial discriminative information about the two images, as shown in Figure 2(b).
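As a small illustration of the first step of the extension, the sketch below computes a signed element-wise difference between two feature maps and vectorizes the result. It is a toy example on nested lists with hypothetical function names; the convolutional layers that the extension applies after the difference are omitted.

```python
def difference_map(f, g):
    """Element-wise signed difference of two h x w x d feature maps (nested lists)."""
    return [
        [[fv - gv for fv, gv in zip(f_pixel, g_pixel)]
         for f_pixel, g_pixel in zip(f_row, g_row)]
        for f_row, g_row in zip(f, g)
    ]


def vectorize(feature_map):
    """Flatten an h x w x d feature map into a single list of values."""
    return [v for row in feature_map for pixel in row for v in pixel]


# Toy 2 x 1 x 2 feature maps standing in for the two branch outputs of one layer.
f = [[[1.0, 2.0]], [[0.5, 0.0]]]
g = [[[1.0, 0.5]], [[1.5, 0.0]]]

h = difference_map(f, g)
print(vectorize(h))                     # [0.0, 1.5, -1.0, 0.0]
print(vectorize(difference_map(f, f)))  # identical inputs give all zeros
```

Identical inputs yield an all-zero difference map, which matches the low activations reported for positive pairs.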

Figure 3 shows a visualization of the feature responses obtained at each layer of the proposed network except for the 5th layer ($\ell = 5$). Each feature response of size h × w × d is visualized by taking the average of the feature maps of size h × w × 1 and upsampling to 128 × 64 × 1 using bilinear interpolation [19].

Figure 3(a) shows a positive pair and the corresponding feature responses, from which it may be observed that the activation pairs $f_\ell$ and $g_\ell$ appear similar because the pair is positive. The activations $h_\ell$ introduced by the extension are generally low, which means that there is little difference between the two given images. In contrast, a negative pair and the corresponding feature responses are shown in Figure 3(b). Because the input pairs appear significantly different, the activations $f_1$ and $g_1$ also look different. However, as the features pass through the subsequent layers, the differences gradually disappear, and the features $f_4$ and $g_4$ in the last layer show no significant differences for the upper body. In this case, the extension can remedy the problem by providing discriminative features for a pair of images. In Figure 3(b), the activation of the difference map $h_1$ is high for the upper body, indicating that these parts of the two people are quite different. As the features pass through the subsequent layers from $h_1$ to $h_4$, stronger activations are observed in the feature maps. Furthermore, the extension exploits multi-level feature differences $h_1$, $h_2$, $h_3$, and $h_4$, which enables it to capture discriminative information for the two images at multiple scales. For example, the features in $h_1$ focus on large differences, such as between the upper bodies, whereas the other features in $h_2$, $h_3$, and $h_4$ represent small-scale differences, such as between the heads. It may also be observed that the difference features $h_\ell$ in Figure 3(b) are more strongly activated than those in Figure 3(a), implying that the extension can effectively extract discriminative information from the feature maps of the Siamese part.

### 3.3 Decision Part

The decision part is formed by adding two FCLs to the previous parts, as shown in Figure 2(c). This part determines whether the two given input images represent the same object using the feature vector fc1 from the Siamese and extension parts. The feature vector fc2 from the first FCL is passed through batch normalization and an l-ReLU nonlinear activation function, then fed into the second FCL and finally a softmax function, producing the output

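$\hat{y} = \begin{pmatrix} \hat{y}_0 \\ \hat{y}_1 \end{pmatrix} = \begin{pmatrix} \dfrac{\exp(\tilde{y}_0)}{\sum_{j=0}^{1} \exp(\tilde{y}_j)} \\ \dfrac{\exp(\tilde{y}_1)}{\sum_{j=0}^{1} \exp(\tilde{y}_j)} \end{pmatrix}, \qquad (2)$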

where $\tilde{y} = (\tilde{y}_0\ \tilde{y}_1)^T$ is the output of the second FCL and $\hat{y} = (\hat{y}_0\ \hat{y}_1)^T$ is the final output of the decision part. The outputs $\hat{y}_0$ and $\hat{y}_1$ represent the probabilities that the two given images originate from the same object or from different objects, respectively.
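The two-class softmax can be written in a few lines. This is a minimal sketch with a hypothetical function name; subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import math


def softmax2(logits):
    """Softmax over the two outputs of the second FCL."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax2([2.0, -1.0])
print(probs)                 # the two entries are positive and sum to 1
print(softmax2([0.0, 0.0]))  # [0.5, 0.5]
```

The larger logit always receives the larger probability, so thresholding the first output at 0.5 is equivalent to comparing the two logits.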

### 3.4 Network Training

First, a margin-based contrastive loss function was employed [6] to optimize the Siamese part of the proposed ESCNN. This loss function is intended to encourage positive pairs to move closer together while pushing negative pairs sufficiently far apart in the feature space. The contrastive loss function is defined as

$L_{cont}(x_1, x_2, y) = \frac{1}{2} y D^2 + \frac{1}{2} (1 - y) \{\max(0, m - D)\}^2, \qquad (3)$

where $y$ is a binary label denoting whether the input pair is positive ($y = 1$) or negative ($y = 0$), $D = \|f_5 - g_5\|_2$ is the Euclidean distance between the final feature vectors from the Siamese part, and $m > 0$ is the allowed margin for the negative pairs. Negative pairs whose distance $D$ is smaller than $m$ are penalized by (3).
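The behavior of the contrastive loss can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the margin value of 1.0 is an assumption, since the text does not state the margin used.

```python
import math


def euclidean_distance(f, g):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def contrastive_loss(f, g, y, margin=1.0):
    """Margin-based contrastive loss; y = 1 for a positive pair, 0 for a negative pair."""
    d = euclidean_distance(f, g)
    # margin=1.0 is an assumed value, not taken from the paper.
    return 0.5 * y * d ** 2 + 0.5 * (1 - y) * max(0.0, margin - d) ** 2


print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1))  # 12.5, distant positive pair is penalized
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))  # 0.0, negative pair beyond the margin
print(contrastive_loss([0.0, 0.0], [0.5, 0.0], y=0))  # 0.125, negative pair inside the margin
```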

A cross-entropy loss function is used to train the extension and decision parts, which is defined as

$L_{cross}(y, \hat{y}) = -(y \log \hat{y}_1 + (1 - y) \log \hat{y}_2), \qquad (4)$

where $\hat{y} = [\hat{y}_1\ \hat{y}_2]^T$ denotes the predicted posterior probability (2) returned by the decision part, that is, $\hat{y}_1$ and $\hat{y}_2$ here are the same-object and different-object probabilities written as $\hat{y}_0$ and $\hat{y}_1$ in (2). While the contrastive loss (3) only affects the Siamese part, the cross-entropy loss (4) affects all parts, including the extension and decision parts. For end-to-end training, the optimization objective function is designed to include both the loss functions (3) and (4) as follows.

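$\min_{W} \frac{1}{N} \sum_{n=1}^{N} \left\{ \lambda L_{cont}(x_1^{(n)}, x_2^{(n)}, y^{(n)}; W) + (1 - \lambda) L_{cross}(y^{(n)}, \hat{y}(x_1^{(n)}, x_2^{(n)}); W) \right\}, \qquad (5)$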

where $W$ denotes the trainable weights of the network, $N$ denotes the mini-batch size, and $\lambda$ is the balancing parameter between the two loss functions. The overall training strategy is illustrated in Figure 4.
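To make the combined objective concrete, the following sketch averages the weighted sum of the two losses over a mini-batch. It operates on precomputed distances and predicted same-object probabilities rather than on a network, and the values of `lam` and `margin` are assumptions because the text does not report them; all names are hypothetical.

```python
import math


def contrastive_loss(d, y, margin):
    """Contrastive loss for a pair with feature distance d and label y."""
    return 0.5 * y * d ** 2 + 0.5 * (1 - y) * max(0.0, margin - d) ** 2


def cross_entropy_loss(y, p_same, eps=1e-12):
    """Two-class cross-entropy; the different-object probability is 1 - p_same."""
    p_same = min(max(p_same, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p_same) + (1 - y) * math.log(1.0 - p_same))


def batch_objective(batch, lam=0.5, margin=1.0):
    """Mean of lam * L_cont + (1 - lam) * L_cross over a mini-batch.

    batch: list of (distance, p_same, label) tuples.
    lam and margin are assumed values, not taken from the paper.
    """
    total = 0.0
    for d, p_same, y in batch:
        total += lam * contrastive_loss(d, y, margin)
        total += (1.0 - lam) * cross_entropy_loss(y, p_same)
    return total / len(batch)


batch = [(0.2, 0.9, 1), (1.5, 0.1, 0), (0.4, 0.6, 0)]
print(batch_objective(batch))
```

Setting `lam=1.0` recovers pure contrastive training of the Siamese part, while `lam=0.0` trains with the cross-entropy term alone.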

### 4.1 iLIDS–VID Dataset

In this section, the proposed ESCNN is applied to the pedestrian verification problem defined by [20]. This is a challenging problem because different people may wear similar clothing, illumination and viewpoints can vary between frames, and pedestrians are likely to be occluded by other objects. The iLIDS–VID dataset [21, 22] was used. This dataset was developed for person re-identification and was created by observing pedestrians from two disjoint camera angles in a public open space. It comprises 600 image sequences from 300 distinct individuals. Each image sequence has a different length, ranging from 23 to 192 image frames, with an average of 73. Examples from the iLIDS–VID dataset are shown in Figure 5.

In our experiments, 200 and 50 people were used for training and validation, respectively, and the remaining 50 people were used for testing. To generate a positive pair, two images of the same person were randomly selected from an image sequence. To generate a negative pair, two images of different people were randomly selected from two different image sequences. By repeating this procedure, we generated 20 positive and 20 negative pairs for each person, yielding 8000 pairs for training. In addition, we doubled the size of the training dataset by horizontally reflecting all image pairs. Validation and testing data were generated and augmented in the same manner.
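The pair-generation procedure can be sketched as follows. This is a simplified illustration, not the authors' code: it uses one toy sequence per person with placeholder frame identifiers, anchors each negative pair on the current person, and marks horizontal flips with a flag instead of transforming pixels. The function names and the seed are hypothetical.

```python
import random


def make_pairs(sequences, pairs_per_person=20, seed=0):
    """Build (image_a, image_b, label) tuples; label 1 = same person, 0 = different."""
    rng = random.Random(seed)
    people = sorted(sequences)
    pairs = []
    for person in people:
        others = [p for p in people if p != person]
        for _ in range(pairs_per_person):
            # Positive pair: two distinct frames of the same person.
            a, b = rng.sample(sequences[person], 2)
            pairs.append((a, b, 1))
        for _ in range(pairs_per_person):
            # Negative pair: one frame of this person, one of another person.
            other = rng.choice(others)
            pairs.append((rng.choice(sequences[person]),
                          rng.choice(sequences[other]), 0))
    return pairs


def add_horizontal_flips(pairs):
    """Double the data set; the last field marks a horizontally reflected pair."""
    return ([(a, b, y, False) for a, b, y in pairs]
            + [(a, b, y, True) for a, b, y in pairs])


# Toy stand-in for the 200 training identities: frames are (person_id, frame_index).
sequences = {pid: [(pid, t) for t in range(23)] for pid in range(200)}
pairs = make_pairs(sequences)
augmented = add_horizontal_flips(pairs)
print(len(pairs), len(augmented))  # 8000 16000
```

The counts follow directly from the description: 200 people × (20 + 20) pairs gives 8000, and reflection doubles this to 16,000.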

### 4.2 Experimental Results

In this subsection, we describe the results of an experimental evaluation in which the proposed ESCNN was applied to the pedestrian verification problem and its performance compared with that of previous methods, including the standard Siamese method [12] and two handcrafted descriptor methods [4, 5]. Through a comparison with the standard Siamese model, the structural effectiveness of the ESCNN can be demonstrated because the standard Siamese model can be considered as a special case of the ESCNN without the extension. That is, the experiment was designed to verify the effectiveness of the extension part of the ESCNN without comparison with other works [16, 23] that focus on better loss functions. In descriptor-based methods [4, 5], the verification score was used to determine whether a pair of images represented the same object. In the experiment, the training dataset was augmented by horizontal reflection and divided into mini-batches of size 64. Both the SCNN and ESCNN deep learning methods were implemented on a machine with a single NVIDIA GeForce GTX Titan X GPU and an Intel Core i5-4670 CPU. The Adam optimizer [24] was used with a constant learning rate of 0.0001 to train the proposed ESCNN and other deep learning methods.

Figure 6 shows examples of the results produced by the methods compared. As shown in the first row of Figure 6(a), all the methods correctly identified positive pairs with clean backgrounds. However, the descriptor-based methods failed to cope with the more complicated pairs, as shown in the second and third rows of Figure 6(a), and it appears that the presence of key points from different backgrounds degraded the verification performance. The standard Siamese model succeeded in recognizing these examples but still had difficulty recognizing the most difficult examples, as shown in the last two rows of Figure 6(a). The ESCNN successfully recognized all examples despite a variety of background conditions.

Figure 6(b) shows the results of some negative examples. All competing methods gave correct answers for easy examples, such as that in the first row, where the two people had different hairstyles and wore different-looking clothes. The two pairs given in the second and third rows of Figure 6(b), however, were more difficult to distinguish because of their similar appearances; thus, the descriptor-based methods failed to cope with these harder negative examples. The standard Siamese model succeeded in distinguishing these two pairs but failed for the most difficult examples, such as those given in the last two rows of Figure 6(b). The ESCNN succeeded in distinguishing all five negative examples, including the last.

To provide a concrete comparison, the results of all methods compared are summarized in terms of the following five measures.

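• $\text{True positive rate (TPR)} = \dfrac{\text{number of true positives}}{\text{number of positive samples}}$

• $\text{True negative rate (TNR)} = \dfrac{\text{number of true negatives}}{\text{number of negative samples}}$

• $\text{Positive predictive value (PPV)} = \dfrac{\text{number of true positives}}{\text{number of positive predictions}}$

• $F_1\ \text{score} = \dfrac{2 \cdot \text{PPV} \cdot \text{TPR}}{\text{PPV} + \text{TPR}}$

• $\text{Accuracy} = \dfrac{\text{number of true predictions}}{\text{total number of samples}}$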
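The five measures above can be computed from the four confusion-matrix counts. The sketch below uses made-up counts purely for illustration; they are not results from the paper, and the function name is hypothetical.

```python
def verification_metrics(tp, fn, tn, fp):
    """Compute TPR, TNR, PPV, F1 score, and accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # recall: true positives over positive samples
    tnr = tn / (tn + fp)   # specificity: true negatives over negative samples
    ppv = tp / (tp + fp)   # precision: true positives over positive predictions
    f1 = 2 * ppv * tpr / (ppv + tpr)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "F1": f1, "Accuracy": accuracy}


# Illustrative counts only (100 positive and 100 negative test pairs).
metrics = verification_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A low TNR with a high TPR, as in this toy case, corresponds to a model that often accepts different people as the same person.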

The results of the quantitative comparison are listed in Table 1. Two different versions of the ESCNN were used for the comparison: the version explained in Section 3 and a version in which $h_\ell = |f_\ell - g_\ell|$ was used instead of $h_\ell = f_\ell - g_\ell$, where $\ell \in \{1, 2, 3, 4, 5\}$ denotes the layer number and $|\cdot|$ denotes the elementwise absolute value. The reason for considering this variant, denoted as ESCNN-abs, was to test the hypothesis that negative values can be beneficial in providing subsequent layers with more discriminative information.
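A toy example shows what information the absolute value discards. In the sketch below (hypothetical names and made-up feature values), two different input pairs produce the same absolute difference but distinct signed differences, so only the signed version lets later layers tell them apart.

```python
def signed_difference(f, g):
    """Signed element-wise difference, as in the ESCNN."""
    return [a - b for a, b in zip(f, g)]


def absolute_difference(f, g):
    """Element-wise absolute difference, as in the ESCNN-abs variant."""
    return [abs(a - b) for a, b in zip(f, g)]


# Pair A: the branches disagree in opposite directions on the two features.
f_a, g_a = [3.0, 1.0], [1.0, 3.0]
# Pair B: the first branch is stronger on both features.
f_b, g_b = [3.0, 3.0], [1.0, 1.0]

print(signed_difference(f_a, g_a), signed_difference(f_b, g_b))      # [2.0, -2.0] [2.0, 2.0]
print(absolute_difference(f_a, g_a), absolute_difference(f_b, g_b))  # [2.0, 2.0] [2.0, 2.0]
```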

From the results in Table 1, it may be noted that the classical descriptor-based matching methods exhibited a relatively low performance compared with the deep learning methods, which was not sufficient to address this challenging verification problem. The standard Siamese model is a deep learning method, but its performance was insufficient except in terms of TPR. This means that the standard Siamese model often mistook different people as being the same person, which is a critical fault for many practical applications such as object tracking and surveillance systems. The two ESCNN variants achieved excellent verification performance, providing significant improvements over competing methods. In particular, the TNR measure showed the largest improvement among all measures. This means that the extension part of the ESCNN contributed significantly to learning discriminative features for negative input pairs. Of the two proposed versions, ESCNN performed slightly better than ESCNN-abs in terms of all measures except TPR; thus, it can be concluded that $h_\ell = f_\ell - g_\ell$ was more effective than $h_\ell = |f_\ell - g_\ell|$ for discriminative feature learning for input image pairs. The performance of the proposed network can also be visualized using a receiver operating characteristic (ROC) curve, as shown in Figure 7. The ROC curves showed that the ESCNN variants achieved large improvements in performance compared to competing methods.

The experimental results show that the extension allows the ESCNN to learn more sophisticated discriminative features that provide effective cues for accurate verification. However, it should be noted that the features of the extension part can only be calculated when both images are provided. This implies that the proposed ESCNN cannot be used for the feature embedding of a single image.

### 4.3 Tiny ESCNN Model

A direct comparison of the proposed ESCNN with the standard Siamese model might be unfair because the proposed ESCNN uses more weights than the standard Siamese model owing to the extension. Thus, to provide a fair comparison, a small ESCNN was constructed with a number of weights less than or equal to that of the standard Siamese model. This variant is denoted as ESCNN-tiny, and its configuration is specified by the bracketed numbers in Figure 2. ESCNN-tiny was trained in the same manner as ESCNN. Table 2 presents a comparison of the number of weights used by different networks. Table 3 summarizes the performance of ESCNN-tiny and compares it to that of the standard Siamese model.

As shown in Tables 2 and 3, ESCNN-tiny used a smaller number of trainable weights but demonstrated better performance than the standard Siamese model. This clearly shows that the excellent performance of ESCNN was not simply the result of additional features, but also resulted from the structural superiority of the design. The ROC curve of ESCNN-tiny is also compared with those of the others in Figure 7.
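The weight totals in Table 2 can be sanity-checked with simple arithmetic. The sketch below counts the parameters of a plain Siamese model under assumptions that are not stated in the text: channel widths of 20, 40, 60, 80, and 100 for the five shared convolution layers, a 100-unit first FCL, two parameters per channel for batch normalization, and a two-class output. These values were chosen because they reproduce the 245,942 total that appears to correspond to the standard Siamese model; the actual configuration is given in Figure 2.

```python
def siamese_parameter_counts(channels=(20, 40, 60, 80, 100), in_channels=3,
                             kernel=3, feature_hw=(5, 1), hidden=100, classes=2):
    """Count weights of a weight-sharing Siamese model with two FCLs.

    All default values are assumptions inferred from the totals in Table 2.
    """
    conv_weights = 0
    prev = in_channels
    for c in channels:
        conv_weights += kernel * kernel * prev * c  # shared by both branches
        prev = c
    conv_bias = sum(channels)
    # Batch normalization: scale and shift per conv channel and per hidden unit.
    batch_norm = 2 * (sum(channels) + hidden)
    # The first FCL sees the two vectorized 5 x 1 x 100 branch outputs.
    fc_in = 2 * feature_hw[0] * feature_hw[1] * channels[-1]
    fc_weights = fc_in * hidden + hidden * classes
    fc_bias = hidden + classes
    counts = {
        "conv": conv_weights,
        "conv_bias": conv_bias,
        "batch_norm": batch_norm,
        "fc": fc_weights,
        "fc_bias": fc_bias,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = siamese_parameter_counts()
print(counts)
```

Under the same assumptions, the 350,200 FCL weights listed for the full ESCNN are consistent with a 3,500-dimensional input to the first FCL, that is, the two branch vectors plus five vectorized extension outputs of 500 values each.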

### 5. Conclusions

In this study, we proposed a new discriminative feature learning method called ESCNN for the visual object verification problem. By exploiting the differences between the images applied to the two branches, the proposed ESCNN learns independent, relative, and discriminative features for the image pairs. The properties of the discriminative features learned by the ESCNN were demonstrated by feature visualization, and the performance of the proposed ESCNN was compared with that of previous methods in terms of recall, specificity, precision, F1 score, and accuracy. The ESCNN improved significantly on the competing methods, including the SCNN, and this remained true even when it used fewer weights.

### Fig 1.

Architecture of a conventional SCNN. The network is trained by contrastive loss in the training stage, whereas a distance function is used to compute the similarity metric in the testing stage.

The International Journal of Fuzzy Logic and Intelligent Systems 2022; 22: 339-349. https://doi.org/10.5391/IJFIS.2022.22.4.339

### Fig 2.

The proposed ESCNN architecture, which consists of three parts: (a) Siamese, (b) extension, and (c) decision parts. The feature dimensions are denoted as h × w × d, and the bracketed numbers correspond to ESCNN-tiny, introduced in Section 4.3.

### Fig 3.

Visualization of the features learned by the ESCNN: (a) positive and (b) negative samples.

### Fig 4.

Training strategy of the proposed network. The network is optimized by a combination of two loss functions: 1) contrastive loss for the Siamese part and 2) cross-entropy loss for all parts, including the extension and decision parts.
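The combination of the two losses can be sketched for a single image pair as follows. This is only an illustration under stated assumptions: the contrastive term uses the common margin-based form, the cross-entropy term is written as binary cross-entropy on a probability that the pair shows the same person, and the weighting factor `ce_weight`, the `margin` default, and all function names are hypothetical rather than taken from the paper.

```python
import math


def contrastive_loss(distance, same, margin=1.0):
    """Margin-based contrastive loss for one pair (assumed form).

    same=1 penalizes any distance between the two embeddings;
    same=0 penalizes distances that fall inside the margin.
    """
    if same:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2


def cross_entropy_loss(p_same, same, eps=1e-12):
    """Binary cross-entropy on the predicted probability of 'same person'."""
    p = min(max(p_same, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if same else -math.log(1.0 - p)


def total_loss(distance, p_same, same, margin=1.0, ce_weight=1.0):
    """Sum of both terms; ce_weight is a hypothetical balancing factor."""
    return (contrastive_loss(distance, same, margin)
            + ce_weight * cross_entropy_loss(p_same, same))


# A negative pair whose embeddings are too close and whose decision is unsure.
print(total_loss(distance=0.5, p_same=0.5, same=0))
```

In this sketch the contrastive term depends only on the embedding distance produced by the Siamese part, whereas the cross-entropy term depends on the final decision output and would therefore propagate gradients through every part of the network.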

### Fig 5.

Examples from the iLIDS-VID dataset.

### Fig 6.

Some example results: (a) positive and (b) negative samples.

### Fig 7.

ROC curves for the methods under consideration.
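For readers unfamiliar with how such a curve is obtained, the sketch below builds ROC points by sweeping a decision threshold over per-pair scores and integrates them with the trapezoidal rule. The scores and labels are made-up toy values, a higher score is assumed to mean "more likely the same person", and the function names are hypothetical; the sketch does not reproduce the curves in the figure.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept every pair that shares this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores and labels (1 = same person, 0 = different people).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve, auc)
```

A curve that rises toward the top-left corner, with an area close to 1, indicates that positive pairs are consistently scored above negative pairs.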

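Table 1. Quantitative results.

| Method | TPR (recall) | TNR (specificity) | PPV (precision) | F1 score | Accuracy |
|---|---|---|---|---|---|
| SIFT [4] | 0.886 | 0.890 | 0.889 | 0.888 | 0.888 |
| SURF [5] | 0.748 | 0.751 | 0.750 | 0.750 | 0.750 |
| Standard Siamese model [12] | 0.979 | 0.849 | 0.866 | 0.919 | 0.914 |
| ESCNN-abs | 0.994 | 0.905 | 0.913 | 0.952 | 0.949 |
| ESCNN | 0.983 | 0.971 | 0.971 | 0.977 | 0.977 |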

Table 2. Comparison of the number of trainable weights.

| Weight type | Standard Siamese model [12] | ESCNN & ESCNN-abs | ESCNN-tiny |
|---|---|---|---|
| Convolution | 144,540 | 612,540 | 62,820 |
| Convolution (bias) | 300 | 1,100 | 350 |
| Batch normalization | 800 | 2,400 | 900 |
| Fully connected layer | 100,200 | 350,200 | 105,200 |
| Fully connected layer (bias) | 102 | 102 | 102 |
| Total number of weights | 245,942 | 966,342 | 169,372 |

Table 3. Quantitative results for ESCNN-tiny.

| Method | TPR (recall) | TNR (specificity) | PPV (precision) | F1 score | Accuracy |
|---|---|---|---|---|---|
| Standard Siamese model [12] | 0.979 | 0.849 | 0.866 | 0.919 | 0.914 |
| ESCNN-tiny | 0.985 | 0.955 | 0.957 | 0.971 | 0.970 |

### References

1. Nowak, E, and Jurie, F. Learning visual similarity measures for comparing never seen objects. 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.
2. Chopra, S, Hadsell, R, and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 2005, pp. 539-546.
3. Jin, T, Morioka, K, and Hashimoto, H (2004). Appearance based object identification for mobile robot localization in intelligent space with distributed vision sensors. International Journal of Fuzzy Logic and Intelligent Systems. 4, 165-171.
4. Yamazaki, M, Li, D, Isshiki, T, and Kunieda, H. SIFT-based algorithm for fingerprint authentication on smartphone. 2015 6th International Conference of Information and Communication Technology for Embedded Systems (IC-ICTES), 2015, pp. 1-5.
5. Chihaoui, T, Jlassi, H, Kachouri, R, Hamrouni, K, and Akil, M. Personal verification system based on retina and SURF descriptors. 2016 13th International Multi-Conference on Systems, Signals & Devices (SSD), 2016, pp. 280-286.
6. Ferencz, A, Learned-Miller, E, and Malik, J. Building a classification cascade for visual identification from one example. Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, 2005, pp. 286-293.
7. Choi, S-H, and Jung, SH (2020). Stable acquisition of fine-grained segments using batch normalization and focal loss with L1 regularization in U-Net structure. International Journal of Fuzzy Logic and Intelligent Systems. 20, 59-68.
8. Cho, S-M, and Choi, B-J (2020). CNN-based recognition algorithm for four classes of roads. International Journal of Fuzzy Logic and Intelligent Systems. 20, 114-118.
9. Han, X, Leung, T, Jia, Y, Sukthankar, R, and Berg, AC. MatchNet: Unifying feature and metric learning for patch-based matching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3279-3286.
10. Sun, Y, Wang, X, and Tang, X. Deep learning face representation from predicting 10,000 classes. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1891-1898.
11. Zagoruyko, S, and Komodakis, N. Learning to compare image patches via convolutional neural networks. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4353-4361.
12. Koch, GR (2015). Siamese neural networks for one-shot image recognition.
13. Yang, X, Guo, H, Wang, N, Song, B, and Gao, X (2020). A novel symmetry driven Siamese network for THz concealed object verification. IEEE Transactions on Image Processing. 29, 5447-5456.
14. Azadani, MN, and Boukerche, A (2022). Siamese temporal convolutional networks for driver identification using driver steering behavior analysis. IEEE Transactions on Intelligent Transportation Systems, 1-12.
15. Hoffer, E, and Ailon, N (2015). Deep metric learning using triplet network. Similarity-Based Pattern Recognition, Feragen, A, Pelillo, M, and Loog, M, ed. Cham: Springer International Publishing, pp. 84-92.
16. Aguilera, CA, Sappa, AD, Aguilera, C, and Toledo, R (2017). Cross-spectral local descriptors via quadruplet network. Sensors. 17.
17. Simonyan, K, Vedaldi, A, and Zisserman, A (2014). Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 36, 1573-1585.
18. Fischer, P, Dosovitskiy, A, and Brox, T (2015). Descriptor matching with convolutional neural networks: a comparison to SIFT.
19. Keys, R (1981). Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing. 29, 1153-1160.
20. Li, Z, Chang, S, Liang, F, Huang, TS, Cao, L, and Smith, JR. Learning locally-adaptive decision functions for person verification. 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3610-3617.
21. Wang, T, Gong, S, Zhu, X, and Wang, S (2016). Person re-identification by discriminative selection in video ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence. 38, 2501-2514.
22. Wang, T, Gong, S, Zhu, X, and Wang, S (2014). Person re-identification by video ranking. Computer Vision - ECCV 2014: Springer International Publishing, pp. 688-703.
23. Li, W, Zhao, R, Xiao, T, and Wang, X. DeepReID: Deep filter pairing neural network for person re-identification. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 152-159.
24. Kingma, DP, and Ba, J (2017). Adam: A method for stochastic optimization.