International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 339-349
Published online December 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.4.339
© The Korean Institute of Intelligent Systems
School of Information Technology, Sungkonghoe University, Seoul, Korea
Correspondence to: Sungjun Hong (sjhong@skhu.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Siamese convolutional neural networks (SCNNs) have been considered among the best deep learning architectures for visual object verification. However, these models have the drawback that each branch extracts features independently, without considering the other branch, which sometimes leads to unsatisfactory performance. In this study, we propose a new architecture called an extended SCNN (ESCNN) that addresses this limitation by learning both independent and relative features for a pair of images. ESCNNs also have a feature augmentation architecture that exploits the multi-level features of the underlying SCNN. The results of feature visualization showed that the proposed ESCNN can encode relative and discriminative information for the two input images at multi-level scales. Finally, we applied an ESCNN model to a person verification problem, and the experimental results indicate that the ESCNN achieved an accuracy of 97.7%, outperforming an SCNN model with 91.4% accuracy. The results of ablation studies also showed that a small version of the ESCNN performed 5.6% better than an SCNN model.
Keywords: Discriminative feature, Feature augmentation, Object verification, Siamese convolutional neural network
Given a pair of images, determining whether they show the same or different objects is an important problem in computer vision and image analysis [1]. This problem is called visual object verification, and it plays a key role in large-scale computer vision systems such as face verification [2] and object tracking [3]. Visual object verification differs from visual categorization and is especially challenging owing to issues such as changes in camera viewpoint, variations in illumination, and occlusion. For example, the visual appearance of a single pedestrian may vary owing to changes in perspective, pose, or lighting.
Over the past decade, handcrafted descriptors such as SIFT and SURF have been used in previous studies on visual object verification [4, 5]. Unfortunately, methods based on descriptors do not perform well for this problem owing to the lack of category-specific training [6]. Furthermore, the handcrafted descriptors are designed to be robust to small differences; thus, they are not well-suited for the visual verification problem, which requires even small differences between two images to be exploited as fully as possible. In contrast, convolutional neural networks (CNNs) have become popular for their ability to learn rich features and have demonstrated record-breaking performance in many computer vision tasks over the past few years, including semantic segmentation [7] and image classification [8]. CNNs have also been applied to the visual object verification problem using learned features in [9, 10].
The Siamese convolutional neural network (SCNN) proposed by LeCun et al. [2] has received considerable attention as a possible solution to the visual object verification problem [11–14]. Typical SCNNs learn not only the features but also a similarity metric for a given visual object verification problem. Their topology is shown in Figure 1. As shown in the figure, SCNNs have two parallel branches, one for each image, and the two branches share the same weights. Thus, if the two images provided to the SCNN are the same, the corresponding output feature maps should also be the same. At the end of the network, a distance function such as the Euclidean distance is used to provide the similarity (more precisely, distance) metric for the two given images. The output map from each branch can be viewed as a feature that discriminates the given image from that of the opposite branch. Following the success of SCNNs, several variants have been proposed to learn similarity metrics, such as the Triplet [15] and Quadruplet [16] networks.
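The weight-sharing property described above can be sketched in a few lines. The linear map below is a hypothetical stand-in for the convolutional branches, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared weights: one matrix used by BOTH branches,
# standing in for the convolutional stack of a real SCNN.
W = rng.standard_normal((8, 16))

def branch(x):
    """Embed an input with the shared weights (ReLU nonlinearity)."""
    return np.maximum(W @ x, 0.0)

def siamese_distance(x1, x2):
    """Euclidean distance between the two branch outputs."""
    return float(np.linalg.norm(branch(x1) - branch(x2)))

x = rng.standard_normal(16)
# Identical inputs through shared weights give identical feature
# maps, hence zero distance -- the invariant weight sharing enforces.
print(siamese_distance(x, x))  # 0.0
```

Because both branches apply the same `W`, the distance is exactly zero for identical inputs, which is the property the text relies on.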
However, SCNNs and their variants have the weakness that the feature map computed for each branch depends only on the image applied to that branch and is completely independent of the image applied to the opposite branch. More specifically, let us consider the following two cases. In the first case, the two images,
To solve this problem, we propose a new SCNN variant called extended SCNN (ESCNN). The proposed ESCNN learns not only independent features for each image in a pair but also their relative features to improve its ability to discriminate between the two images. To learn the two types of features, ESCNNs include a feature augmentation architecture designed to exploit the multilevel features of the underlying conventional SCNN. The augmented features provide discriminatory information for two images that offer additional visual object verification cues. Furthermore, inspired by recent successful applications of fully connected layers (FCLs) in similarity metric learning [9,11,12], a fully connected layer is also used in the ESCNN to decide whether the input pairs are from the same or different objects.
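The feature augmentation idea can be illustrated as follows. The paper's exact augmentation operator is given in Figure 2 and is not reproduced here; the elementwise absolute difference below is an assumption suggested by the name of the ESCNN-abs variant discussed later:

```python
import numpy as np

def augment_features(f1, f2):
    """Concatenate each branch's independent features with a relative
    feature computed from both branches. The absolute elementwise
    difference is one plausible relative feature; the paper's exact
    operator may differ."""
    rel = np.abs(f1 - f2)
    return np.concatenate([f1, rel]), np.concatenate([f2, rel])

f1 = np.array([0.2, 0.9, 0.0])
f2 = np.array([0.2, 0.1, 0.5])
a1, a2 = augment_features(f1, f2)
print(a1)  # [0.2 0.9 0.  0.  0.8 0.5]
```

The augmented vector carries both the branch's own features and information about how the pair differs, which a vanilla SCNN branch cannot see.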
The two main contributions of this study are summarized as follows.
A new SCNN variant called an extended SCNN (ESCNN) is proposed for visual object verification. Compared with an SCNN, the proposed ESCNN exploits relative information between branches and extracts new discriminative features to improve its performance in learning a similarity metric.
The performance of ESCNNs for visual object verification is analyzed from two perspectives. First, the feature maps of each ESCNN layer are visualized and we show that the proposed ESCNN can encode relative and discriminative information for the two input images. Second, the proposed ESCNN is applied to the person verification problem and we demonstrate that the ESCNN significantly outperformed previous methods using handcrafted descriptors (e.g., SIFT, SURF) and a conventional SCNN. Further, the results show that ESCNNs can achieve better performance with fewer weights than an SCNN would require.
The remainder of this study is organized as follows. In Section 2, we briefly review some related works and present a description of the problem considered. Section 3 introduces the proposed ESCNN and its training strategy. Section 4 discusses the experimental setup, including the dataset used, other models used for comparison, and the implementation details. The results are then described, and the advantages of the ESCNN are compared with those of the other models. Finally, Section 5 presents our conclusions.
A typical approach to visual object verification is to extract features from a pair of images and compute a similarity metric between these features [6]. To extract the features, handcrafted feature descriptors such as SIFT and SURF have been used for the verification problem [4, 5]. However, handcrafted features often require expensive human labor and rely on expert knowledge. Without category-specific training, these methods do not perform well in visual object verification [6]. Recently, a method was proposed to learn these features [17]. However, further processing is generally required to compute the similarity metric using methods such as support-vector machines or the Euclidean distance.
In recent years, deep learning has attracted increasing attention in the field of computer vision, and several deep learning methods have been proposed. Naturally, similarity metric learning methods using deep learning have also been proposed [9, 11, 18]. In particular, the authors of a recent study [18] demonstrated that the features learned by a CNN trained on ImageNet outperformed SIFT. These advances have encouraged the use of CNNs in various practical applications, including face verification [2].
The most popular CNN architecture for similarity metric learning is the SCNN model proposed by LeCun et al. [2]. As shown in Figure 1, SCNNs have two branches that share the same architecture and weight values. In general, branch outputs are passed to a distance function (e.g., the squared Euclidean distance) to compute the similarity metric for the input pair [2]. However, [9, 11, 12] used multiple linear FCLs as an alternative to the distance function. In fact, the results of [11] show that this learned metric performs better than the Euclidean distance function. Thus, we used FCLs in the proposed network for similarity metric learning.
In this paper, we propose a structural enhancement of a conventional SCNN. Several variants inspired by the success of SCNNs were proposed in [15, 16]. In [15], a Triplet network was proposed that consists of three branches for more efficient training. In addition, a Quadruplet network consisting of four branches was proposed for local descriptor learning problems [16]. Although SCNN variants have demonstrated outstanding performance in various tasks, they have the inherent structural limitation that they extract features for each image independently, without knowledge of the other image in the pair. Therefore, to address this issue, we propose an SCNN variant that extracts new discriminative features by exploiting the relative differences in input pairs.
Visual object verification is a special case of object recognition in which the object category is known (e.g., pedestrians or cars), and one must determine whether a given pair of images represent the same object. We consider the visual object verification problem as a classification problem that takes a pair of images as input and outputs a classification result for the pair of images, either positive if the two images depict the same object or negative if the images show different objects. In other words, we developed an SCNN variant to solve the classification problem for a given dataset
where
In this section, we describe the proposed deep neural network, called an extended Siamese convolutional neural network (ESCNN). The model is designed to learn discriminative features for pairs of images and provide cues to determine whether two given images depict the same object. The ESCNN architecture, as shown in Figure 2, consists of three parts: the Siamese, extension, and decision parts. Each part is explained in the following subsections.
To extract discriminative features for a pair of input images, we used a typical Siamese architecture with two branches. This network model processes the two inputs separately using the same convolution weights. In the proposed approach, the Siamese part consists of
RGB images of size 128
The Siamese part outputs two feature vectors
Figure 3 shows a visualization of the feature responses obtained at each layer of the proposed network except for the 5th layer (
Figure 3(a) shows a positive pair and the corresponding feature responses, from which it may be observed that the activation pairs
The decision part is formed by adding two FCLs to the previous parts, as shown in Figure 2(c). This part determines whether the two given input images represent the same object using the feature vector
where
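A minimal sketch of the decision part follows. The weights are random placeholders (untrained) and the dimensions are hypothetical, chosen only to show the two-FCL structure:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two fully connected layers; random, untrained placeholder weights
# with hypothetical dimensions.
W1, b1 = rng.standard_normal((4, 6)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)

def decide(f):
    """Map a (augmented) feature vector to a same-object probability."""
    h = np.maximum(W1 @ f + b1, 0.0)             # FCL 1 + ReLU
    return float(sigmoid((W2 @ h + b2).item()))  # FCL 2 + sigmoid

p = decide(np.ones(6))
assert 0.0 < p < 1.0  # a probability; thresholding it gives the verdict
```

Thresholding the output probability (e.g., at 0.5) yields the positive/negative classification described in the text.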
First, a margin-based contrastive loss function was employed [6] to optimize the Siamese part of the proposed ESCNN. This loss function is intended to encourage positive pairs to move closer together while pushing negative pairs sufficiently far apart in the feature space. The contrastive loss function is defined as
where
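The symbols of the loss equation were not preserved in this copy; the standard margin-based contrastive form (label 1 for a positive pair, 0 for a negative pair, with distance `d` and margin `m`) can be sketched as:

```python
def contrastive_loss(d, y, margin=1.0):
    """Margin-based contrastive loss: positive pairs (y = 1) are
    penalized by their squared distance; negative pairs (y = 0) are
    penalized only when they fall inside the margin."""
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2

# A positive pair at distance 0.5 is pulled closer...
print(contrastive_loss(0.5, 1))  # 0.25
# ...a negative pair inside the margin is pushed apart...
print(contrastive_loss(0.5, 0))  # 0.25
# ...and a negative pair beyond the margin incurs no loss.
print(contrastive_loss(2.0, 0))  # 0.0
```

This matches the stated intent: positive pairs move closer together while negative pairs are pushed sufficiently far apart in the feature space.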
A cross-entropy loss function is used to train the extension and decision parts, which is defined as
where
where
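For reference, the binary cross-entropy term used for the extension and decision parts has the standard form below (the paper's exact symbols and any weighting between the two losses were not preserved in this copy):

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident, correct predictions incur less loss than uncertain ones.
print(cross_entropy(0.9, 1) < cross_entropy(0.5, 1))  # True
```

The overall objective combines this term with the contrastive loss on the Siamese part, as described in the training strategy (Figure 4).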
In this section, the proposed ESCNN is applied to the pedestrian verification problem defined by [20]. This is a challenging problem because different people may wear similar clothing, illumination and viewpoints can vary between frames, and pedestrians are likely to be occluded by other objects. The iLIDS–VID dataset was used [21, 22]. This dataset was developed for person re-identification and was created by observing pedestrians from two disjoint camera views in a public open space. It comprises 600 image sequences from 300 distinct individuals. Each image sequence has a different length, ranging from 23 to 192 image frames, with an average of 73. Examples from the iLIDS–VID dataset are shown in Figure 5.
In our experiments, 200 and 50 people were used for training and validation, respectively, and the remaining 50 people were used for testing. To generate a positive pair, two images of the same person were randomly selected from an image sequence. To generate a negative pair, two images of different people were randomly selected from two different image sequences. By repeating this procedure, we generated 20 positive and 20 negative pairs for each person, yielding 8000 pairs for training. In addition, we doubled the size of the training dataset by horizontally reflecting all image pairs. Validation and testing data were generated and augmented in the same manner.
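The pair-generation procedure above can be sketched as follows; the person identifiers are placeholders for the actual image sequences, and the sketch reproduces the stated count of 8000 training pairs (20 positive + 20 negative for each of 200 people):

```python
import random

random.seed(0)

# Hypothetical stand-in for the iLIDS-VID training split:
# 200 people, identified by index (sequence contents elided).
train_people = list(range(200))
PAIRS_PER_PERSON = 20  # 20 positive and 20 negative pairs per person

pairs = []
for person in train_people:
    for _ in range(PAIRS_PER_PERSON):
        # Positive pair: two images of the same person.
        pairs.append((person, person, 1))
        # Negative pair: images of two different people.
        other = random.choice([p for p in train_people if p != person])
        pairs.append((person, other, 0))

print(len(pairs))  # 8000
# Horizontal reflection of every pair then doubles this to 16000.
```

Validation and testing pairs are generated the same way from their respective 50-person splits.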
In this subsection, we describe the results of an experimental evaluation in which the proposed ESCNN was applied to the pedestrian verification problem and its performance was compared with that of previous methods, including the standard Siamese method [12] and two handcrafted descriptor methods [4, 5]. Through a comparison with the standard Siamese model, the structural effectiveness of the ESCNN can be demonstrated because the standard Siamese model can be considered a special case of the ESCNN without the extension. That is, the experiment was designed to verify the effectiveness of the extension part of the ESCNN without comparison with other works [16, 23] that focus on better loss functions. In the descriptor-based methods [4, 5], the verification score was used to determine whether a pair of images represented the same object. In the experiment, the training dataset was augmented by horizontal reflection and divided into mini-batches of size 64. Both the SCNN and ESCNN deep learning methods were implemented on a single NVIDIA GeForce GTX Titan X with an Intel Core i5-4670. The Adam optimizer [24] was used with a constant learning rate of 0.0001 to train the proposed ESCNN and the other deep learning methods.
Figure 6 shows examples of the results produced by the methods compared. As shown in the first row of Figure 6(a), all the methods correctly identified positive pairs with clean backgrounds. However, the descriptor-based methods failed to cope with the more complicated pairs, as shown in the second and third rows of Figure 6(a), and it appears that the presence of key points from different backgrounds degraded the verification performance. The standard Siamese model succeeded in recognizing these examples but still had difficulty recognizing the most difficult examples, as shown in the last two rows of Figure 6(a). The ESCNN successfully recognized all examples despite a variety of background conditions.
Figure 6(b) shows the results of some negative examples. All competing methods gave correct answers for easy examples, such as that in the first row, where the two people had different hairstyles and wore different-looking clothes. The two pairs given in the second and third rows of Figure 6(b), however, were more difficult to distinguish because of their similar appearances; thus, the descriptor-based methods failed to cope with these harder negative examples. The standard Siamese model succeeded in distinguishing these two pairs but failed for the most difficult examples, such as those given in the last two rows of Figure 6(b). The ESCNN succeeded in distinguishing all five negative examples, including the last.
To provide a concrete comparison, the results of all methods compared are summarized in terms of the following five measures.
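Four of the five measure names survive in Table 1; they are computed from the binary confusion counts in the standard way (the following is an illustrative sketch with made-up counts, not the paper's data):

```python
def verification_metrics(tp, fp, tn, fn):
    """TPR (recall), TNR (specificity), PPV (precision), and accuracy
    from binary confusion counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Made-up confusion counts for illustration only.
m = verification_metrics(tp=90, fp=10, tn=80, fn=20)
print(m["accuracy"])  # 0.85
```

A low TNR corresponds to frequently declaring different people the same person, which is the failure mode discussed for the standard Siamese model below.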
The results of the quantitative comparison are listed in Table 1. Two different versions of the ESCNN were used for the comparison: that explained in Section 3 and that in which
From the results in Table 1, it may be noted that the classical descriptor-based matching methods exhibited relatively low performance compared with the deep learning methods and were not sufficient to address this challenging verification problem. The standard Siamese model is a deep learning method, but its performance was insufficient on all measures except TPR. This means that the standard Siamese model often mistook different people for the same person, which is a critical fault in many practical applications such as object tracking and surveillance systems. The two ESCNN variants achieved excellent verification performance, providing significant improvements over the competing methods. In particular, the TNR measure showed the largest improvement among all measures, which means that the extension part of the ESCNN contributed significantly to learning discriminative features for negative input pairs. Of the two proposed versions, ESCNN performed slightly better than ESCNN-abs in terms of all measures except TPR; thus, it can be concluded that
The experimental results show that the extension allows the ESCNN to learn more sophisticated discriminative features that provide effective cues for accurate verification. However, it should be noted that the features of the extension part can only be calculated when both images are provided. This implies that the proposed ESCNN cannot be used for the feature embedding of a single image.
A direct comparison of the proposed ESCNN with the standard Siamese model might be unfair because the proposed ESCNN uses more weights than the standard Siamese model owing to the extension. Thus, to provide a fair comparison, a small ESCNN was constructed with no more trainable weights than the standard Siamese model. This variant is denoted as ESCNN-tiny, and its configuration is specified by the bracketed numbers in Figure 2. ESCNN-tiny was trained in the same manner as the ESCNN. Table 2 presents a comparison of the number of weights used by the different networks, and Table 3 summarizes the performance of ESCNN-tiny compared with that of the standard Siamese model.
As shown in the table, ESCNN-tiny used a smaller number of trainable weights but demonstrated better performance than the standard Siamese model. This clearly shows that the excellent performance of the ESCNN was not simply the result of additional parameters but also of the structural superiority of its design. The ROC curve of ESCNN-tiny is also compared with those of the other models in Figure 7.
In this study, we have proposed a new discriminative feature learning method called ESCNN to solve the visual object verification problem. By exploiting the differences between the images applied to the two branches, the proposed ESCNN learns independent, relative, and discriminative features for the image pairs. The physical properties of the discriminative features learned by the ESCNN were demonstrated by feature visualization, and the performance of the proposed ESCNN was compared with that of previous methods in terms of recall, specificity, precision,
No potential conflict of interest relevant to this article was reported.
Architecture of a conventional SCNN. The network is trained by contrastive loss in the training stage, whereas a distance function is used to compute the similarity metric in the testing stage.
The proposed ESCNN architecture, which consists of three parts: (a) Siamese, (b) extension, and (c) decision parts. The feature dimensions are denoted as
Training strategy of the proposed network. The network is optimized by a combination of two loss functions: 1) contrastive loss for the Siamese part and 2) cross-entropy loss for all parts, including the extension and decision parts.
Table 1. Quantitative results.
Method | TPR (recall) | TNR (specificity) | PPV (precision) | F1 score | Accuracy
---|---|---|---|---|---
SIFT [4] | 0.886 | 0.890 | 0.889 | 0.888 | 0.888
SURF [5] | 0.748 | 0.751 | 0.750 | 0.750 | 0.750
Standard Siamese model [12] | 0.979 | 0.849 | 0.866 | 0.919 | 0.914
ESCNN-abs | | 0.905 | 0.913 | 0.952 | 0.949
ESCNN | 0.983 | | | | 0.977
Table 2. Comparison of the number of trainable weights.
Weight type | Standard Siamese model [12] | ESCNN & ESCNN-abs | ESCNN-tiny
---|---|---|---
Convolution | 144,540 | 612,540 | 62,820
Convolution (bias) | 300 | 1,100 | 350
Batch normalization | 800 | 2,400 | 900
Fully connected layer | 100,200 | 350,200 | 105,200
Fully connected layer (bias) | 102 | 102 | 102
Total number of weights | 245,942 | 966,342 | 169,372
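As a quick consistency check, the per-type counts in Table 2 can be summed to recompute the totals:

```python
# Per-type trainable weight counts as listed in Table 2.
counts = {
    "Standard Siamese model": [144540, 300, 800, 100200, 102],
    "ESCNN & ESCNN-abs":      [612540, 1100, 2400, 350200, 102],
    "ESCNN-tiny":             [62820, 350, 900, 105200, 102],
}
totals = {name: sum(parts) for name, parts in counts.items()}
print(totals["ESCNN-tiny"])  # 169372
```

The recomputed totals agree with the table's bottom row, and ESCNN-tiny indeed uses fewer weights than the standard Siamese model.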
International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 339-349
Published online December 25, 2022 https://doi.org/10.5391/IJFIS.2022.22.4.339
Copyright © The Korean Institute of Intelligent Systems.
School of Information Technology, Sungkonghoe University, Seoul, Korea
Correspondence to:Sungjun Hong (sjhong@skhu.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Siamese convolutional neural networks (SCNNs) has been considered as among the best deep learning architectures for visual object verification. However, these models involve the drawback that each branch extracts features independently without considering the other branch, which sometimes lead to unsatisfactory performance. In this study, we propose a new architecture called an extended SCNN (ESCNN) that addresses this limitation by learning both independent and relative features for a pair of images. ESCNNs also have a feature augmentation architecture that exploits the multi-level features of the underlying SCNN. The results of feature visualization showed that the proposed ESCNN can encode relative and discriminative information for the two input images at multi-level scales. Finally, we applied an ESCNN model to a person verification problem, and the experimental results indicate that the ESCNN achived an accuracy of 97.7%, which outperformed an SCNN model with 91.4% accuracy. The results of ablation studies also showed that a small version of the ESCNN performed 5.6% better than an SCNN model.
Keywords: Discriminative feature, Feature augmentation, Object verification, Siamese convolutional neural network
Given a pair of images, determining whether they show the same or different objects is an important problem in computer vision and image analysis [1]. This problem is called visual object verification, and it plays a key role in large-scale computer vision systems such as face verification [2] and object tracking [3]. Visual object verification differs from visual categorization and is especially challenging owing to issues such as changes in camera viewpoint, variations in illumination, and occlusion. For example, the visual appearance of a single pedestrian may vary owing to changes in perspective, pose, or lighting.
Over the past decade, handcrafted descriptors such as SIFT and SURF have been used in previous studies on visual object verification [4, 5]. Unfortunately, methods based on descriptors do not perform well for this problem owing to the lack of category-specific training [6]. Furthermore, the handcrafted descriptors are designed to be robust to small differences; thus, they are not well-suited for the visual verification problem, which requires even small differences between two images to be exploited as fully as possible. In contrast, convolutional neural networks (CNNs) have become popular for their ability to learn rich features and have demonstrated record-breaking performance in many computer vision tasks over the past few years, including semantic segmentation [7] and image classification [8]. CNNs have also been applied to the visual object verification problem using learned features in [9, 10].
The Siamese convolutional neural network (SCNN) proposed by LeCun et al. [2] has received considerable attention as a possible solution to the visual object-verification problem [11–14]. Typical SCNNs learn not only the features but also a similarity metric for a given visual object verification problem. Their topology is shown in Figure 1. As shown in the figure, SCNNs have two parallel branches, one for each image, and the two branches share the same weights. Thus, if the two images provided to the SCNN are the same, the corresponding output feature maps should also be the same. At the end of the network, distance functions such as Euclidean distance are used to provide the similarity (more precisely, distance) metric for the two given images. The output map from each branch can be viewed as a feature that discriminates the given image from that of the opposite branch. Following the success of SCNNs, several variants have been proposed to learn similarity metrics, such as the Triplet [15] and Quadruplet [16] networks.
However, SCNNs and their variants have the weakness that the feature map computed for each branch depends only on the image applied to that branch and is completely independent of the image applied to the opposite branch. More specifically, let us consider the following two cases. In the first case, the two images,
To solve this problem, we propose a new SCNN variant called extended SCNN (ESCNN). The proposed ESCNN learns not only independent features for each image in a pair but also their relative features to improve its ability to discriminate between the two images. To learn the two types of features, ESCNNs include a feature augmentation architecture designed to exploit the multilevel features of the underlying conventional SCNN. The augmented features provide discriminatory information for two images that offer additional visual object verification cues. Furthermore, inspired by recent successful applications of fully connected layers (FCLs) in similarity metric learning [9,11,12], a fully connected layer is also used in the ESCNN to decide whether the input pairs are from the same or different objects.
The two main contributions of this study are summarized as follows.
A new SCNN variant called an extended SCNN (ESCNN) is proposed for visual object verification. Compared with an SCNN, the proposed ESCNN exploits relative information between branches and extracts new discriminative features to improve its performance in learning a similarity metric.
The performance of ESCNNs for visual object verification is analyzed from two perspectives. First, the feature maps of each ESCNN layer are visualized and we show that the proposed ESCNN can to encode relative and discriminative information for the two input images. Second, the proposed ESCNN is applied to the person verification problem and we demonstrate that the ESCNN significantly outperformed previous methods using handcrafted descriptors (e.g., SIFT, SURF) and a conventional SCNN. Further, the results show that ESCNNs can even demonstrate better performance with fewer weights than an SCNN would require.
The remainder of this study is organized as follows. In Section 2, we briefly review some related works and present a description of the problem considered. Section 3 introduces the proposed ESCNN and its training strategy. Section 4 discusses the experimental setup, including the dataset used, other models used for comparison, and the implementation details. The results are then described, and the advantages of the ESCNN are compared with those of the other models. Finally, Section 5 presents our conclusions.
A typical approach to visual object verification is to extract features from a pair of images and compute a similarity metric between these features [6]. To extract the features, handcrafted feature descriptors such as SIFT and SURF, have been used for the verification problem [4, 5]. However, handcrafted features often require expensive human labor and rely on expert knowledge. Without category-specific training, these methods do not perform well in visual object verification [6]. Recently, a method was proposed to learn these features [17]. However, further processing is generally required to compute the similarity metric using methods such as support-vector machines or Euclidean distance.
Thus, deep learning has attracted increasing attention in the field of computer vision, and several deep learning methods have been proposed. Naturally, similarity metric learning methods using deep learning have also been proposed [9, 11, 18]. In particular, the authors of a recent study [18] demonstrated that the features learned by a CNN trained on ImageNet outperformed SIFT. These advances have encouraged the use of CNNs in various practical applications, including face verification [2].
The most popular CNN architecture for similarity metric learning is the SCNN model proposed by LeCun et al. [2]. As shown in Figure 1, the SCNNs have two branches that share the same architecture and weight values. In general, branch outputs are assigned to distance functions (e.g., squared Euclidean distance) to compute the similarity metrics for the input pairs [2]. However, [9, 11, 12] used multiple linear FCLs as an alternative to the distance function. In fact, the results of [11] show that this learning metric performs better than the Euclidean distance function. Thus, we used FCLs in the proposed network for similar metric learning.
In this paper, we propose a structural enhancement of a conventional SCNN. Several variants inspired by the success of SCNNs were proposed in [15, 16]. In [15], a Triplet network was proposed that consisted of three branches for more efficient training. In addition, a Quadruplet network consisting of four branches was proposed for the local descriptors [16] problems. Although SCNN variants have demonstrated outstanding performance in various tasks, they have the inherent structural limitation that they extract features for each image independently without knowledge of the other image in the pair. Therefore, to address this issue, we propose an SCNN variant that extracts new discriminative features by exploiting the relative differences in input pairs.
Visual object verification is a special case of object recognition in which the object category is known (e.g., pedestrians or cars), and one must determine whether a given pair of images represent the same object. We consider the visual object verification problem as a classification problem that takes a pair of images as input and outputs a classification result for the pair of images, either positive if the two images depict the same object or negative if the images show different objects. In other words, we developed an SCNN variant to solve the classification problem for a given dataset
where
In this section, we describe the proposed deep neural network, called an extended Siamese convolutional neural network (ESCNN). The model is designed to learn discriminative features for pairs of images and provide cues to determine whether two given images depict the same object. The ESCNN architecture, as shown in Figure 2, consists of three parts, including Siamese, extension, and decision components. These are explained in the following subsections.
To extract discriminative features for a pair of input images, we used a typical Siamese architecture with two branches. This network model processes the two inputs separately using the same convolution weights. In the proposed approach, the Siamese part consists of
RGB images of size 128
The Siamese part outputs two feature vectors
Figure 3 shows a visualization of the feature responses obtained at each layer of the proposed network except for the 5th layer (
Figure 3(a) shows a positive pair and the corresponding feature responses, from which it may be observed that the activation pairs
The decision part is formed by adding two FCLs to the previous parts, as shown in Figure 2(c). This part determines whether the two given input images represent the same object using the feature vector
where
First, a margin-based contrastive loss function was employed [6] to optimize the Siamese part of the proposed ESCNN. This loss function is intended to encourage positive pairs to move closer together while pushing negative pairs sufficiently far apart in the feature space. The contrastive loss function is defined as
where
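The margin-based contrastive loss of [6] has a standard form, which can be sketched as follows (the label convention, with y = 1 for a positive pair, is an assumption; the margin value is a hyperparameter):

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    """Margin-based contrastive loss in the Hadsell et al. form.
    y = 1 for a positive (same-object) pair, y = 0 for a negative pair."""
    d = np.linalg.norm(f1 - f2)  # Euclidean distance in feature space
    # Positives are pulled together; negatives are pushed beyond the margin.
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2

f_a, f_b = np.array([0.0, 0.0]), np.array([0.3, 0.4])  # distance 0.5
print(contrastive_loss(f_a, f_b, y=1))  # 0.25 = d^2
print(contrastive_loss(f_a, f_b, y=0))  # 0.25 = (1.0 - 0.5)^2
```

Note that a negative pair already separated by more than the margin contributes zero loss, so training effort concentrates on hard negatives.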
A cross-entropy loss function is used to train the extension and decision parts, which is defined as
where
where
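The two objectives can then be combined into a single training loss; a minimal sketch is given below, where the balance weight `lam` is a hypothetical hyperparameter, not a value taken from the paper:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for the two-class same/different decision."""
    z = logits - logits.max()            # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def total_loss(loss_contrastive, logits, label, lam=1.0):
    # Weighted sum of the Siamese-part (contrastive) objective and the
    # extension/decision-part (cross-entropy) objective.
    return loss_contrastive + lam * cross_entropy(logits, label)

loss = total_loss(0.25, np.array([2.0, 0.0]), label=0)
print(round(loss, 4))
```

In practice the contrastive term shapes the branch embeddings while the cross-entropy term trains the pair classifier end to end.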
In this section, the proposed ESCNN is applied to the pedestrian verification problem defined by [20]. This is a challenging problem because different people may wear similar clothing, illumination and viewpoint can vary between frames, and pedestrians are often occluded by other objects. The iLIDS–VID dataset was used [21, 22]. This dataset was developed for person re-identification and was created by observing pedestrians from two disjoint camera views in a public open space. It comprises 600 image sequences from 300 distinct individuals. Each image sequence has a different length, ranging from 23 to 192 frames, with an average of 73. Examples from the iLIDS–VID dataset are shown in Figure 5.
In our experiments, 200 and 50 people were used for training and validation, respectively, and the remaining 50 people were used for testing. To generate a positive pair, two images of the same person were randomly selected from an image sequence. To generate a negative pair, two images of different people were randomly selected from two different image sequences. By repeating this procedure, we generated 20 positive and 20 negative pairs for each person, yielding 8000 pairs for training. In addition, we doubled the size of the training dataset by horizontally reflecting all image pairs. Validation and testing data were generated and augmented in the same manner.
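The pair-generation procedure described above can be sketched as follows (frame identifiers and the `make_pairs` helper are illustrative, not from the authors' code; the horizontal-flip augmentation step is omitted):

```python
import random

def make_pairs(sequences, pairs_per_person=20, seed=0):
    """sequences: dict mapping person_id -> list of frame identifiers.
    Returns (frame_a, frame_b, label) triples, label 1 = same person."""
    rnd = random.Random(seed)
    people = list(sequences)
    pairs = []
    for pid in people:
        for _ in range(pairs_per_person):
            # Positive pair: two distinct frames of the same person.
            a, b = rnd.sample(sequences[pid], 2)
            pairs.append((a, b, 1))
            # Negative pair: one frame each from two different people.
            other = rnd.choice([q for q in people if q != pid])
            pairs.append((rnd.choice(sequences[pid]),
                          rnd.choice(sequences[other]), 0))
    return pairs

# 200 training identities with 30 frames each (placeholder sequence length).
seqs = {p: [f"{p}_{i}" for i in range(30)] for p in range(200)}
pairs = make_pairs(seqs)
print(len(pairs))  # 200 people x 20 x (1 pos + 1 neg) = 8000
```

Horizontal reflection of every pair would then double this to 16,000 training pairs, matching the augmentation described in the text.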
In this subsection, we describe the results of an experimental evaluation in which the proposed ESCNN was applied to the pedestrian verification problem and its performance was compared with that of previous methods, including the standard Siamese method [12] and two handcrafted descriptor methods [4, 5]. A comparison with the standard Siamese model demonstrates the structural effectiveness of the ESCNN, because the standard Siamese model can be considered a special case of the ESCNN without the extension. That is, the experiment was designed to verify the effectiveness of the extension part of the ESCNN, rather than to compare against other works [16, 23] that focus on better loss functions. In the descriptor-based methods [4, 5], the verification score was used to determine whether a pair of images represented the same object. In the experiment, the training dataset was augmented by horizontal reflection and divided into mini-batches of size 64. Both the SCNN and ESCNN deep learning methods were implemented on a single NVIDIA GeForce GTX Titan X with an Intel Core i5-4670. The Adam optimizer [24] was used with a constant learning rate of 0.0001 to train the proposed ESCNN and the other deep learning methods.
Figure 6 shows examples of the results produced by the methods compared. As shown in the first row of Figure 6(a), all the methods correctly identified positive pairs with clean backgrounds. However, the descriptor-based methods failed to cope with the more complicated pairs, as shown in the second and third rows of Figure 6(a), and it appears that the presence of key points from different backgrounds degraded the verification performance. The standard Siamese model succeeded in recognizing these examples but still had difficulty recognizing the most difficult examples, as shown in the last two rows of Figure 6(a). The ESCNN successfully recognized all examples despite a variety of background conditions.
Figure 6(b) shows the results of some negative examples. All competing methods gave correct answers for easy examples, such as that in the first row, where the two people had different hairstyles and wore different-looking clothes. The two pairs given in the second and third rows of Figure 6(b), however, were more difficult to distinguish because of their similar appearances; thus, the descriptor-based methods failed to cope with these harder negative examples. The standard Siamese model succeeded in distinguishing these two pairs but failed for the most difficult examples, such as those given in the last two rows of Figure 6(b). The ESCNN succeeded in distinguishing all five negative examples, including the last.
To provide a concrete comparison, the results of all methods compared are summarized in terms of the following five measures.
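The exact list of measures follows in the text; a sketch consistent with the measures named elsewhere in the paper (accuracy, TPR/recall, TNR/specificity, precision, and F-measure, an assumed fifth measure) is:

```python
def verification_metrics(preds, labels):
    """Confusion-matrix measures for a binary same/different verifier.
    preds/labels: 1 = same object (positive), 0 = different (negative)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return {
        "accuracy": (tp + tn) / len(labels),
        "tpr": tp / (tp + fn),          # recall / sensitivity
        "tnr": tn / (tn + fp),          # specificity
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

m = verification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(m["accuracy"])  # 0.6
```

A high TPR with a low TNR, as observed for the standard Siamese model below, indicates a verifier biased toward declaring pairs "same".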
The results of the quantitative comparison are listed in Table 1. Two different versions of the ESCNN were used for the comparison: that explained in Section 3 and that in which
From the results in Table 1, it may be noted that the classical descriptor-based matching methods exhibited relatively low performance compared with the deep learning methods, which was not sufficient to address this challenging verification problem. The standard Siamese model is a deep learning method, but its performance was insufficient on all measures except TPR. This means that the standard Siamese model often mistook different people for the same person, which is a critical fault for many practical applications such as object tracking and surveillance systems. The two ESCNN variants achieved excellent verification performance, providing significant improvements over the competing methods. In particular, the TNR measure showed the largest improvement of all measures, which means that the extension part of the ESCNN contributed significantly to learning discriminative features for negative input pairs. Of the two proposed versions, ESCNN performed slightly better than ESCNN-abs on all measures except TPR; thus, it can be concluded that
The experimental results show that the extension allows the ESCNN to learn more sophisticated discriminative features that provide effective cues for accurate verification. However, it should be noted that the features of the extension part can only be calculated when both images are provided. This implies that the proposed ESCNN cannot be used for the feature embedding of a single image.
A direct comparison of the proposed ESCNN with the standard Siamese model might be unfair because the proposed ESCNN uses more weights than the standard Siamese model owing to the extension. Thus, to provide a fair comparison, a small ESCNN was constructed with a number of weights less than or equal to that of the standard Siamese model. This variant is denoted as ESCNN-tiny, and its configuration is specified by the bracketed numbers in Figure 2. ESCNN-tiny was trained in the same manner as ESCNN. Table 2 presents a comparison of the number of weights used by different networks. Table 3 summarizes the performance of ESCNN-tiny and compares it to that of the standard Siamese model.
As shown in the table, ESCNN-tiny used fewer trainable weights yet demonstrated better performance than the standard Siamese model. This clearly shows that the excellent performance of the ESCNN was not simply the result of additional features, but also of the structural superiority of the design. The ROC curve of ESCNN-tiny is compared with those of the other methods in Figure 7.
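The per-type weight counts in Table 2 can be summed as a quick sanity check of the totals and of the claim that ESCNN-tiny is smaller than the standard Siamese model:

```python
# Per-type trainable weight counts transcribed from Table 2:
# [convolution, convolution bias, batch norm, FCL, FCL bias]
weights = {
    "Standard Siamese [12]": [144_540, 300, 800, 100_200, 102],
    "ESCNN / ESCNN-abs":     [612_540, 1_100, 2_400, 350_200, 102],
    "ESCNN-tiny":            [62_820, 350, 900, 105_200, 102],
}

totals = {name: sum(counts) for name, counts in weights.items()}
print(totals)
# ESCNN-tiny uses fewer weights than the standard Siamese model,
# so its performance gain cannot be attributed to extra capacity.
assert totals["ESCNN-tiny"] < totals["Standard Siamese [12]"]
```

The computed totals (245,942; 966,342; 169,372) match the bottom row of Table 2.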
In this study, we have proposed a new discriminative feature learning method called ESCNN to solve the visual object verification problem. By exploiting the differences between the images applied to the two branches, the proposed ESCNN learns independent, relative, and discriminative features for the image pairs. The physical properties of the discriminative features learned by the ESCNN were demonstrated by feature visualization, and the performance of the proposed ESCNN was compared with that of previous methods in terms of recall, specificity, precision,
Architecture of a conventional SCNN. The network is trained by contrastive loss in the training stage, whereas a distance function is used to compute the similarity metric in the testing stage.
The proposed ESCNN architecture, which consists of three parts: (a) Siamese, (b) extension, and (c) decision parts. The feature dimensions are denoted as
Visualization of the features learned by the ESCNN: (a) positive and (b) negative samples.
Training strategy of the proposed network. The network is optimized by a combination of two loss functions: 1) contrastive loss for the Siamese part and 2) cross-entropy loss for all parts, including the extension and decision parts.
Examples from the iLIDS–VID dataset.
Some example results: (a) positive and (b) negative samples.
ROC curves for the methods under consideration.
Table 2. Comparison of the number of trainable weights.
Weight type | Standard Siamese model [12] | ESCNN & ESCNN-abs | ESCNN-tiny |
---|---|---|---|
Convolution | 144,540 | 612,540 | 62,820 |
Convolution (bias) | 300 | 1,100 | 350 |
Batch normalization | 800 | 2,400 | 900 |
Fully connected layer | 100,200 | 350,200 | 105,200 |
Fully connected layer (bias) | 102 | 102 | 102 |
Total number of weights | 245,942 | 966,342 | 169,372 |