International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(4): 339-349
Published online December 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.4.339
© The Korean Institute of Intelligent Systems
School of Information Technology, Sungkonghoe University, Seoul, Korea
Correspondence to: Sungjun Hong (sjhong@skhu.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Siamese convolutional neural networks (SCNNs) have been considered among the best deep learning architectures for visual object verification. However, these models have the drawback that each branch extracts features independently, without considering the other branch, which sometimes leads to unsatisfactory performance. In this study, we propose a new architecture, called an extended SCNN (ESCNN), that addresses this limitation by learning both independent and relative features for a pair of images. ESCNNs also have a feature augmentation architecture that exploits the multi-level features of the underlying SCNN. Feature visualization shows that the proposed ESCNN can encode relative and discriminative information for the two input images at multiple scales. Finally, we applied an ESCNN model to a person verification problem; the experimental results indicate that the ESCNN achieved an accuracy of 97.7%, outperforming an SCNN model with 91.4% accuracy. Ablation studies also showed that a small version of the ESCNN performed 5.6% better than an SCNN model.
Keywords: Discriminative feature, Feature augmentation, Object verification, Siamese convolutional neural network
No potential conflict of interest relevant to this article was reported.
Architecture of a conventional SCNN. The network is trained by contrastive loss in the training stage, whereas a distance function is used to compute the similarity metric in the testing stage.
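As a concrete illustration of the caption above, the following sketch shows the standard contrastive loss commonly used to train Siamese networks (in the form of Hadsell et al.), together with the Euclidean distance used as the similarity metric at test time. This is a generic reference implementation, not code from the paper; the margin value and function names are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    """Standard contrastive loss for a pair of branch features.

    y = 1 for a positive (same-identity) pair, y = 0 for a negative pair.
    Positive pairs are pulled together; negative pairs are pushed apart
    until their distance exceeds the margin.
    """
    d = np.linalg.norm(f1 - f2)
    return y * d**2 + (1 - y) * max(margin - d, 0.0) ** 2

def distance(f1, f2):
    """At test time, the same Euclidean distance serves as the similarity metric."""
    return np.linalg.norm(f1 - f2)
```

For example, an identical positive pair incurs zero loss, while an identical negative pair incurs the full margin penalty, which is what drives the two branches to produce separated embeddings for different identities.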
The proposed ESCNN architecture, which consists of three parts: (a) Siamese, (b) extension, and (c) decision parts. The feature dimensions are denoted as
Visualization of the features learned by the ESCNN: (a) positive and (b) negative samples.
Training strategy of the proposed network. The network is optimized by a combination of two loss functions: 1) contrastive loss for the Siamese part and 2) cross-entropy loss for all parts, including the extension and decision parts.
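The combined objective described in this caption can be sketched as a weighted sum of the two losses: a contrastive term on the Siamese features and a cross-entropy term on the decision part's output probability. The balancing weight `lam` and the exact combination rule are assumptions for illustration; the paper's own weighting is not given in this excerpt.

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    # Metric-learning term on the Siamese branch features (y = 1: same pair).
    d = np.linalg.norm(f1 - f2)
    return y * d**2 + (1 - y) * max(margin - d, 0.0) ** 2

def cross_entropy(p, y, eps=1e-12):
    # Binary cross-entropy on the decision part's "same identity" probability p.
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def total_loss(f1, f2, p, y, lam=1.0):
    # Hypothetical combined objective: contrastive term plus a weighted
    # cross-entropy term, optimized jointly over all parts of the network.
    return contrastive_loss(f1, f2, y) + lam * cross_entropy(p, y)
```

Jointly minimizing both terms lets the Siamese part keep a well-structured embedding space while the extension and decision parts learn a pair-level classifier on top of it.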
Examples from the iLIDS–VID dataset.
Some example results: (a) positive and (b) negative samples.
ROC curves for the methods under consideration.
Table 2. Comparison of the number of trainable weights.

| Weight type | Standard Siamese model [12] | ESCNN & ESCNN-abs | ESCNN-tiny |
|---|---|---|---|
| Convolution | 144,540 | 612,540 | 62,820 |
| Convolution (bias) | 300 | 1,100 | 350 |
| Batch normalization | 800 | 2,400 | 900 |
| Fully connected layer | 100,200 | 350,200 | 105,200 |
| Fully connected layer (bias) | 102 | 102 | 102 |
| Total number of weights | 245,942 | 966,342 | 169,372 |