Auto-Encoder Variants for Solving Handwritten Digits Classification Problem
International Journal of Fuzzy Logic and Intelligent Systems 2020;20(1):8-16
Published online March 25, 2020
© 2020 Korean Institute of Intelligent Systems.

Muhammad Aamir1, Nazri Mohd Nawi2, Hairulnizam Bin Mahdin1, Rashid Naseem3, and Muhammad Zulqarnain1

1Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
2Soft Computing and Data Mining Center, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
3Department of IT and Computer Science, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Haripur, Pakistan
Correspondence to: Muhammad Aamir
Received November 30, 2019; Revised February 20, 2020; Accepted March 2, 2020.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Auto-encoders (AEs) have been proposed for solving many problems in machine learning and deep learning over the last few decades. Owing to their satisfactory performance, multiple variants have recently appeared. First, we introduce the conventional AE model and its variants for learning abstract features from data by using a contrastive divergence algorithm. Second, we present the major differences among three popular AE variants: sparse AE (SAE), denoising AE (DAE), and contractive AE (CAE). Third, as the main contribution of this study, we compare these three variants on the basis of their mathematical modeling and experiments. All the variants of the standard AE are evaluated on the MNIST benchmark handwritten digit dataset for the classification problem. The results reveal the benefit of using the AE model and its variants. From the experiments, we conclude that CAE achieved better classification accuracy than SAE and DAE.

Keywords : Sparse auto-encoder (SAE), Denoising auto-encoder (DAE), Contractive auto-encoder (CAE), MNIST, Classification
1. Introduction

Auto-encoders (AEs) are unsupervised neural networks trained with back-propagation to map a high-dimensional input feature set to a low-dimensional feature set and then recover the original feature set from that low-dimensional output. Reducing high-dimensional data to low-dimensional data is known as encoding, while reconstructing the original data from the low-dimensional data is called decoding. The AE was proposed to improve the reconstruction reliability of low-dimensional feature sets. Notably, feature-set selection is the most important step in solving big-data problems and processing complex data with high-dimensional attributes [1]. In addition, feature selection is a valuable method for performing classification and prediction [2, 3]. It separates the useful and effective features from the ineffective and useless ones within the feature representation space. However, if irrelevant raw features are provided as input, feature selection might fail [4, 5]. Using AEs is one of the most efficient approaches for feature reduction, as it results in satisfactory classification. The primary aim of AEs is to identify more useful and informative features from a large amount of data without applying any feature-reduction technique [6–8].

To improve the efficiency of the standard AE, [9] proposed an AE-based learning approach with two-stage architecture. They stacked the AE by considering both intra- and inter-modal semantics and named it sparse AE (SAE). Similarly, [10] proposed a feature-learning approach by stacking the contractive AE (CAE) and naming it sCAE. Their research work focused on extracting the temporal feature difference of superpixel and noise suppression. To benefit from both the characteristics of AE and properties of conventional algorithms, [11, 12] merged the CAE with the convolutional neural network to form a deep architecture named DSCNN, to solve the problem of both the presence of speckle noise and absence of effective features in single-polarized SAR (synthetic aperture radar) data.

Notably, some popular AE variants are SAEs [13], denoising AEs (DAEs) [14], and CAEs [15]. In this study, we first introduce the AE and then present these three variants. We conducted experiments to empirically compare their performance and properties, and we give a comprehensive analysis based on the experimental results. The conclusion is presented in the last section.

2. Auto-Encoder

AEs are also called auto-associators. In the domain of machine learning, an AE is a neural network with three layers, and AEs have been investigated by many researchers since the 1990s [16]. The entire processing inside an AE proceeds in two phases, namely encoding and decoding, as depicted in Figure 1.

2.1 Encoding

Encoding is the process of mapping the input feature set to its intermediate representation in the hidden layer, which is given as follows:


where f(x) denotes the output of the input layer that is passed as input to the hidden layer, w the weights applied to each input, and b the bias value associated with the input feature set.

2.2 Decoding

The process of mapping the output of the hidden layer back into the input feature set is called decoding, which is given as follows:


where g(y) denotes the output computed from the hidden-layer representation y, w′ the weights applied to the hidden-layer outputs, and b′ the associated bias.
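The two mappings described in Sections 2.1 and 2.2 can be written compactly. The following is a sketch consistent with the symbols defined above, assuming the usual affine map followed by an activation function. Here $s_e$ and $s_d$ are the encoding and decoding activation functions, and the symbols $y$ for the hidden representation and $\hat{x}$ for the reconstruction are notation assumed for this sketch:

```latex
\begin{align}
y &= f(x) = s_e(wx + b), \\
\hat{x} &= g(y) = s_d(w'y + b').
\end{align}
```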

The terms s_e and s_d denote the encoding- and decoding-activation functions, respectively, and are given by Eq. (3) for nonlinear representation (sigmoid function) and by Eq. (4) for linear representation (hyperbolic tangent function). One has the following:
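To make the forward pass concrete, the following NumPy sketch applies an encoder and a decoder with either the logistic sigmoid, 1 / (1 + exp(-z)), or the hyperbolic tangent as the activation. It is an illustration only, not the authors' TensorFlow code; the function names, layer sizes, and random initialization are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z)); outputs lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, w, b, s_e=sigmoid):
    """Hidden representation y = s_e(w x + b)."""
    return s_e(w @ x + b)

def decode(y, w_prime, b_prime, s_d=sigmoid):
    """Reconstruction g(y) = s_d(w' y + b')."""
    return s_d(w_prime @ y + b_prime)

rng = np.random.default_rng(0)
n, m = 8, 3                        # illustrative sizes: n inputs, m hidden units
x = rng.random(n)                  # one input vector with values in [0, 1)
w, b = rng.normal(scale=0.1, size=(m, n)), np.zeros(m)
w_prime, b_prime = rng.normal(scale=0.1, size=(n, m)), np.zeros(n)

y = encode(x, w, b)                                 # sigmoid encoder
x_hat = decode(y, w_prime, b_prime)                 # sigmoid decoder
x_hat_tanh = decode(y, w_prime, b_prime, np.tanh)   # tanh decoder
print(y.shape, x_hat.shape)
```

The sigmoid keeps every output in (0, 1), which suits pixel intensities scaled to that range, whereas tanh produces outputs in (-1, 1).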


The primary aim of reconstruction is to generate outputs that are as similar as possible to the original inputs by reducing the reconstruction error. The reconstruction layer uses the following parameter set to reconstruct the original inputs:


Let us assume that the input feature set is Di = [x1, x2, x3, …, xn]. The reconstruction error is then reduced by minimizing the following cost function:


where R denotes the reconstruction error. In the case of linear representation, it is the Euclidean distance, whereas in the case of nonlinear representation, it is the cross-entropy loss. To avoid overfitting, large weights are penalized by adding a weight-decay term to Eq. (6), which gives the following simple regularized form of Eq. (6):


where the relative importance of regularization is controlled via the weight-decay coefficient λ. The complete structure and step-by-step working mechanism of the proposed model are depicted in Figure 1 and explained in Algorithm 1.
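As a worked form of the objective described in this section, the following sketch writes out the cost with and without weight decay. It assumes the squared Euclidean distance or the cross-entropy for R and an L2 penalty on the weights. The names $J_{\mathrm{AE}}$ and $J_{\mathrm{AE+wd}}$, the symbol $\hat{x} = g(f(x))$, and the exact normalization are assumptions and may differ from the original equations:

```latex
\begin{align}
J_{\mathrm{AE}}(\theta) &= \sum_{x \in D_i} R\bigl(x, g(f(x))\bigr), \\
R(x, \hat{x}) &= \lVert x - \hat{x} \rVert^2
  \quad \text{(Euclidean case)}, \\
R(x, \hat{x}) &= -\sum_{k} \bigl[ x_k \log \hat{x}_k + (1 - x_k) \log (1 - \hat{x}_k) \bigr]
  \quad \text{(cross-entropy case)}, \\
J_{\mathrm{AE+wd}}(\theta) &= \sum_{x \in D_i} R\bigl(x, g(f(x))\bigr) + \lambda \sum_{i,j} w_{ij}^2 .
\end{align}
```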

3. Sparse Auto-Encoder

SAEs are among the simplest AE variants. They contain a sparsity term that leads a single-layer network to learn a code dictionary that reduces both the reconstruction error and the number of code-words necessary for reconstruction. The main purpose of adding a sparse penalty term to the SAE cost function is to limit the average activation value of the hidden-layer neurons [17]. Extracting sparse features from raw data is the main aim of SAEs. Two methods can be used to achieve a sparse representation. One method is to penalize the hidden-unit bias, and the other is to directly penalize the hidden unit's activation output, as discussed by [18]. Notably, [19] used a sparse multi-layered AE framework for the automatic detection of nuclei on a set of 537 marked histopathological breast-cancer images. The entire architecture comprised one input layer, two hidden layers, and one output layer. They stacked the SAE and used Softmax for classification. The input layer had 3,468 nodes, the first hidden layer 400 nodes, and the second hidden layer 255 nodes. The output of the second hidden layer was the input to the final Softmax layer, which mapped it to two classes, either 1 or 0. Generally, if the output value of a neuron is close to 0, the neuron is considered inactive, whereas if the output value is close to 1, the neuron is considered active. The aim of implementing sparsity is to limit unwanted activation. Updating Eqs. (6) and (7) with a sparsity term gives the SAE as follows:


where n denotes the total number of neurons in the hidden layer and β the weight of the sparse penalty term. In addition, C(w, b) denotes the cost function, where w represents the weight matrix and b the bias vector.
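The sparsity term can be illustrated with a short sketch. It assumes the Kullback-Leibler penalty on the average hidden activation, as in the lecture notes cited as [13]. The target activation `rho`, the default values, and the function names are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j).

    hidden: array of shape (batch, n) with sigmoid activations in (0, 1).
    rho:    target average activation (hypothetical default).
    """
    # Average activation of each hidden unit over the batch, kept away from 0 and 1.
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return kl.sum()

def sae_cost(base_cost, hidden, beta=3.0, rho=0.05):
    """Base cost C(w, b) plus beta times the sparsity penalty."""
    return base_cost + beta * kl_sparsity_penalty(hidden, rho)

rng = np.random.default_rng(0)
sparse_h = np.full((32, 4), 0.05)           # average activation equals the target
dense_h = rng.uniform(0.4, 0.9, (32, 4))    # units that are active most of the time
print(kl_sparsity_penalty(sparse_h), kl_sparsity_penalty(dense_h))
```

The penalty vanishes when each unit's average activation matches the target and grows as units become active more often, which is what pushes most hidden outputs toward 0.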

4. Denoising Auto-Encoder

DAEs, a modest alteration of standard AEs, are trained not to reconstruct their input but to denoise an artificially corrupted version of it [20]. Whereas an over-complete regular AE can easily learn a useless identity mapping, a DAE must extract more useful features to solve the considerably more difficult denoising problem. The main role of DAEs is to reconstruct the original data from noisy data. DAEs are used for the following two purposes: encoding the noisy data and recovering the original input data in the reconstructed output. When denoising encoders are stacked in different layers, they form stacked DAEs [21, 22]. Notably, [23] applied a weighted reconstruction loss function to the conventional DAE for noise classification in a speech-enhancement system. They stacked several weighted DAEs to construct the model. In their experiments, they performed 50 steps with the number of input nodes ranging from 50 to 100. The clean data comprised 8 languages, and white noise with signal-to-noise ratios (SNRs) of 6, 12, and 18 dB was selected from the NTT database to train the model. The model was trained on 1 hour of data and tested on 8 minutes of data. In the literature, DAEs have been the center of interest for many researchers in different contexts, as per [12, 13]. Based on AEs and Eqs. (6) and (7), the DAE is given by the following:


A training input $x \in D_n$ is first corrupted by additive Gaussian noise of covariance $\sigma^2 I$. The corrupted version $r$ is encoded into a hidden representation $h \in \mathbb{R}^{d_h}$ via an affine mapping followed by a nonlinearity, $h = \mathrm{encode}(r) = \mathrm{sigmoid}(Wr + b)$, where $r \in \mathbb{R}^{d}$, $h \in (0,1)^{d_h}$, $W$ is a $d_h \times d$ matrix, and $b \in \mathbb{R}^{d_h}$. The hidden-layer representation $h$ is decoded into the reconstruction $x^r$ via an affine mapping such that $x^r = \mathrm{decode}(h) = W^{T}h + c$, where $c \in \mathbb{R}^{d}$. The parameters $\theta = \{W, b, c\}$ are optimized to minimize the squared reconstruction error $\lVert x^r - x \rVert^2$.
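The corruption, encoding, and decoding steps described above can be sketched in a few lines of NumPy. This is a minimal illustration with tied weights (the decoder uses the transpose of W); the sizes, the noise level `sigma`, and the function names are assumptions, and it is not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_forward(x, W, b, c, sigma, rng):
    """Corrupt x, encode the corrupted copy, decode with tied weights.

    Returns the reconstruction and the squared error against the CLEAN input.
    """
    r = x + rng.normal(scale=sigma, size=x.shape)   # additive Gaussian corruption
    h = sigmoid(W @ r + b)                          # h = encode(r), values in (0, 1)
    x_r = W.T @ h + c                               # x_r = decode(h), affine map
    loss = np.sum((x_r - x) ** 2)                   # target is x, not r
    return x_r, loss

rng = np.random.default_rng(0)
d, d_h = 6, 3                                       # illustrative sizes
x = rng.random(d)
W = rng.normal(scale=0.1, size=(d_h, d))
b, c = np.zeros(d_h), np.zeros(d)

x_r, loss = dae_forward(x, W, b, c, sigma=0.3, rng=rng)
print(x_r.shape, loss)
```

The key design point is that the loss compares the reconstruction with the clean input rather than with the corrupted copy, so the network cannot succeed by learning the identity mapping.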

5. Contractive Auto-Encoder

CAEs, a popular and effective variant of AEs, are based on the unsupervised-learning approach for producing valuable feature representations [24]. Models trained with a CAE learn feature representations that are highly robust to minor noise and small changes in the training data. CAEs are considered an extended form of DAEs in which a contractive penalty is added to the reconstruction error function [25]. This penalty is, in turn, used to penalize the sensitivity of the features to variations in the inputs. With the aim of learning a robust feature set, [26] proposed a CAE whose objective function includes an unconventional regularization term. Based on Eqs. (6) and (7), the CAE is given by the following:


where $J_f(x)$ denotes the Jacobian matrix of the encoder f at x. Adding the Frobenius norm of the encoder Jacobian as a penalty encourages the mapping to the feature space to be contractive in the neighborhood of the training data; that is, the intermediate feature representation becomes robust to minor variations or noise in the input data.
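For a sigmoid encoder h = sigmoid(Wx + b), the Jacobian has entries h_j(1 - h_j)W_ji, so the squared Frobenius norm has a closed form that avoids building the full matrix. The sketch below computes that closed form and checks it against a finite-difference Jacobian. The sigmoid encoder and all names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian of h = sigmoid(W x + b) at x."""
    h = sigmoid(W @ x + b)
    dh = h * (1.0 - h)                        # sigmoid derivative per hidden unit
    return np.sum(dh ** 2 * np.sum(W ** 2, axis=1))

def numerical_jacobian(x, W, b, eps=1e-6):
    """Central-difference Jacobian, used only to check the closed form."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (sigmoid(W @ (x + step) + b)
                   - sigmoid(W @ (x - step) + b)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.random(5)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)

closed_form = contractive_penalty(x, W, b)
numeric = np.sum(numerical_jacobian(x, W, b) ** 2)
print(closed_form, numeric)
```

In training, this penalty would be multiplied by a regularization weight and added to the reconstruction cost for each training input.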

6. Experimental Setup

We performed several experiments to evaluate and compare the feature-learning abilities of the three AE variants, namely, SAE, DAE, and CAE. To ensure a fair comparison, all the AE variants had the same architecture with one input layer, two hidden layers, and a Softmax regression classifier in the last layer for final classification. All the experiments were conducted on an Intel Core i3 CPU with 4 GB of RAM and the Windows 8.1 operating system. The algorithms were developed and tested in Python 2.7. For fast implementation of the compared approaches, the efficient open-source numerical computation library TensorFlow [27] was used. In our experiments, 12,000 images were considered for evaluation and comparison, and 5,000 random images were used for validation on each MNIST variant dataset. All the images show the digits 0–9 at a size of 28 × 28 pixels. A few example images, which show the shape deformation and variation in the appearance of the digits, are depicted in Figure 2. This variation makes the dataset challenging for testing the models. The MNIST variant datasets are standard (basic), random-noise background digits (bg-rand), and rotation and image background digits (bg-img-rot) [28]. The first, MNIST basic (mnist-basic), comprises the standard MNIST images without any changes. In the bg-rand variant, a random background is added to the digit image; the background value of every pixel is drawn uniformly from the fixed range of 0 to 255. In the last dataset, the variation factors of digit rotation and the addition of a random background are combined in a single dataset. Six random images from each MNIST variant dataset are depicted in Figures 2, 3, and 4, respectively.

All the compared approaches are used as feature extractors, and the Softmax classifier is fitted in the last layer for final classification. For each run of each approach, the sample images were randomly split, with half used for training and half for testing. Each model run was independently repeated 10 times, and the mean value of the errors is reported in the results.
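The evaluation protocol (a random half-and-half split, repeated 10 times, with the mean error reported) can be sketched as follows. The callable `train_and_score` is a hypothetical stand-in for training one AE variant with its Softmax layer and returning the test error; the dummy scorer exists only so the sketch runs without a model.

```python
import numpy as np

def random_half_split(n_samples, rng):
    """Return disjoint train/test index arrays, each covering half of the data."""
    order = rng.permutation(n_samples)
    half = n_samples // 2
    return order[:half], order[half:2 * half]

def mean_error_over_runs(train_and_score, n_samples, n_runs=10, seed=0):
    """Repeat the split-train-test cycle n_runs times and average the test error.

    train_and_score is a hypothetical callable (train_idx, test_idx) -> error.
    """
    rng = np.random.default_rng(seed)
    errors = [train_and_score(*random_half_split(n_samples, rng))
              for _ in range(n_runs)]
    return float(np.mean(errors)), errors

# Stand-in scorer so the sketch runs without a trained model:
# it reports the fraction of odd indices in the test half.
def dummy_scorer(train_idx, test_idx):
    return float(np.mean(test_idx % 2))

mean_err, all_errs = mean_error_over_runs(dummy_scorer, n_samples=12000)
print(len(all_errs), mean_err)
```

Drawing a fresh split on every run means the reported mean reflects variation in both the data partition and the model initialization.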

7. Results and Discussion

In the experiments, the AE variants used a Softmax classifier to estimate the overall classification behavior of the models on the MNIST benchmark variant datasets. Figures 5, 6, and 7 depict the results for SAE on the three MNIST benchmark variant datasets. Similarly, Figures 8, 9, and 10 depict the results for DAE, and Figures 11, 12, and 13 depict the ROC curves for CAE. Across all the experiments depicted in these figures, a gradual decrease is seen in the overall classification accuracy of all the AE variants. The results of the AE variants are verified and validated in Table 1, which shows the same decrease, starting from the MNIST basic subset and continuing to the MNIST random-background digits. This decrease is attributable to the gradual increase in the complexity and noisiness of the datasets. Despite this complexity, the CAE outperformed both SAE and DAE in terms of class-level accuracy.

8. Conclusions

We first introduced the conventional AE model and its three most popular variants, namely, SAE, DAE, and CAE. Notably, the AE is a three-layered neural-network model with one input layer of n units connected to m hidden-layer units, such that 0 < m < n. The hidden layer is then connected to the final layer, which comprises nc units, such that n = nc. All three AE variants possess specific properties, which motivate researchers to use them for solving problems in modern machine learning. The main purpose of adding the sparse penalty term to the SAE cost function was to limit the average activation value of the hidden-layer neurons. The key purpose of DAE was to introduce the ability to reconstruct the original data from noisy data. The CAE-based training model could learn feature representations that were highly robust to minor noise and small changes in the training data. The classification results of the compared AE variants on the MNIST dataset showed that all the AE models performed satisfactorily. Notably, SAE had a faster execution time than DAE and CAE. The overall experimental results revealed that CAE outperformed SAE and DAE in terms of both training and testing errors.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.


The authors would like to thank the Ministry of Higher Education, Malaysia, and Universiti Tun Hussein Onn Malaysia for financially supporting this Research under Trans-disciplinary Research Grant vote no. T003.

Fig. 1. Architecture of the standard AE.
Fig. 2. MNIST small subset (basic).
Fig. 3. MNIST random-noise background digits (bg-rand).
Fig. 4. MNIST rotation and image background digits (bg-img-rot).
Fig. 5. ROC for the SAE based on a small subset (basic).
Fig. 6. ROC for the SAE based on random noise background digits (bg-rand).
Fig. 7. ROC for the SAE based on rotation and image background digits (bg-img-rot).
Fig. 8. ROC for the DAE based on MNIST small subset (basic).
Fig. 9. ROC for the DAE based on random noise background digits (bg-rand).
Fig. 10. ROC for the DAE based on rotation and image background digits (bg-img-rot).
Fig. 11. ROC for the CAE based on a small subset (basic).
Fig. 12. ROC for the CAE based on random noise background digits (bg-rand).
Fig. 13. ROC for the CAE based on rotation and image background digits (bg-img-rot).


Table 1. Performance evaluation of SAE, DAE, and CAE based on MNIST benchmark datasets

Approach      | Execution time | Training error | Test error | Execution time | Training error | Test error | Execution time | Training error | Test error
SAE + Softmax | 24m 25s        | 5.59           | 13.54      | 31m 25s        | 8.92           | 21.54      | 40m 25s        | 11.87          | 26.58
DAE + Softmax | 31m 20s        | 7.73           | 15.78      | 37m 20s        | 10.38          | 21.78      | 45m 20s        | 17.73          | 28.72
CAE + Softmax | 28m 55s        | 3.35           | 11.65      | 32m 55s        | 7.58           | 20.65      | 37m 55s        | 11.05          | 24.78

Algorithm 1

Parameters initialization
- No. of hidden layers: h
- Input feature set: [x1, x2, x3, …, xn]
- Encoding-activation function: EAF
- Decoding-activation function: DAF
- Input weights: Wi
- Bias values: bi

Encoding
- Compute encoded inputs f(w) by multiplying xn and Wi
- Compute biased inputs f(b) by adding bi to the encoded inputs
- Compute f(x) using Eq. (1) by applying f(w) and f(b)

Decoding
- Compute decoded outputs f(w′) by multiplying the hidden-layer output f(x) and W′i
- Compute biased outputs f(b′) by adding b′i to the decoded outputs
- Compute g(y) using Eq. (2) by applying f(w′) and f(b′)

Optimization
- Optimize the value of Eq. (5)
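The steps of Algorithm 1 can be sketched as a small training loop. This is a minimal NumPy illustration, assuming sigmoid activations for both phases, a mean squared reconstruction error with weight decay as the cost, and plain batch gradient descent as the optimizer. None of these choices, nor the learning rate, sizes, or toy data, are stated by the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, m, lam=1e-4, lr=0.5, epochs=200, seed=0):
    """Gradient-descent training of a one-hidden-layer AE on the rows of X."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    # Parameters initialization
    W, b = rng.normal(scale=0.1, size=(m, n)), np.zeros(m)     # encoder
    Wp, bp = rng.normal(scale=0.1, size=(n, m)), np.zeros(n)   # decoder
    history = []
    for _ in range(epochs):
        # Encoding: f(x) = s_e(W x + b)
        H = sigmoid(X @ W.T + b)
        # Decoding: g(y) = s_d(W' y + b')
        X_hat = sigmoid(H @ Wp.T + bp)
        # Cost: mean squared reconstruction error plus weight decay
        cost = (0.5 * np.sum((X_hat - X) ** 2) / N
                + 0.5 * lam * (np.sum(W ** 2) + np.sum(Wp ** 2)))
        history.append(cost)
        # Optimization: back-propagate, then take one gradient step
        dZ2 = (X_hat - X) / N * X_hat * (1.0 - X_hat)
        dZ1 = (dZ2 @ Wp) * H * (1.0 - H)
        Wp -= lr * (dZ2.T @ H + lam * Wp)
        bp -= lr * dZ2.sum(axis=0)
        W -= lr * (dZ1.T @ X + lam * W)
        b -= lr * dZ1.sum(axis=0)
    return (W, b, Wp, bp), history

rng = np.random.default_rng(1)
X = (rng.random((64, 16)) > 0.5).astype(float)   # toy binary data, not MNIST
params, history = train_autoencoder(X, m=8)
print(history[0], history[-1])
```

The returned encoder weights would then provide the features passed to a classifier such as Softmax.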

  1. Kim, B, and Lee, J (2018). A deep-learning based model for emotional evaluation of video clips. International Journal of Fuzzy Logic and Intelligent Systems. 18, 245-253.
  2. Wahid, F, Alsaedi, AKZ, and Ghazali, R (2019). Using improved firefly algorithm based on genetic algorithm crossover operator for solving optimization problems. Journal of Intelligent & Fuzzy Systems. 36, 1547-1562.
  3. Palvanov, A, and Cho, YI (2018). Comparisons of deep learning algorithms for MNIST in real-time environment. International Journal of Fuzzy Logic and Intelligent Systems. 18, 126-134.
  4. Aamir, M, Nawi, NM, Shahzad, A, Mahdin, H, and Rehman, MZ (2017). A new argumentative based reasoning framework with rough set for decision making. Proceedings of 2017 6th ICT International Student Project Conference (ICT-ISPC), Skudai, Malaysia, pp. 1-4.
  5. Nawi, NM, Khan, A, and Rehman, MZ (2015). An accelerated particle swarm optimized back propagation algorithm. Jurnal Teknologi. 77, 49-53.
  6. Song, J, Zhang, H, Li, X, Gao, L, Wang, M, and Hong, R (2018). Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Transactions on Image Processing. 27, 3210-3221.
  7. Aytekin, C, Ni, X, Cricri, F, and Aksu, E (2018). Clustering and unsupervised anomaly detection with L2 normalized deep auto-encoder representations. Proceedings of 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, pp. 1-6.
  8. Zulqarnain, M, Ishak, SA, Ghazali, R, Nawi, NM, Aamir, M, and Hassim, YMM (2020). An improved deep learning approach based on variant two-state gated recurrent unit and word embeddings for sentiment classification. International Journal of Advanced Computer Science and Applications. 11, 594-603.
  9. Liu, Y, Feng, X, and Zhou, Z (2016). Multimodal video classification with stacked contractive autoencoders. Signal Processing. 120, 761-766.
  10. Lv, N, Chen, C, Qiu, T, and Sangaiah, AK (2018). Deep learning and superpixel feature extraction based on contractive autoencoder for change detection in SAR images. IEEE Transactions on Industrial Informatics. 14, 5530-5538.
  11. Geng, J, Wang, H, Fan, J, and Ma, X (2017). Deep supervised and contractive neural network for SAR image classification. IEEE Transactions on Geoscience and Remote Sensing. 55, 2442-2459.
  12. Caterini, AL, Doucet, A, and Sejdinovic, D (2018). Hamiltonian variational auto-encoder. Advances in Neural Information Processing Systems. 31, 8167-8177.
  13. Ng, A (2011). Sparse autoencoder. CS294A Lecture Notes.
  14. Vincent, P (2011). A connection between score matching and denoising autoencoders. Neural Computation. 23, 1661-1674.
  15. Rifai, S, Mesnil, G, Vincent, P, Muller, X, Bengio, Y, Dauphin, Y, and Glorot, X (2011). Higher order contractive autoencoder. Machine Learning and Knowledge Discovery in Databases. Heidelberg: Springer, pp. 645-660
  16. Wang, Y, Yao, H, and Zhao, S (2016). Auto-encoder based dimensionality reduction. Neurocomputing. 184, 232-242.
  17. Zhang, L, Lu, Y, Wang, B, Li, F, and Zhang, Z (2018). Sparse auto-encoder with smoothed L1 regularization. Neural Processing Letters. 47, 829-839.
  18. Liu, W, Ma, T, Tao, D, and You, J (2016). HSAE: a hessian regularized sparse auto-encoders. Neurocomputing. 187, 59-65.
  19. Xu, J, Xiang, L, Liu, Q, Gilmore, H, Wu, J, Tang, J, and Madabhushi, A (2016). Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Transactions on Medical Imaging. 35, 119-130.
  20. Chauhan, N, and Choi, BJ (2019). Denoising approaches using fuzzy logic and convolutional autoencoders for human brain MRI image. International Journal of Fuzzy Logic and Intelligent Systems. 19, 135-139.
  21. Chen, M, Weinberger, KQ, Xu, Z, and Sha, F (2015). Marginalizing stacked linear denoising autoencoders. The Journal of Machine Learning Research. 16, 3849-3875.
  22. Tong, C, Li, J, Lang, C, Kong, F, Niu, J, and Rodrigues, JJ (2018). An efficient deep model for day-ahead electricity load forecasting with stacked denoising auto-encoders. Journal of Parallel and Distributed Computing. 117, 267-273.
  23. Xia, B, and Bao, C (2014). Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Communication. 60, 13-29.
  24. Zhang, S, Yao, L, and Xu, X (2017). AutoSVD++: an efficient hybrid collaborative filtering model via contractive autoencoders. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, pp. 957-960.
  25. Yang, Q, and Sun, F (2018). Small sample learning with high order contractive auto-encoders and application in SAR images. Science China Information Sciences. 61.
  26. Rifai, S, Vincent, P, Muller, X, Glorot, X, and Bengio, Y (2011). Contractive auto-encoders: explicit invariance during feature extraction. Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, pp. 833-840.
  27. Abadi, M, Barham, P, Chen, J, Chen, Z, Davis, A, and Dean, J (2016). TensorFlow: a system for large-scale machine learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, GA, pp. 265-283.
  28. Aamir, M, Wahid, F, Mahdin, H, and Nawi, NM (2019). An efficient normalized restricted Boltzmann machine for solving multiclass classification problems. International Journal of Advanced Computer Science and Applications. 10, 416-426.

Muhammad Aamir recently received his PhD in Information Technology from Universiti Tun Hussein Onn Malaysia. He received his Master's degree in Computer Science from City University of Science and Information Technology, Pakistan. He worked for two years at Xululabs LLC as a data scientist. He is currently working on research related to big data processing and data analysis. His fields of interest are data science, deep learning, and computer programming.

Nazri Mohd Nawi is a Professor at the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia. He obtained his PhD in Computer Science (Data Mining) from Swansea University, United Kingdom, and his MSc in Computer Science from the University of Technology Malaysia. He has published more than 78 articles in indexed journals and conference proceedings. His research interests are in the fields of data analysis, database systems, optimization methods, and data mining techniques using artificial neural networks.

Hairulnizam Bin Mahdin is an Associate Professor at the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia. He obtained his PhD in Computer Science from Deakin University, Australia, and his MSc and BS in Computer Science from the University Putra Malaysia. His research interests are in the fields of data management, database systems, the Internet of Things, and big data.

Rashid Naseem is from Landikotal, Khyber, KPK, Pakistan. He received the BCS degree in computer science from the University of Peshawar, Pakistan, in 2008 and the MPhil degree in computer science from the Quaid-i-Azam University, Pakistan, in 2011. He obtained his PhD in Information Technology from the Universiti Tun Hussein Onn Malaysia in February 2017. He is currently an Assistant Professor of Software Engineering at Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Mang Khanpur Road, Haripur, Pakistan. His research interests include software modularization, architecture recovery, data mining, and clustering techniques.

Muhammad Zulqarnain received his Bachelor's and Master's degrees in Computer Science and Information Technology from The Islamia University of Bahawalpur (IUB), Pakistan. He received his MPhil degree (Master of Philosophy) from the National College of Business Administration and Economics, Lahore, Pakistan. He is currently pursuing his PhD at Universiti Tun Hussein Onn Malaysia. His research interests are machine learning and deep learning for natural language processing and its applications.