International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(3): 229-243
Published online September 25, 2023
https://doi.org/10.5391/IJFIS.2023.23.3.229
© The Korean Institute of Intelligent Systems
Faculty of Information Technology, Hanoi Open University, Hanoi, Vietnam
Correspondence to :
Duong Thang Long (duongthanglong@hou.edu.vn)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Multi-task convolutional neural networks (MTCNN) have garnered substantial attention for their impressive feats in image classification. However, MTCNNs tailored for the dual tasks of face recognition and facial expression recognition remain relatively scarce. These networks often entail complexity and substantial computational demands, rendering them unsuitable for practical systems constrained by computational limitations. Addressing this gap, the present paper proposes an efficient MTCNN model that capitalizes on cutting-edge architectures, namely residual networks and dense-connected networks, to adeptly execute face recognition and facial expression recognition tasks. The model’s efficacy is rigorously assessed across well-known datasets (JAFFE, CK+, OuluCASIA, KDEF) as well as images collected from learners (HOUS22). The proposed model consistently attains high accuracy levels across all datasets, with a notable minimum accuracy of 99.55% on testing data. These outcomes stand as a testament to the model’s remarkable performance, particularly in relation to its streamlined design. Moreover, the model is seamlessly integrated with an online learning management system, furnishing a versatile means of monitoring learning activities to enhance the overall training quality.
Keywords: Multi-task convolutional neural network, Residual networks, Dense-connected networks, Face recognition, Facial expressions recognition, Online learning management systems
Face recognition (FR) and facial expression recognition (FER) pose significant challenges in the field of computer vision. Convolutional neural networks (CNN) have emerged as the go-to choice for addressing these challenges, with prominent models like VGGNet, GoogLeNet, ResNet, SENet, and DenseNet attaining remarkable outcomes [1–6]. VGGNet presents a straightforward architecture, while ResNet incorporates residual blocks (depicted in Figure 1(a)) featuring shortcut connections that mitigate the vanishing gradient problem and empower the network to learn residual attributes [7]. InceptionNet (GoogLeNet) adopts a modular architecture, employing multiple parallel branches to capture diverse scales and aspect ratios of features within input images. DenseNet employs densely connected blocks (illustrated in Figure 1(b)) that concatenate layer outputs, resulting in a comprehensive feature representation [8]. However, the challenges in FR and FER endure due to the inherent diversity in facial expressions among individuals and the temporal variability of such expressions. Moreover, present CNN models are often tailored separately for each problem and can entail computational complexity attributed to an extensive parameter count.
In recent times, e-learning has seen substantial growth and popularity due to its adaptable and personalized learning approach, cost-effectiveness, and proven efficacy compared with traditional education. However, ensuring the quality of online activities through monitoring and evaluation is crucial to reduce cheating and uphold a high-quality education system. Prior research has utilized biometrics for identification and monitoring during online learning and testing sessions [9–11], yet a system for face recognition and detecting facial expressions would be more user-friendly, considering that most computers or user devices come with built-in cameras for online learning. Moreover, the FR/FER system can consistently identify learners throughout their learning journey and track students’ learning interactions in real time within the learning management system (LMS), enabling educators and administrators to more effectively address each learner’s training process [4, 12, 13].
This study introduces an innovative approach to tackle the challenging task of simultaneous face recognition and facial expression recognition, achieved through a multi-task CNN (MTCNN) model. The proposed model blends the ResNet and DenseNet architectures, offering several advantages like reducing network parameters, enhancing generalization capabilities, capturing spatial and semantic image information, boosting accuracy, and achieving quicker and more stable convergence during training. The model’s performance is assessed on four well-known datasets, CK+ [14], OuluCASIA [15], JAFFE [16], and KDEF [17], along with data collected from learners at Hanoi Open University (HOUS22). The results convincingly showcase the effectiveness and superiority of the proposed model when compared to other existing approaches. Additionally, this study devises an integration of the model with an LMS, facilitating the monitoring and evaluation of online learning activities—an imperative aspect for upholding educational quality.
The process of feature extraction plays a pivotal role in tasks related to image recognition, exerting significant influence on recognition accuracy. Traditional techniques, such as histogram of oriented gradients (HOG), local binary pattern (LBP), Gabor features, and Haar features, are mentioned in [18, 19]. These methods exhibit commendable performance on uncomplicated and uniform datasets. Nevertheless, real-world datasets tend to be intricate and heterogeneous, encompassing diverse variations in facial expressions including tone, pose, angle, brightness, and more. These variations present considerable obstacles for conventional approaches. Consequently, CNN models have been extensively employed in research endeavors targeting both FR and FER challenges, owing to their elevated accuracy and versatility across various applications [1, 7, 18].
MTCNN stands as a category of deep learning CNN model that holds potential in enhancing recognition quality through the incorporation of several associated tasks [1, 20, 21]. These models implement parameter sharing to unearth shared features among recognition challenges via deep convolutional layers. In MTCNN models, there exist two forms of parameter sharing: hard-sharing and soft-sharing. In the hard-sharing approach, a shared backbone network architecture is employed for feature extraction, succeeded by independent classifiers for each recognition task (Figure 2(a)). The soft-sharing technique deploys distinct network architectures for the extraction of features specific to each problem, while interconnecting convolutional layers for parameter sharing (Figure 2(b)).
Ma et al. [22] designed a MTCNN with hard-sharing parameters using the VGG architecture comprising 11 convolutional layers to assign taxonomies to human viruses. The model performs dual tasks: taxonomic assignment and assigning the genomic region for the designated taxonomic label. The species label predicted by the first task is combined with the model’s final layer and employed as input for the second task. The model attains a flawless prediction accuracy of 100% when tested with simulated divergence from the HIV-1 dataset. Liu et al. [23] developed a CNN model with soft-sharing parameters and cross-layers connection for detecting maritime targets. This model utilizes 16 shared core convolutional layers for feature extraction, underpinned by the VGG network architecture, yielding a prediction accuracy of 92.3% for their MRSP-13 dataset. Viet and Bao [24] engineered an MTCNN with hard-sharing parameters to recognize gender, smiley status, and facial expressions, drawing on a VGG architecture incorporating 9 convolutional layers and 3 fully connected layers. Mao et al. [21] proposed another MTCNN model with hard-sharing parameters targeting facial attribute recognition, leveraging the ResNet architecture to achieve average accuracies of 91.7% and 86.56% on the Celeb-A and LFWA datasets, respectively.
Generally, researchers frequently employ hard-sharing parameters when crafting MTCNN models for image recognition, often integrating cutting-edge architectures as backbone networks for feature extraction. However, the intricacy of such models can lead to decreased recognition accuracy in certain scenarios [24]. To facilitate broad acceptance, it becomes crucial to develop efficient models that strike a balance between accuracy and practicality, particularly in contexts like monitoring online learning activities.
This section presents the design of the proposed MTCNN model, known as MTFace, which undertakes the tasks of FR and FER. The MTFace model encompasses two pivotal stages: (1) extracting features from images to depict facial identifiers and expressions, and (2) categorizing the extracted features into their respective labels for each task. The efficacy of recognition and computational intricacy of models typically hinge on the quantity of filters and network depth. Researchers often calibrate these factors to align with specific application requisites, aiming for heightened recognition accuracy and reasonable computational complexity. The MTFace model is fashioned with a moderate layer count and an appropriate filter tally in each convolutional layer, ensuring compatibility with computational resources and facilitating broad applicability.
The MTFace model employs residual architectures in its initial layers, enhancing the network’s ability to preserve original image details and capture low-level features. These layers function as hard-sharing parameters, where the extracted low-level features from input images flow into subsequent layers for both FR and FER tasks. For subsequent layers, dense connection architectures are employed to capture high-level features. These layers bolster the richness and effectiveness of features [25, 26]. Hence, the MTFace model encompasses two feature extraction phases: the initial one involves raw and low-level feature extraction via residual blocks, while the subsequent one refines the extracted features, generating increasingly sophisticated high-level features through densely connected blocks. In the second phase, two sequences of consecutive blocks are formulated as soft-sharing parameters, signifying the separation of refined and high-level features for the individual tasks.
In this model, residual blocks (RBs) are harnessed to amplify feature extraction capabilities. An RB comprises three convolutional layers, with each convolutional operation (CV) followed by batch normalization (BN) and ReLU activation. The central layer employs a 3 × 3 convolution, while the other two employ 1 × 1 convolutions. To align the input of the block with the number of filters in the last convolutional layer, a 1 × 1 convolutional operation is applied to the shortcut connection. Subsequently, an element-wise summation (⊕) combines the shortcut and the output of the three convolutional layers, followed by ReLU activation. This auxiliary pathway conserves existing information in the inputs, channeling it into the network’s deeper layers. Figure 3 provides a visual representation of the RB design. The operation of this block can be represented formally as

y = ReLU(F(x) ⊕ S(x)),

where x denotes the block input, F(x) the output of the three convolutional layers (each with BN and ReLU), and S(x) the 1 × 1 convolution applied to the shortcut connection.
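To make the block structure concrete, the following is a minimal tf.keras sketch of an RB as described above; the bottleneck filter ratio follows Table 1, while padding, strides, and initializers are assumptions of this sketch rather than details taken from the paper.

```python
from tensorflow.keras import layers

def residual_block(x, filters_out):
    """Residual block: 1x1 -> 3x3 -> 1x1 convolutions (BN + ReLU after each),
    a 1x1 projection on the shortcut, element-wise summation, then ReLU."""
    bottleneck = filters_out // 4          # e.g., 32 filters when filters_out = 128 (Table 1)
    y = x
    for kernel, filters in ((1, bottleneck), (3, bottleneck), (1, filters_out)):
        y = layers.Conv2D(filters, kernel, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
    shortcut = layers.Conv2D(filters_out, 1, padding="same")(x)  # match the filter count
    return layers.ReLU()(layers.Add()([shortcut, y]))            # summation followed by ReLU
```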
For densely connected blocks (DBs), a sequence of convolutional layers using 1 × 1 and 3 × 3 kernels is employed. The initial layer in each pair contains four times the number of filters of the subsequent layer. Prior to each convolutional operation, the input feature maps undergo BN and ReLU activation. At the conclusion of each pair, a concatenation ([,]) of all preceding feature maps takes place, not only of the feature maps from the immediately prior layer. To reduce computational complexity, transition layers (TLs) are introduced after each DB to diminish the filter count. These TLs comprise a 1 × 1 convolutional operation, ReLU activation, and 2 × 2 average pooling (P). The operation of the ℓ-th pair can be written as

xℓ = Hℓ([x0, x1, …, xℓ−1]),

where Hℓ denotes the BN-ReLU-convolution operations of the ℓ-th pair and [x0, x1, …, xℓ−1] is the concatenation of the block input and the feature maps produced by all preceding pairs.
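A corresponding sketch of a DB pair and its transition layer is given below, under the same caveats: the pre-activation ordering, the 4× bottleneck, and the 1 × 1 convolution plus 2 × 2 average pooling follow the text, while the remaining settings are choices of this sketch.

```python
from tensorflow.keras import layers

def conv_pair(x, growth):
    """One DB pair: BN -> ReLU -> 1x1 conv (4 * growth filters),
    then BN -> ReLU -> 3x3 conv (growth filters)."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(4 * growth, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    return layers.Conv2D(growth, 3, padding="same")(y)

def dense_block(x, growth, repetitions):
    """Each pair receives the concatenation of all preceding feature maps."""
    features = [x]
    for _ in range(repetitions):
        inp = features[0] if len(features) == 1 else layers.Concatenate()(features)
        features.append(conv_pair(inp, growth))
    return layers.Concatenate()(features)

def transition_layer(x, filters):
    """Transition layer: 1x1 convolution, ReLU, and 2x2 average pooling."""
    y = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    return layers.AveragePooling2D(pool_size=2, strides=2)(y)
```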
In its entirety, the proposed MTFace model encompasses an initial trio of RBs for hard-sharing parameters, succeeded by two sequences of consecutive DBs. Each DB sequence caters to an independent FR/FER task as soft-sharing parameters. Notably, the preceding DBs in the two tasks are concatenated and linked to the ensuing DBs. These blocks exhibit variations in filter counts and convolutional layer repetitions within the DBs. To effectively capture raw and low-level features from the RBs, a higher count of filters is engaged in comparison to the DBs. The RBs employ 128, 256, and 512 filters, while the DBs employ 32, 64, and 128 filters. Subsequent refinement of the extracted features takes place within the DBs through pairs of convolutional layers, with repetitions of 2, 4, and 8. Importantly, the features derived from the RBs are preserved and propagated through the model’s end via concatenations within the DBs to facilitate classification. Within each task, global average pooling amalgamates and flattens the extracted features, followed by a ‘softmax’ activated fully connected layer at the model’s terminus, for corresponding FR or FER tasks.
As the network advances through its layers, it progressively uncovers abstract and representative features of higher-level structures within the input image. This hierarchical process of feature extraction stands as a pivotal attribute of deep CNNs, underpinning their capacity to acquire robust representations for image classification objectives. The outset of the model employs a convolutional layer coupled with a pooling layer to directly draw raw and low-level features from input images. These initial layers typically deploy medium-sized kernels and smaller filter counts, enabling the capture of features like edges, textures, and shapes. This primary block, referred to as the first block, employs a 7 × 7 convolution with 64 filters followed by 3 × 3 average pooling with stride 2. Figure 5 offers an illustrative overview of the entire model, wherein numbers within RBs indicate filter counts, and numbers within DBs signify filter counts and repetitions, correspondingly.
The MTFace model encompasses a total of 72 convolutional layers, distributed across 10 blocks, functioning as feature extractors. Given the inclusion of two branches for distinct tasks, the model attains a depth of 42 convolutional layers. Despite its notable layer count, the model’s overall complexity remains relatively modest, with approximately 16.5 million parameters. This renders it less intricate compared to contemporary CNN models devised for image classification challenges. The model attains its low parameter count by employing compact kernel sizes for the convolutional operations. Detailed parameters for the model, tailored to a specific input size of height × width × channels (110 × 90 × 3), are outlined in Table 1. Here, the symbol ⊗ denotes repetition within DBs, “C” signifies convolutional operations with designated kernel size, strides, and filter count, “P” indicates pooling operations with specified window size and strides, and “ → ” symbolizes the forward linkage between two layers. Notably, parameters for operations within DBs are counted twice (× 2) because of the twin sequences corresponding to the FR and FER tasks.
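To illustrate how these pieces might be wired together, the following sketch assembles the model using the residual_block, dense_block, and transition_layer helpers from the earlier sketches. The input size, filter counts, and repetitions follow the text and Table 1; the exact cross-branch concatenation between the two DB sequences is an interpretation of the description above, so the resulting per-layer parameter counts will not reproduce Table 1 exactly.

```python
from tensorflow.keras import layers, Model

def build_mtface(num_ids, num_expressions, input_shape=(110, 90, 3)):
    inputs = layers.Input(shape=input_shape)
    # First block: 7x7 convolution with 64 filters (stride 2), 3x3 average pooling (stride 2)
    x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(inputs)
    x = layers.AveragePooling2D(pool_size=3, strides=2, padding="same")(x)
    # Shared residual blocks (hard-sharing): 128, 256, 512 filters
    for filters in (128, 256, 512):
        x = residual_block(x, filters)
    # Two sequences of dense blocks (soft-sharing): growth 32/64/128, repetitions 2/4/8
    fr_in = fer_in = x
    for growth, reps, trans_filters in ((32, 2, 32), (64, 4, 64), (128, 8, None)):
        fr_out = dense_block(fr_in, growth, reps)
        fer_out = dense_block(fer_in, growth, reps)
        if trans_filters is not None:
            # Cross-branch link: the ensuing DBs see features from both tasks
            merged = layers.Concatenate()([fr_out, fer_out])
            fr_in = transition_layer(merged, trans_filters)
            fer_in = transition_layer(merged, trans_filters)
    # Per-task heads: global average pooling followed by a softmax fully connected layer
    fr_head = layers.Dense(num_ids, activation="softmax", name="fr")(
        layers.GlobalAveragePooling2D()(fr_out))
    fer_head = layers.Dense(num_expressions, activation="softmax", name="fer")(
        layers.GlobalAveragePooling2D()(fer_out))
    return Model(inputs, [fr_head, fer_head], name="mtface_sketch")
```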
For the purpose of fortifying the model’s resilience and curbing overfitting during training, 2D image processing strategies are harnessed to augment the training images, aligning with recommendations from earlier studies [4, 13]. This augmentation incorporates various techniques such as noise addition, rotation, flipping, cropping, shifting, and color adjustments, among others. These maneuvers amplify the training data’s diversity and improve the model’s capacity to withstand disparities in input images, encompassing alterations in styles, lighting, orientations, and viewpoints. When presented with a face image I, an augmented image is generated as

I′ = T(I, θ),

where T denotes the augmentation operation involving parameters θ (for example, the noise variance, rotation angle, shift, scale, and flip).
Nonetheless, meticulous consideration is given to the selection of parameters for these augmented operations, ensuring the preservation of crucial facial expression and face-related information for feature extraction. Overly aggressive rotation or shifting of an image could potentially lead to the loss of vital facial expression details, impeding the model’s capacity to accurately extract pertinent features for recognition. Referring to Figure 6, the third augmented image in the final row showcases excessive shifting, while the initial augmented image in the middle row displays an excessive amount of noise, both of which might lead to information loss. In this study, parameter values are randomly chosen within appropriate ranges for each enhancement operation, and on occasion, an image might undergo multiple enhancement operations simultaneously.
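As an illustration, the following Keras preprocessing pipeline approximates the augmentation described above with the ranges of Table 3; the specific layers, their order, and the fixed noise standard deviation are choices of this sketch, not the authors' implementation.

```python
import math
import tensorflow as tf

# Augmentation pipeline; the random layers are active only at training time.
augmenter = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    # RandomRotation expects a fraction of a full turn, so ~0.1 rad -> 0.1 / (2 * pi)
    tf.keras.layers.RandomRotation(0.1 / (2 * math.pi), fill_mode="nearest"),
    tf.keras.layers.RandomTranslation(0.1, 0.1, fill_mode="nearest"),   # +/-10% shift
    tf.keras.layers.RandomZoom(0.1, 0.1, fill_mode="nearest"),          # +/-10% scale
    tf.keras.layers.GaussianNoise(math.sqrt(0.1)),  # stand-in for the variance range [0, 0.1]
])

# Example use inside a tf.data pipeline of (image, labels) batches:
# dataset = dataset.map(lambda x, y: (augmenter(x, training=True), y))
```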
In real-world scenarios, input images are frequently captured using user devices like webcams or cameras, and they may encompass backgrounds featuring multiple objects. Therefore, it becomes imperative to identify faces within the images and eliminate extraneous backgrounds. A widely recognized model for this purpose is the multi-task cascaded CNN face detector, also abbreviated MTCNN [27], a solution that finds extensive application.
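For illustration, a face can be detected and cropped with the widely used mtcnn Python package; the choice of this particular implementation and the largest-face heuristic are assumptions of this sketch.

```python
import cv2                  # pip install opencv-python
from mtcnn import MTCNN     # pip install mtcnn

detector = MTCNN()

def crop_largest_face(bgr_image, target_size=(90, 110)):
    """Detect faces, keep the largest one, and resize it to the model input size."""
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    detections = detector.detect_faces(rgb)
    if not detections:
        return None
    x, y, w, h = max(detections, key=lambda d: d["box"][2] * d["box"][3])["box"]
    x, y = max(x, 0), max(y, 0)              # detected boxes can slightly exceed the image
    face = rgb[y:y + h, x:x + w]
    return cv2.resize(face, target_size)     # cv2.resize expects (width, height)
```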
In practical scenarios, an amalgamation of the MTFace model and an LMS is designed to oversee and appraise online learning undertakings. This integration is realized through application programming interfaces (APIs) bridging the two systems, as expounded upon in [4]. The MTFace model is introduced as a module that can be executed on client devices, such as web clients or mobile applications. Learners access the LMS by providing their learning account credentials (identifier and password), enabling authentication for learning activities. The system prompts them to activate their camera or webcam, capturing videos or snapshots of their learning engagements for both FR and FER tasks. While the FR task aims to ascertain and authenticate learners across the learning journey, the FER task assesses their emotional states.
A monitoring period, perhaps set at 5 seconds, can be established, during which a captured image is relayed to the MTFace model for FR and FER. To commence, learners are required to contribute face images to facilitate model training. Once trained, the model assumes the role of recognizing faces and facial expressions within images documenting online learning endeavors. The outcomes of these FR and FER are archived in a historical record, serving as valuable inputs for evaluating learning caliber, assessing educational content, and gauging lecturer performance. Informed by these outcomes, notifications can be disseminated to learners, lecturers, and administrators, furnishing insights to tailor their actions and enhance the quality of education.
To ensure a secure connection, the LMS initiates each session by transmitting a message containing the user ID and security code to the MTFace system. A depiction of this interconnected setup and the operational progression of this integrated system is depicted in Figure 7.
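The client-side flow might look like the following sketch, which captures a frame every monitoring period and posts it together with the user ID and security code; the endpoint URL, field names, and response format are hypothetical placeholders rather than the actual API.

```python
import time
import cv2
import requests

MTFACE_URL = "https://lms.example.org/mtface/api/recognize"   # hypothetical endpoint
MONITOR_PERIOD_S = 5                                          # monitoring period, e.g. 5 s

def monitor(user_id: str, security_code: str) -> None:
    camera = cv2.VideoCapture(0)
    try:
        while True:
            ok, frame = camera.read()
            if ok:
                _, jpeg = cv2.imencode(".jpg", frame)
                response = requests.post(
                    MTFACE_URL,
                    data={"user_id": user_id, "security_code": security_code},
                    files={"image": ("frame.jpg", jpeg.tobytes(), "image/jpeg")},
                    timeout=10,
                )
                # The service would return FR/FER labels to be logged by the LMS
                print(response.json())
            time.sleep(MONITOR_PERIOD_S)
    finally:
        camera.release()
```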
The integration of the system imposes minimal alterations on an existing LMS to establish a connection with the MTFace model, thus endowing it with a high degree of autonomy. The LMS retains the capability to function independently, devoid of obligatory ties to the model. Upon integration with the MTFace model, the LMS becomes the recipient of FR and FER outcomes throughout the entirety of the learning trajectory. These outcomes are subsequently employed to synthesize and evaluate online learning engagements, facilitating the dispatch of notifications to learners or educators as warranted, ultimately contributing to the enhancement of education quality. This seamless compatibility underscores the ease of incorporating the MTFace model within any pre-existing LMS framework.
This section encompasses the presentation of datasets and parameters employed for MTFace model training, coupled with an extensive analysis of training outcomes. Additionally, a real-time demonstration experiment is executed to showcase the effectiveness of the proposed model, drawing comparisons with other established methodologies within the domain. In the course of experimentation, five distinct datasets were brought into play: JAFFE [16], CK+ (Extended Cohn-Kanade) [14], OuluCASIA [15], KDEF [17], and a compilation of images sourced from Hanoi Open University students (HOUS22).
The JAFFE dataset incorporates 213 images featuring 10 Japanese women, each portraying 6 primary emotions alongside a ‘neutral’ emotion. Comprising 981 images from 118 subjects, the CK+ dataset illustrates 6 primary emotions along with the ‘contempt’ emotion. Presenting 1,440 images of 80 individuals, the OuluCASIA dataset captures 6 fundamental facial expressions against diverse illumination and head poses, rendered in color. Notably, both the JAFFE and CK+ datasets adopt a grayscale format. Housing 4,900 images, the KDEF dataset encompasses 6 core emotions plus the ‘neutral’ expression, captured from 5 distinct angles. It encompasses imagery of 70 individuals (35 females and 35 males) between the ages of 20 and 30, characterized by a lack of occlusions like mustaches, earrings, and eyeglasses. The compiled HOUS22 dataset showcases color images drawn from online learners at Hanoi Open University, captured under real-world conditions. This collection involves a total of 21 participants, spanning 13 males and 8 females, and features two facial expressions: “Positive” and “Negative”. A detailed breakdown of label numbers in the FR and FER tasks, total image counts, and image distribution across class categories for these datasets is available in Table 2. Among the array, CK+ and KDEF maintain balance, displaying an equal number of images per class, whereas JAFFE, OuluCASIA, and HOUS22 exhibit imbalance, with variances in image distribution per class. Table 2 also highlights the minimum, maximum, and average image counts per class across these datasets.
Figure 8 offers a selection of sample images sourced from the five datasets, serving to visually elucidate the context. The initial three datasets, namely JAFFE, CK+, and OuluCASIA, were amassed within controlled settings, rendering them comparatively straightforward to differentiate among classes in FER. The KDEF dataset, similarly procured under controlled conditions, introduces images captured from various angles, encompassing the left and right sides of the face, exemplified in the first, fourth, and fifth images of Figure 8(d). In sharp contrast, the HOUS22 dataset was amassed without specific constraints imposed on participants, thereby introducing instances where discerning between classes proves intricate. Notably, the second and third images in Figure 8(e) exemplify the challenge of distinguishing the “Positive” and “Negative” expressions intuitively within the HOUS22 dataset.
The experimental framework adopts a 5-fold cross-validation approach. The dataset is randomly partitioned into 5 equally sized folds. Within each iteration, one fold assumes the role of the test set, while the remaining four folds are used for training and validation. The data augmentation ranges and training hyperparameters (learning rate, batch size, and number of epochs) are listed in Table 3.
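A minimal sketch of such a split follows, assuming scikit-learn's StratifiedKFold and a fixed random seed, neither of which is specified in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold_splits(images: np.ndarray, identity_labels: np.ndarray):
    """Yield (train_idx, test_idx) index pairs; one fold is held out as the test set.
    Stratifying on the identity labels keeps class proportions similar per fold."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    yield from skf.split(images, identity_labels)
```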
The experimentation took place on a computer system equipped with a TPU and 32 GB RAM. The proposed model was implemented in the Python programming language on the TensorFlow platform, a prevalent deep learning framework renowned for its robust capabilities in image processing and CNN modeling.
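The training configuration of Table 3 might translate into Keras calls as sketched below, reusing the build_mtface sketch above; the choice of the Adam optimizer, equal loss weights, and one-hot labels are assumptions, since the paper specifies only the learning rate, batch size, and epoch count. The arrays x_train, y_fr, y_fer (and their validation counterparts) are placeholders for the prepared data.

```python
import tensorflow as tf

# e.g., CK+ label counts: 118 identities, 7 expressions
model = build_mtface(num_ids=118, num_expressions=7)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),   # learning rate from Table 3
    loss={"fr": "categorical_crossentropy", "fer": "categorical_crossentropy"},
    loss_weights={"fr": 1.0, "fer": 1.0},
    metrics={"fr": "accuracy", "fer": "accuracy"},
)

# x_train: (N, 110, 90, 3) images; y_fr / y_fer: one-hot label matrices
history = model.fit(
    x_train, {"fr": y_fr, "fer": y_fer},
    validation_data=(x_val, {"fr": y_fr_val, "fer": y_fer_val}),
    batch_size=128,      # from Table 3
    epochs=150,          # from Table 3
)
```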
Figure 9 showcases the evaluation of training and model validation accuracies across 5 runs. Each subfigure delineates the accuracies of both the training data (represented by a solid line) and the validation data (depicted by a dashed line). The distinctive traits of the JAFFE dataset, including the limited image count and facial expression similarity across classes, contributed to the observed fluctuations in validation accuracies during the training phase. Evidently, commendable accuracy levels were achieved on the training data from roughly the 20th epoch onward.
The recognition outcomes on the testing data from the trained model were computed through averaging across 5 runs, as depicted in Table 4. Analogous to the training progression, the FER task consistently yielded higher results compared to the FR task across all datasets. This divergence could potentially stem from the greater number of classes within the FR task in contrast to the FER task, leading to a reduced quantity of images per class for the former. Notably, even in the case of the KDEF dataset, this difference reaches up to 20-fold. The OuluCASIA dataset emerged with the highest accuracy, reaching a perfect 100% for both tasks, whereas the HOUS22 dataset achieved matching accuracies of 99.69% for both tasks. Remaining datasets exhibited variances between the FR and FER tasks, with disparities spanning from a minimal 0.08% differential in KDEF to a maximum 0.45% variation in JAFFE. This insight indicates that neither the FR nor the FER task demonstrates notably heightened complexity compared to the other. Nonetheless, it does underscore the existence of certain images that pose considerable challenges in terms of recognition.
Figure 10 showcases nine instances of misclassified images extracted from the testing data of the JAFFE, CK+, KDEF, and HOUS22 datasets. Notably, the OuluCASIA dataset demonstrated error-free performance on the testing data for both the FR and FER tasks. Each image is accompanied by a title denoting its target class (t#) and the class predicted (p#) by the model.
Figure 11 presents a visual representation of the model’s feature extraction process on input images through the utilization of the Grad-CAM method [28]. The heatmaps, emanating from the final convolutional layers of each task, are depicted in Figure 11(a)–(e) corresponding to the JAFFE, CK+, OuluCASIA, KDEF, and HOUS22 datasets, respectively. This gradient-oriented localization technique discerns pivotal regions within input images, such as the mouth, nose, and eyes, that hold significance for conveying facial identity and expressions. The visualization underscores the MTFace model’s emphasis on these specific areas to aptly extract informative features. Predominantly, the heatmaps converge on the mouth, nose, and eyes, exhibiting nuanced variations across the facial canvas. This trend signifies that the convolutional layers of the MTFace model prioritize these regions when extracting features tied to facial identity and expressions; overlooking them could result in inaccurate recognition.
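A compact sketch of how such Grad-CAM heatmaps can be computed for one task head is given below; the layer and output names refer to the build_mtface sketch above and are assumptions about the actual model.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, output_name="fer", class_index=None):
    """Return a [0, 1] heatmap for `image` (H x W x 3, float) from the given conv layer."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.get_layer(output_name).output],
    )
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        conv_out, predictions = grad_model(x)
        if class_index is None:
            class_index = int(tf.argmax(predictions[0]))
        score = predictions[:, class_index]
    grads = tape.gradient(score, conv_out)                  # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)  # weighted sum of feature maps
    cam = tf.nn.relu(cam)
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()      # normalized to [0, 1]
```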
Table 5 furnishes a comprehensive juxtaposition of the experimental outcomes attained by the proposed model with those of other studies [1, 2, 4, 13, 29–31]. All entries in this comparison used CNN-based models, encompassing a spectrum of data scenarios, succinctly annotated within parentheses adjacent to the corresponding method. The best cases reported in [1] are selected, as that study encompassed a multiplicity of experimental scenarios. The symbol “-” denotes instances wherein experimental results remain unreported. Notably, the model dimensions encompass both the convolutional layer count, signifying the model’s depth, and the tally of model parameters, separated by the “/” symbol. The preeminent accuracies within the FR and FER tasks are underscored in bold font. Remarkably, the MTFace model notched a flawless 100% accuracy in four specific scenarios: the FER task for the JAFFE, CK+, and OuluCASIA datasets, alongside the FR task for the OuluCASIA dataset. Despite exhibiting a moderate complexity, characterizing it as the second least intricate among the methodologies considered, and featuring the fourth lowest parameter count, the MTFace model emerged triumphant across nine instances within both FR and FER tasks. For additional comparison, a baseline built on the cutting-edge ResNet50V2 architecture [32] is included, serving as a feature extraction backbone with hard-sharing parameters for both FR and FER tasks; its classification heads are the same as those of the MTFace model. This baseline was trained with the same 5-fold data splits, augmentation, and parameters as the MTFace model.
Limited reporting of results was observed for the FR task across the various datasets, with most authors emphasizing the single FER task. Only [30] reported outcomes for both FR and FER tasks, showcasing performances exceeding or matching 99.00% in the CK+ and OuluCASIA datasets. These outcomes, while ranking second in the comparison, slightly trailed those of the proposed model, obtaining 99.00% and 99.14% accuracy for the FR task in CK+ and OuluCASIA, respectively. Three instances spotlight the ResNet50V2-based model, which secured the second-highest accuracy: the FR and FER tasks within the HOUS22 dataset and the FR task in the KDEF dataset. The MTFace model reported its lowest accuracy of 99.55% for the FR task in the JAFFE dataset, while the ResNet50V2 model demonstrated its lowest at 86.88% for the FER task in the KDEF dataset. Notably, the OuluCASIA dataset documented two instances of immaculate 100% FER task accuracy, for the proposed model and [4]. Evidently, the OuluCASIA dataset’s robust outcomes reflect stringent image capture conditions that mitigate pose and illumination deviations.
Table 5 discloses a compelling finding: despite having about 30.4% fewer parameters than the ResNet50V2-based model (approximately 23.7 million parameters), the MTFace model yields commensurate or superior recognition results on test data. This implies that the proposed model efficiently harnesses hard-sharing parameters for low-level feature extraction, while soft-sharing parameters handle high-level feature extraction for the distinct tasks. Furthermore, augmenting the training data with additional augmented images enhanced the model’s recognition accuracy. Although most compared studies focus on the single FER task, the MTFace model consistently attained superior accuracies, a result possibly attributable to the augmented training data fostering more effective models with enhanced recognition capabilities.
In this study, a novel multi-task CNN-based model was introduced for FR and FER processes. Leveraging two cutting-edge architectures, namely residual networks and densely connected networks, the proposed model strives for heightened accuracy while maintaining a manageable level of complexity. The model comprises an initial stage featuring three RBs, catering to hard-sharing parameters for extracting low-level features. Subsequently, the second stage encompasses two distinct sequences, each comprising three DBs, geared towards extracting high-level features. With a total of 72 convolutional layers serving as adept feature extractors, the model achieves a balance between layers and parameters. Remarkably, despite its relatively generous layer count, the model’s parameter count remains modest, at approximately 16.5 million, achieved by judicious use of small kernel sizes for convolutional operations.
Experimental validation was conducted using prominent datasets alongside collected images, employing a 5-fold data scenario. The proposed model achieved remarkable accuracy, ranging from 99.55% to 100% on testing data for both FR and FER tasks across all datasets. This resounding performance positions it as the top contender in nine out of 14 instances, as demonstrated in Table 5. The practical viability of this accuracy is noteworthy, and while current computational constraints limited the training depth and dataset size, future iterations hold potential for even higher achievements. The proposed model holds promise for real-world application, boasting a suitable level of complexity that aligns well with systems constrained by computational resources. Its seamless integration into existing learning management systems offers flexibility and user-friendliness, facilitated by a client version that mitigates image transfers and network bandwidth demands. The resultant integrated system empowers real-time monitoring of online learners’ activities and emotional responses, equipping educators with actionable insights to tailor their approaches and elevate the learning experience.
Looking forward, future endeavors could prioritize refining and enhancing the efficiency of multi-task CNN-based models through the incorporation of established network architectures. Expanding the scope of the model to encompass supplementary tasks, such as head-pose estimation, presents an intriguing avenue for exploration. Moreover, an avenue of potential growth lies in evaluating the model’s performance on larger, more diverse datasets to unlock even more impressive results.
No potential conflict of interest relevant to this article was reported.
Figure 9. The accuracies of training data and validation data from training histories in (a) JAFFE, (b) CK+, (c) OuluCASIA, (d) KDEF, and (e) HOUS22 datasets.
Figure 10. Misclassified images extracted from the testing data of the JAFFE, CK+, KDEF, and HOUS22 datasets.
Figure 11. Heatmaps of the last convolutional layers for both FR and FER tasks on images from (a) JAFFE, (b) CK+, (c) OuluCASIA, (d) KDEF, and (e) HOUS22 datasets.
Table 1. Parameters of proposed MTFace model.
Layer/block | Operations (kernel size-stride, filter) | Parameter (thousand) |
---|---|---|
Input | - | - |
1st convolutional layer | C(7 × 7–2, 64) | 9.5 |
1st residual block | C(1 × 1–1, 32) → C(3 × 3–1, 32) → C(1 × 1–1, 128) | 2.1 → 9.3 → 4.2 |
Shortcut connection | C(1 × 1–1, 128) | 8.3 |
2nd residual block | C(1 × 1–1, 64) → C(3 × 3–1, 64) → C(1 × 1–1, 256) | 8.3 → 36.9→ 16.6 |
Shortcut connection | C(1 × 1–1, 256) | 33.0 |
3rd residual block | C(1 × 1–1, 128) → C(3 × 3–1, 128) → C(1 × 1–1, 512) | 32.9 → 147.6→ 66.1 |
Shortcut connection | C(1 × 1–1, 512) | 131.6 |
1st dense block | [C(1 × 1–1, 128) → C(3 × 3–1, 32)] ⊗ 2 | 205.2 × 2→ 213.4 × 2 |
Transition | C(1 × 1–1, 32) → P(2 × 2–2) | 18.5 × 2 |
2nd dense block | [C(1 × 1–1, 256) → C(3 × 3–1, 64)] ⊗ 4 | 132 × 2→ 590 × 2 |
Transition | C(1 × 1–1, 64) → P(2 × 2–2) | 18.5 × 2 |
3rd dense block | [C(1 × 1–1, 512) → C(3 × 3–1, 128)] ⊗ 8 | |
Flatten | Average global pooling | |
Classifier (FC layer) | softmax | 6.5 + 87.1 |
Total | 16.5M |
Table 2. Summary of the datasets: number of classes and per-class image distribution for the FR and FER tasks.
Dataset | #Images | FR | FER | ||||||
---|---|---|---|---|---|---|---|---|---|
#Participants | Min | Max | Avg. | #Expressions | Min | Max | Avg. | |
JAFFE | 213 | 10 | 20 | 23 | 21.3 | 7 | 29 | 32 | 30.4 |
CK+ | 981 | 118 | 3 | 18 | 8.3 | 7 | 54 | 249 | 140.1 |
OuluCASIA | 1,440 | 80 | 18 | 18 | 18.0 | 6 | 240 | 240 | 240.0 |
KDEF | 4,900 | 140 | 35 | 35 | 35.0 | 7 | 700 | 700 | 700.0 |
HOUS22 | 639 | 21 | 29 | 40 | 30.1 | 2 | 203 | 436 | 319.5 |
Table 3. Parameters for experimental running.
Parameter | Range value | |
---|---|---|
Data augmentation | Variance of Gaussian noise addition | [0, 0.1] |
Rotation relative to original image (radian, negative is counter-clockwise) | [−0.1 | |
Shifting relative to size of original image (percentage, negative is left or up shifting, both width and height) | [−10%, 10%] | |
Scaling relative to size of original image (negative is downscaling, both width and height) | [−10%, 10%] | |
Horizontal flipping image | Yes/No | |
Training model | Learning rate (lr) | 10⁻⁵ |
Batch size | 128 | |
Epochs | 150 |
Table 4. Testing accuracies in 5 folds on datasets of the MTFace model.
#Fold/running | JAFFE | CK+ | OuluCASIA | KDEF | HOUS22 | |||||
---|---|---|---|---|---|---|---|---|---|---|
FR | FER | FR | FER | FR | FER | FR | FER | FR | FER | |
1 | 100 | 100 | 100 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
2 | 97.73 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98.45 | 98.45 |
3 | 100 | 100 | 99.49 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
4 | 100 | 100 | 100 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
5 | 100 | 100 | 100 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
Average | 99.55 | 100 | 99.90 | 100 | 100 | 100 | 99.84 | 99.92 | 99.69 | 99.69 |
Table 5. Comparison of accuracies on testing data.
Study (#folds) | Model size | JAFFE | CK+ | Oulu-CASIA | KDEF | HOUS22 | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
FR | FER | FR | FER | FR | FER | FR | FER | FR | FER | ||
Li and Deng [1] (best cases) | - / - | - | 95.80 | - | 99.60 | - | 91.67 | - | - | - | - |
Devaram et al. [29] (5 folds) | 55 / 1.6 | - | 80.09 | - | 84.27 | - | - | - | 99.90 | - | - |
Bhatti et al. [2] (0.3T) | - / 0.005 | - | 96.80 | - | 86.50 | - | - | - | - | - | - |
Ming et al. [30] | 60 / 39 | - | - | 99.00 | 99.50 | 99.14 | 89.60 | - | - | - | - |
Long [13] (10 folds) | 49 / 23.5 | - | 96.20 | - | 99.68 | - | 98.47 | - | - | - | - |
Long et al. [4] (10 folds) | 34 / 2.4 | - | 99.08 | - | 99.90 | - | 100 | - | - | - | - |
Yu and Xu [31] (10 folds) | - / - | - | - | - | 98.33 | - | - | - | - | - | - |
ResNet50V2 based model (5 folds) | 50 / 23.7 | 93.50 | 93.50 | 96.12 | 99.39 | 96.74 | 97.99 | 96.37 | 86.88 | 95.46 | |
MTFace model (5 folds) | 42 / 16.5 | 99.55 | 100 | 99.90 | 100 | 100 | 100 | 99.84 | 99.92 | 99.69 | 99.69 |
The bold font indicates the best performance in each test.
International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(3): 229-243
Published online September 25, 2023 https://doi.org/10.5391/IJFIS.2023.23.3.229
Copyright © The Korean Institute of Intelligent Systems.
Faculty of Information Technology, Hanoi Open University, Hanoi, Vietnam
Correspondence to:Duong Thang Long (duongthanglong@hou.edu.vn)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Multi-task convolutional neural networks (MTCNN) have garnered substantial attention for their impressive feats in image classification. However, MTCNNs tailored for the dual tasks of face recognition and facial expression recognition remain relatively scarce. These networks often entail complexity and substantial computational demands, rendering them unsuitable for practical systems constrained by computational limitations. Addressing this gap, the present paper proposes an efficient MTCNN model that capitalizes on cutting-edge architectures—residual networks and dense-connected networks—to adeptly execute face recognition and facial expression recognition tasks. The model’s efficacy is rigorously assessed across wellknown datasets (JAFFE, CK+, OuluCASIA, KDEF) as well as collected images from learners (HOUS22). The proposed model consistently attains high accuracy levels across all datasets, with a notable minimum accuracy of 99.55% on testing data. These outcomes stand as a testament to the model’s remarkable performance, particularly in relation to its streamlined design. Moreover, the model is seamlessly integrated with an online learning management system, furnishing a versatile means of monitoring learning activities to enhance the overall training quality.
Keywords: Multi-task convolutional neural network, Residual networks, Dense-connected networks, Face recognition, Facial expressions recognition, Online learning management systems
Face recognition (FR) and facial expression recognition (FER) pose significant challenges in the field of computer vision. Convolutional neural networks (CNN) have emerged as the go-to choice for addressing these challenges, with prominent models like VGGNet, GoogleNet, ResNet, SENet, and DenseNet attaining remarkable outcomes [1–6]. VGGNet presents a straightforward architecture, while ResNet incorporates residual blocks (depicted in Figure 1(a)) featuring shortcut connections that mitigate the vanishing gradient problem and empower the network to learn residual attributes [7]. InceptionNet adopts a modular architecture, employing multiple parallel branches to capture diverse scales and aspect ratios of features within input images. DenseNet employs densely connected blocks (illustrated in Figure 1(b)) that concatenate layer outputs, resulting in a comprehensive feature representation [8]. However, the challenges in FR and FER endure due to the inherent diversity in facial expressions among individuals and the temporal variability of such expressions. Moreover, present CNN models are often tailored separately for each problem and can entail computational complexity attributed to an extensive parameter count.
In recent times, e-learning has seen substantial growth and popularity due to its adaptable and personalized learning approach, cost-effectiveness, and proven efficacy compared with traditional education. However, ensuring the quality of online activities through monitoring and evaluation is crucial to reduce cheating and uphold a high-quality education system. Prior research has utilized biometrics for identification and monitoring during online learning and testing sessions [9–11], yet a system for face recognition and detecting facial expressions would be more user-friendly, considering that most computers or user devices come with built-in cameras for online learning. Moreover, the FR/FER system can consistently identify learners throughout their learning journey or real-time tracking of students’ learning interactions within the learning management system (LMS), enabling educators and administrators to more effectively address each learner’s training process [4, 12, 13].
This study introduces an innovative approach to tackle the challenging task of simultaneous face recognition and facial expressions recognition, achieved through a multi-task CNN (MTCNN) model. The proposed model blends the ResNet and DenseNet architectures, offering several advantages like reducing network parameters, enhancing generalization capabilities, capturing spatial and semantic image information, boosting accuracy, and achieving quicker and more stable convergence during training. The model’s performance is assessed on four well-known datasets: CK+ [14], OuluCASIA [15], JAFFE [16], KDEF [17] along with datacollected from learners at Hanoi Open University, HOUS22. The results convincingly showcase the effectiveness and superiority of the proposed model when compared to other existing approaches. Additionally, this study devises an integration of the model with an LMS, facilitating the monitoring and evaluation of online learning activities—an imperative aspect for upholding educational quality.
As we know, the process of feature extraction holds a pivotal role in tasks related to image recognition, exerting significant influence on recognition accuracy. Traditional techniques, such as histogram of oriented gradients (HOG), local binary pattern (LBP), Gabor features, and Haar features, are mentioned in [18, 19]. These methods exhibit commendable performance on uncomplicated and uniform datasets. Nevertheless, real-world datasets tend to be intricate and heterogeneous, encompassing diverse variations in facial expressions including tone, pose, angle, brightness, and more. These variations present considerable obstacles for conventional approaches. Consequently, CNN models have been extensively employed in research endeavors targeting both FR and FER challenges, owing to their elevated accuracy and versatility across various applications [1, 7, 18].
MTCNN stands as a category of deep learning CNN model that holds potential in enhancing recognition quality through the incorporation of several associated tasks [1, 20, 21]. These models implement parameter sharing to unearth shared features among recognition challenges via deep convolutional layers. In MTCNN models, there exist two forms of parameter sharing: hard-sharing and soft-sharing. In the hard-sharing approach, a shared backbone network architecture is employed for feature extraction, succeeded by independent classifiers for each recognition task (Figure 2(a)). The soft-sharing technique deploys distinct network architectures for the extraction of features specific to each problem, while interconnecting convolutional layers for parameter sharing (Figure 2(b)).
Ma et al. [22] designed a MTCNN with hard-sharing parameters using the VGG architecture comprising 11 convolutional layers to assign taxonomies to human viruses. The model performs dual tasks: taxonomic assignment and assigning the genomic region for the designated taxonomic label. The species label predicted by the first task is combined with the model’s final layer and employed as input for the second task. The model attains a flawless prediction accuracy of 100% when tested with simulated divergence from the HIV-1 dataset. Liu et al. [23] developed a CNN model with soft-sharing parameters and cross-layers connection for detecting maritime targets. This model utilizes 16 shared core convolutional layers for feature extraction, underpinned by the VGG network architecture, yielding a prediction accuracy of 92.3% for their MRSP-13 dataset. Viet and Bao [24] engineered an MTCNN with hard-sharing parameters to recognize gender, smiley status, and facial expressions, drawing on a VGG architecture incorporating 9 convolutional layers and 3 fully connected layers. Mao et al. [21] proposed another MTCNN model with hard-sharing parameters targeting facial attribute recognition, leveraging the ResNet architecture to achieve average accuracies of 91.7% and 86.56% on the Celeb-A and LFWA datasets, respectively.
Generally, researchers frequently employ hard-sharing parameters when crafting MTCNN models for image recognition, often integrating cutting-edge architectures as backbone networks for feature extraction. However, the intricacy of such models can lead to decreased recognition accuracy in certain scenarios [24]. To facilitate broad acceptance, it becomes crucial to develop efficient models that strike a balance between accuracy and practicality, particularly in contexts like monitoring online learning activities.
This section presents the design of the proposed MTCNN model, known as MTFace, which undertakes the tasks of FR and FER. The MTFace model encompasses two pivotal stages: (1) extracting features from images to depict facial identifiers and expressions, and (2) categorizing the extracted features into their respective labels for each task. The efficacy of recognition and computational intricacy of models typically hinge on the quantity of filters and network depth. Researchers often calibrate these factors to align with specific application requisites, aiming for heightened recognition accuracy and reasonable computational complexity. The MTFace model is fashioned with a moderate layer count and an appropriate filter tally in each convolutional layer, ensuring compatibility with computational resources and facilitating broad applicability.
The MTFace model employs residual architectures in initial layers, enhancing the network’s ability to preserve original image details and capture low-level features. These layers function as soft-sharing parameters, where the extracted low-level features from input images flow into subsequent layers for both FR and FER tasks. For subsequent layers, dense connection architectures are employed to capture high-level features. These layers bolster the richness and effectiveness of features [25, 26]. Hence, the MTFace model encompasses two feature extraction phases: the initial one involves raw and low-level feature extraction via residual blocks, while the subsequent one refines the extracted features, generating increasingly sophisticated high-level features through densely connected blocks. In the second phase, two consecutive blocks are formulated as hard-sharing parameters, signifying the separation of refined and high-level features for individual tasks.
In this model, residual blocks (RBs) are harnessed to amplify feature extraction capabilities. An RB comprises three convolutional layers, each followed by batch normalization (BN) and ReLU activation after each convolutional operation (CV). The central layer employs a 3 × 3 convolution, while the other two employ 1 × 1 convolutions. To align the input of the block with the number of filters in the last convolutional layer, a 1×1 convolutional operation is applied to the shortcut connection. Subsequently, an element-wise summation (⊕) combines the input and the output of the block, followed by ReLU activation. This auxiliary pathway conserves existing information in the inputs, channeling it into the network’s deeper layers. Figure 3 provides a visual representation of the RB design. The operation of this block can be represented formally as
For densely connected blocks (DBs), a sequence of convolutional layers using 1 × 1 and 3 × 3 kernels is employed. The initial layer in each pair contains four times the number of filters compared to the subsequent layer. Prior to each convolutional operation, the input feature maps undergo BN and ReLU activation. At the conclusion of each pair, a concatenation ([,]) of all preceding feature maps takes place, encompassing not only the feature maps from the immediate prior layer. To streamline network computational complexity, transition layers (TLs) are introduced following each DB, aimed at diminishing the filter count. These TLs encompass a 1 × 1 convolutional operation, ReLU activation, and 2 × 2 average pooling (P
where
In its entirety, the proposed MTFace model encompasses an initial trio of RBs for hard-sharing parameters, succeeded by two sequences of consecutive DBs. Each DB sequence caters to an independent FR/FER task as soft-sharing parameters. Notably, the preceding DBs in the two tasks are concatenated and linked to the ensuing DBs. These blocks exhibit variations in filter counts and convolutional layer repetitions within the DBs. To effectively capture raw and low-level features from the RBs, a higher count of filters is engaged in comparison to the DBs. The RBs employ 128, 256, and 512 filters, while the DBs employ 32, 64, and 128 filters. Subsequent refinement of the extracted features takes place within the DBs through pairs of convolutional layers, with repetitions of 2, 4, and 8. Importantly, the features derived from the RBs are preserved and propagated through the model’s end via concatenations within the DBs to facilitate classification. Within each task, global average pooling amalgamates and flattens the extracted features, followed by a ‘softmax’ activated fully connected layer at the model’s terminus, for corresponding FR or FER tasks.
As the network advances through its layers, it progressively uncovers abstract and representative features of higher-level structures within the input image. This hierarchical process of feature extraction stands as a pivotal attribute of deep CNN, underpinning their capacity to acquire robust representations for image classification objectives. The outset of the model employs a convolutional layer coupled with a pooling layer to directly draw raw and low-level features from input images. These initial layers typically deploy medium-sized filters and fewer filter counts, enabling the capture of features like edges, textures, and shapes. This primary block, referred to as the first block, employs 64 filters with dimensions of 7 × 7 for convolutional operations and 3 × 3 with a 2-stride for average pooling. Figure 5 offers an illustrative overview of the entire model, wherein numbers within RBs indicate filter counts, and numbers within DBs signify filter counts and repetitions, correspondingly.
The MTFace model encompasses a total of 72 convolutional layers, distributed across 10 blocks, functioning as feature extractors. Given the inclusion of two branches for distinct tasks, the model attains a depth of 42 convolutional layers. Despite its notable layer count, the model’s overall complexity remains relatively modest, with approximately 16.5 million parameters. This renders it less intricate compared to contemporary CNN models devised for image classification challenges. The model attains its low parameter count by employing compact kernel sizes for the convolutional operations. Detailed parameters for the model, tailored to a specific input size of height × width × channels (110 × 90 × 3), are outlined in Table 1. Here, the symbol ⊗ denotes repetition within DBs, “C” signifies convolutional operations with designated kernel size, strides, and filter count, “P” indicates pooling operations with specified window size and strides, and “ → ” symbolizes the forward linkage between two layers. Notably, parameters for operations within DBs are duplicated twice due to the twin sequences corresponding to the FR/FER tasks.
For the purpose of fortifying the model’s resilience and curbing overfitting during training, 2D image processing strategies are harnessed to augment the training images, aligning with recommendations from earlier studies [4, 13]. This augmentation entails incorporating various techniques such as noise addition, rotation, flipping, cropping, shifting, and color adjustments, among others. These maneuvers amplify the training data’s diversity and improve the model’s capacity to withstand disparities in input images, encompassing alterations in styles, lighting, orientations, and viewpoints. When presented with a face image
where denotes the augmentation operation involving parameters
Nonetheless, meticulous consideration is given to the selection of parameters for these augmented operations, ensuring the preservation of crucial facial expression and face-related information for feature extraction. Overly aggressive rotation or shifting of an image could potentially lead to the loss of vital facial expression details, impeding the model’s capacity to accurately extract pertinent features for recognition. Referring to Figure 6, the third augmented image in the final row showcases excessive shifting, while the initial augmented image in the middle row displays an excessive amount of noise, both of which might lead to information loss. In this study, parameter values are randomly chosen within appropriate ranges for each enhancement operation, and on occasion, an image might undergo multiple enhancement operations simultaneously.
In real-world scenarios, input images are frequently captured using user devices like webcams or cameras, and they may encompass backgrounds featuring multiple objects. Therefore, it becomes imperative to identify faces within the images and eliminate extraneous backgrounds. A widely recognized model for this purpose is MTCNN [27], a solution that finds extensive application.
In practical scenarios, an amalgamation of the MTFace model and an LMS is designed to oversee and appraise online learning undertakings. This integration was accomplished using realized through the utilization of application programming interfaces (APIs) bridging the two systems, as expounded upon in [4]. The MTFace model is introduced as a module that can be executed on client devices, such as web clients or mobile applications. Learners access the LMS by providing their learning account credentials (identifier and password), enabling authentication for learning activities. The system prompts them to activate their camera or webcam, capturing videos or snapshots of their learning engagements for both FR and FER tasks. While the FR task aims to ascertain and authenticate learners across the learning journey, the FER task assesses their emotional states.
A monitoring period, perhaps set at 5 seconds, can be established, during which a captured image is relayed to the MTFace model for FR and FER. To commence, learners are required to contribute face images to facilitate model training. Once trained, the model assumes the role of recognizing faces and facial expressions within images documenting online learning endeavors. The outcomes of these FR and FER are archived in a historical record, serving as valuable inputs for evaluating learning caliber, assessing educational content, and gauging lecturer performance. Informed by these outcomes, notifications can be disseminated to learners, lecturers, and administrators, furnishing insights to tailor their actions and enhance the quality of education.
To ensure a secure connection, the LMS initiates each session by transmitting a message containing the user ID and security code to the MTFace system. A depiction of this interconnected setup and the operational progression of this integrated system is depicted in Figure 7.
The integration of the system imposes minimal alterations on an existing LMS to establish a connection with the MTFace model, thus endowing it with a high degree of autonomy. The LMS retains the capability to function independently, devoid of obligatory ties to the model. Upon integration with the MTFace model, the LMS becomes the recipient of FR and FER outcomes throughout the entirety of the learning trajectory. These outcomes are subsequently employed to synthesize and evaluate online learning engagements, facilitating the dispatch of notifications to learners or educators as warranted, ultimately contributing to the enhancement of education quality. This seamless compatibility underscores the ease of incorporating the MTFace model within any pre-existing LMS framework.
This section encompasses the presentation of datasets and parameters employed for MTFace model training, coupled with an extensive analysis of training outcomes. Additionally, a real-time demonstration experiment is executed to showcase the effectiveness of the proposed model, drawing comparisons with other established methodologies within the domain. In the course of experimentation, five distinct datasets were brought into play: JAFFE [16], CK+ (Extended Cohn-Kanade) [14], OuluCASIA [15], KDEF [17], and a compilation of images sourced from Hanoi Open University students (HOUS22).
The JAFFE dataset incorporates 213 images featuring 10 Japanese women, each portraying 6 primary emotions alongside a ‘neutral’ emotion. Comprising 981 images from 118 subjects, the CK+ dataset illustrates 6 primary emotions along with the ‘contempt’ emotion. Presenting 1,440 images of 80 individuals, the OuluCASIA dataset captures 6 fundamental facial expressions against diverse illumination and head poses, rendered in color. Notably, both the JAFFE and CK+ datasets adopt a grayscale format. Housing 4,900 images, the KDEF dataset encompasses 6 core emotions plus the ‘neutral’ expression, captured from 5 distinct angles. It encompasses imagery of 70 individuals (35 females and 35 males) between the ages of 20 and 30, characterized by a lack of occlusions like mustaches, earrings, and eyeglasses. The compiled HOUS22 dataset showcases color images drawn from online learners at Hanoi Open University, captured under real-world conditions. This collection involves a total of 21 participants, spanning 13 males and 8 females, and features two facial expressions: “Positive” and “Negative”. A detailed breakdown of label numbers in the FR and FER tasks, total image counts, and image distribution across class categories for these datasets is available in Table 2. Among the array, OuluCASIA and KDEF maintain balance, displaying an equal number of images per class, whereas JAFFE, CK+, and HOUS22 exhibit imbalance, with variances in image distribution per class. Table 2 also highlights the minimum, maximum, and average image counts per class across these datasets.
Figure 8 offers a selection of sample images sourced from the five datasets, serving to visually elucidate the context. The initial three datasets, namely JAFFE, CK+, and OuluCASIA, were amassed within controlled settings, rendering them comparatively straightforward to differentiate among classes in FER. The KDEF dataset, similarly procured under controlled conditions, introduces images captured from various angles, encompassing the left and right sides of the face, exemplified in the first, fourth, and fifth images of Figure 8(d). In sharp contrast, the HOUS22 dataset was collected without specific constraints imposed on participants, thereby introducing instances where discerning between classes proves intricate. Notably, the second and third images in Figure 8(e) exemplify the challenge of distinguishing the “Positive” and “Negative” expressions intuitively within the HOUS22 dataset.
The experimental framework adopts a 5-fold cross-validation approach. The dataset is randomly partitioned into 5 equally sized folds. Within each iteration, one fold assumes the role of the test set (20% of the data), while the remaining four folds are used for training and validation.
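A sketch of such a split is shown below; stratifying on the identity labels is an added assumption for illustration, since the paper states only that the folds are formed randomly and are of equal size.

```python
from sklearn.model_selection import StratifiedKFold

def five_fold_splits(images, id_labels, seed=0):
    """Yield (train_idx, test_idx) index pairs for the 5-fold protocol.
    Stratifying on the identity labels is an assumption of this sketch."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    yield from skf.split(images, id_labels)
```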
The experimentation took place on a computer system equipped with a TPU and 32 GB RAM. The formulation of the proposed model was executed through the Python programming language on the TensorFlow platform. Renowned for its robust capabilities in image processing and CNN modeling, TensorFlow is a prevalent deep learning framework.
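The training configuration can be sketched as follows using the hyperparameters listed in Table 3 (learning rate 10⁻⁵, batch size 128, 150 epochs); the Adam optimizer, the cross-entropy losses, the equal weighting of the two task losses, and the output names “fr”/“fer” are assumptions, as these details are not stated in the paper.

```python
import tensorflow as tf

def train_mtface(model, train_images, train_fr, train_fer, val_data):
    """Compile and fit a two-output model with the settings from Table 3."""
    train_ds = (
        tf.data.Dataset.from_tensor_slices(
            (train_images, {"fr": train_fr, "fer": train_fer}))
        .shuffle(10_000)
        .batch(128)                                               # batch size from Table 3
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),   # learning rate from Table 3
        loss={"fr": "categorical_crossentropy", "fer": "categorical_crossentropy"},
        loss_weights={"fr": 1.0, "fer": 1.0},                     # assumed equal task weighting
        metrics={"fr": "accuracy", "fer": "accuracy"},
    )
    return model.fit(train_ds, validation_data=val_data, epochs=150)  # epochs from Table 3
```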
Figure 9 showcases the evaluation of training and model validation accuracies across 5 runs. Each subfigure delineates the accuracies of both the training data (represented by a solid line) and the validation data (depicted by a dashed line). The distinctive traits of the JAFFE dataset, including the limited image count and facial expression similarity across classes, contributed to the observed fluctuations in validation accuracies during the training phase. Evidently, commendable accuracy levels were achieved on the training data starting from roughly the 20th epoch onward.
The recognition outcomes on the testing data from the trained model were computed by averaging across 5 runs, as depicted in Table 4. Analogous to the training progression, the FER task consistently yielded results at least as high as those of the FR task across all datasets. This divergence could potentially stem from the greater number of classes within the FR task in contrast to the FER task, leading to a reduced quantity of images per class for the former; notably, in the case of the KDEF dataset, this difference reaches up to 20-fold. The OuluCASIA dataset emerged with the highest accuracy, reaching a perfect 100% for both tasks, whereas the HOUS22 dataset achieved matching accuracies of 99.69% for both tasks. The remaining datasets exhibited variances between the FR and FER tasks, with disparities spanning from a minimal 0.08% differential in KDEF to a maximum 0.45% variation in JAFFE. This indicates that neither the FR nor the FER task demonstrates notably heightened complexity compared to the other; nonetheless, it does underscore the existence of certain images that pose considerable challenges in terms of recognition.
Figure 10 showcases nine instances of misclassified images extracted from the testing data of the JAFFE, CK+, KDEF, and HOUS22 datasets. Notably, the OuluCASIA dataset demonstrated error-free performance on the testing data for both the FR and FER tasks. Each image is accompanied by a title denoting its target class (t#) and the class predicted by the model (p#).
Figure 11 presents a visual representation of the model’s feature extraction process on input images through the utilization of the Grad-CAM method [28]. The heatmaps, emanating from the final convolutional layers of each task, are depicted in Figure 11(a)–(e) corresponding to the JAFFE, CK+, OuluCASIA, KDEF, and HOUS22 datasets, respectively. This gradient-oriented localization technique discerns pivotal regions within input images, such as the mouth, nose, and eyes, that hold significance for conveying facial expressions and facial features. The visualization underscores the MTFace model’s emphasis on these specific areas to aptly extract informative features. Predominantly, the heatmaps converge on the mouth, nose, and eyes, with nuanced variations across the rest of the face, signifying that the convolutional layers of the MTFace model prioritize these regions when extracting features tied to facial expressions and identity; neglecting these regions can lead to inaccurate recognition.
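Since Grad-CAM [28] underpins these visualizations, a compact TensorFlow sketch of the computation is given below; the layer names passed in are placeholders, as the actual layer names of the MTFace implementation are not reported.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, output_name, class_index=None):
    """Compute a Grad-CAM heatmap for one task head of a multi-output model.
    Pass the last convolutional layer of the chosen branch and the
    corresponding softmax output layer (placeholder names)."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output,
         model.get_layer(output_name).output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)        # gradient of the class score w.r.t. the feature maps
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pool the gradients
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam)                         # keep only positive contributions
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalized heatmap in [0, 1]
```

Upsampling the returned map to the input resolution and overlaying it on the face image reproduces heatmaps of the kind shown in Figure 11.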
Table 5 furnishes a comprehensive juxtaposition of the experimental outcomes attained by the proposed model with those of other studies [1, 2, 4, 13, 29–31]. All entries in this comparison used CNN-based models, encompassing a spectrum of data scenarios, succinctly annotated within parentheses adjacent to the corresponding method. The best cases reported in [1] are selected, as that study covered a multiplicity of experimental scenarios. The symbol “-” denotes instances wherein experimental results remain unreported. Notably, the model dimensions encompass both the convolutional layer count, signifying the model’s depth, and the tally of model parameters, juxtaposed via the “/” symbol. The preeminent accuracies within the FR and FER tasks are underscored in bold font. Remarkably, the MTFace model notched a flawless 100% accuracy in four specific scenarios: the FER task for the JAFFE, CK+, and OuluCASIA datasets, alongside the FR task for the OuluCASIA dataset. Despite exhibiting a moderate complexity, ranking as the second least intricate among the methodologies considered and featuring the fourth lowest parameter count, the MTFace model emerged triumphant in nine instances across the FR and FER tasks. For additional reference, a baseline built on the cutting-edge ResNet50V2 architecture [32] is also included, serving as a feature extraction backbone with hard-sharing parameters for both FR and FER tasks; its classification head mirrors that of the MTFace model, and it was trained with the same 5 folds of data, augmentation, and parameters as the MTFace model.
Reported results for the FR task were limited across the datasets, as most authors focused on the single FER task. Only [30] reported outcomes for both the FR and FER tasks, with FR performances of 99.00% and 99.14% on the CK+ and OuluCASIA datasets, respectively; these results rank second in the comparison, slightly trailing those of the proposed model. In three instances the ResNet50V2 based model secured the second-highest accuracy: the FR and FER tasks within the HOUS22 dataset and the FR task on the KDEF dataset. The MTFace model reported its lowest accuracy of 99.55% for the FR task on the JAFFE dataset, while the ResNet50V2 based model demonstrated its lowest of 86.88% for the FER task on the KDEF dataset. Notably, the OuluCASIA dataset documented two instances of immaculate 100% FER accuracy, for the proposed model and [4]. Evidently, the OuluCASIA dataset’s robust outcomes reflect stringent image capture conditions that mitigate pose and illumination deviations.
Table 5 discloses a compelling revelation: despite having 30.4% fewer parameters than the ResNet50V2 based model (approximately 16.5 million versus 23.7 million), the MTFace model yields commensurate or superior recognition results on test data. This implies the proposed model’s efficacy in efficiently harnessing hard-sharing parameters for low-level feature extraction, while soft-sharing parameters handle high-level feature extraction for the distinct tasks. Furthermore, enlarging the training data with augmented images enhanced the model’s recognition accuracy. Although most of the compared studies address only the single FER task, the MTFace model consistently attained superior accuracies, a result possibly attributable to the augmented training data fostering more effective models with enhanced recognition capabilities.
In this study, a novel multi-task CNN-based model was introduced for FR and FER processes. Leveraging two cutting-edge architectures, namely residual networks and densely connected networks, the proposed model strives for heightened accuracy while maintaining a manageable level of complexity. The model comprises an initial stage featuring three RBs, catering to hard-sharing parameters for extracting low-level features. Subsequently, the second stage encompasses two distinct sequences, each comprising three DBs, geared towards extracting high-level features. With a total of 72 convolutional layers serving as adept feature extractors, the model achieves a balance between layers and parameters. Remarkably, despite its relatively generous layer count, the model’s parameter count remains modest, at approximately 16.5 million, achieved by judicious use of small kernel sizes for convolutional operations.
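To make this two-stage layout concrete, the following Keras sketch wires three shared residual blocks to two task-specific sequences of densely connected blocks with softmax heads; it follows the block pattern of Table 1 only approximately, and the kernel counts, transition layers, input resolution, and output names “fr”/“fer” are illustrative assumptions rather than the authors’ exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel, strides=1):
    x = layers.Conv2D(filters, kernel, strides=strides, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def residual_block(x, bottleneck, out_filters):
    """Simplified residual block (RB): 1x1 -> 3x3 -> 1x1 with a projected shortcut."""
    shortcut = conv_bn_relu(x, out_filters, 1)
    y = conv_bn_relu(x, bottleneck, 1)
    y = conv_bn_relu(y, bottleneck, 3)
    y = conv_bn_relu(y, out_filters, 1)
    return layers.Add()([shortcut, y])

def dense_block(x, growth, reps):
    """Simplified densely connected block (DB): step outputs are concatenated."""
    for _ in range(reps):
        y = conv_bn_relu(x, 4 * growth, 1)
        y = conv_bn_relu(y, growth, 3)
        x = layers.Concatenate()([x, y])
    return x

def task_branch(x, growths=(32, 64, 128), reps=(2, 4, 8)):
    """One soft-shared sequence of three DBs with transition layers, then pooling."""
    for g, r in zip(growths, reps):
        x = dense_block(x, g, r)
        x = conv_bn_relu(x, g, 1)            # transition convolution
        x = layers.AveragePooling2D(2)(x)    # transition pooling
    return layers.GlobalAveragePooling2D()(x)

def build_mtface(input_shape, n_ids, n_expressions):
    inp = layers.Input(shape=input_shape)
    x = conv_bn_relu(inp, 64, 7, strides=2)                  # first convolutional layer
    for bneck, outf in [(32, 128), (64, 256), (128, 512)]:   # three hard-shared RBs
        x = residual_block(x, bneck, outf)
    fr = layers.Dense(n_ids, activation="softmax", name="fr")(task_branch(x))
    fer = layers.Dense(n_expressions, activation="softmax", name="fer")(task_branch(x))
    return tf.keras.Model(inp, [fr, fer], name="MTFace_sketch")
```

For instance, `build_mtface((96, 96, 3), n_ids=80, n_expressions=6)` would instantiate the sketch for an OuluCASIA-sized label space; calling `task_branch` twice creates the two independent (soft-shared) high-level branches on top of the shared low-level backbone.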
Experimental validation was conducted using prominent datasets alongside collected images, employing a 5-fold data scenario. The proposed model achieved remarkable accuracy, ranging from 99.55% to 100% on testing data for both FR and FER tasks across all datasets. This resounding performance positions it as the top contender in nine out of 14 instances, as demonstrated in Table 5. The practical viability of this accuracy is noteworthy, and while current computational constraints limited the training depth and dataset size, future iterations hold potential for even higher achievements. The proposed model holds promise for real-world application, boasting a suitable level of complexity that aligns well with systems constrained by computational resources. Its seamless integration into existing learning management systems offers flexibility and user-friendliness, facilitated by a client version that mitigates image transfers and network bandwidth demands. The resultant integrated system empowers real-time monitoring of online learners’ activities and emotional responses, equipping educators with actionable insights to tailor their approaches and elevate the learning experience.
Looking forward, future endeavors could prioritize refining and enhancing the efficiency of multi-task CNN-based models through the incorporation of established network architectures. Expanding the scope of the model to encompass supplementary tasks, such as head-pose estimation, presents an intriguing avenue for exploration. Moreover, an avenue of potential growth lies in evaluating the model’s performance on larger, more diverse datasets to unlock even more impressive results.
(a) Residual blocks and (b) densely connected blocks.
(a) Hard-sharing parameters and (b) soft-sharing parameters.
Structure of residual blocks (RBs).
Structure of densely connected blocks (DBs).
Structure of proposed multi-task CNN (MTFace) model for FR and FER problems.
Example of augmented images.
Diagram of connections between MTFace model and learning management system (LMS).
Example images from (a) JAFFE, (b) CK+, (c) OuluCASIA, (d) KDEF, and (e) HOUS22 datasets.
The accuracies of training data and validation data from training histories in (a) JAFFE, (b) CK+, (c) OuluCASIA, (d) KDEF, and (e) HOUS22 datasets.
Misclassified images extracted from the testing data of the JAFFE, CK+, KDEF, and HOUS22 datasets.
Heatmaps of the last convolutional layers for both FR and FER tasks on images from (a) JAFFE, (b) CK+, (c) OuluCASIA, (d) KDEF, and (e) HOUS22 datasets.
Table 1. Parameters of the proposed MTFace model.
Layer/block | Operations (kernel size-stride, filter) | Parameter (thousand) |
---|---|---|
Input | - | - |
1st convolutional layer | C(7 × 7–2, 64) | 9.5 |
1st residual block | C(1 × 1–1, 32) → C(3 × 3–1, 32) → C(1 × 1–1, 128) | 2.1 → 9.3 → 4.2 |
Shortcut connection | C(1 × 1–1, 128) | 8.3 |
2nd residual block | C(1 × 1–1, 64) → C(3 × 3–1, 64) → C(1 × 1–1, 256) | 8.3 → 36.9→ 16.6 |
Shortcut connection | C(1 × 1–1, 256) | 33.0 |
3rd residual block | C(1 × 1–1, 128) → C(3 × 3–1, 128) → C(1 × 1–1, 512) | 32.9 → 147.6→ 66.1 |
Shortcut connection | C(1 × 1–1, 512) | 131.6 |
1st dense block | [C(1 × 1–1, 128) → C(3 × 3–1, 32)] ⊗ 2 | 205.2 × 2→ 213.4 × 2 |
Transition | C(1 × 1–1, 32) → P(2 × 2–2) | 18.5 × 2 |
2nd dense block | [C(1 × 1–1, 256) → C(3 × 3–1, 64)] ⊗ 4 | 132 × 2→ 590 × 2 |
Transition | C(1 × 1–1, 64) → P(2 × 2–2) | 18.5 × 2 |
3rd dense block | [C(1 × 1–1, 512) → C(3 × 3–1, 128)] ⊗ 8 | |
Flatten | Global average pooling |
Classifier (FC layer) | softmax | 6.5 + 87.1 |
Total | 16.5M |
Table 2. Details of the datasets: numbers of labels for the FR and FER tasks, total image counts, and image distribution per class.
Dataset | #Images | FR | | | | FER | | | |
---|---|---|---|---|---|---|---|---|---|
 | | #Participants | Min | Max | Avg. | #Expressions | Min | Max | Avg. |
JAFFE | 213 | 10 | 20 | 23 | 21.3 | 7 | 29 | 32 | 30.4 |
CK+ | 981 | 118 | 3 | 18 | 8.3 | 7 | 54 | 249 | 140.1 |
OuluCASIA | 1,440 | 80 | 18 | 18 | 18.0 | 6 | 240 | 240 | 240.0 |
KDEF | 4,900 | 140 | 35 | 35 | 35.0 | 7 | 700 | 700 | 700.0 |
HOUS22 | 639 | 21 | 29 | 40 | 30.1 | 2 | 203 | 436 | 319.5 |
Table 3. Parameters for experimental running.
Parameter | Range value | |
---|---|---|
Data augmentation | Variance of Gaussian noise addition | [0, 0.1] |
Rotation relative to original image (radian, negative is counter-clockwise) | [−0.1, 0.1] |
Shifting relative to size of original image (percentage, negative is left or up shifting, both width and height) | [−10%, 10%] | |
Scaling relative to size of original image (negative is downscaling, both width and height) | [−10%, 10%] | |
Horizontal flipping image | Yes/No | |
Training model | Learning rate (lr) | 10⁻⁵
Batch size | 128 | |
Epochs | 150 |
Table 4. Testing accuracies of the MTFace model across 5 folds on each dataset.
#Fold/running | JAFFE | CK+ | OuluCASIA | KDEF | HOUS22 | |||||
---|---|---|---|---|---|---|---|---|---|---|
FR | FER | FR | FER | FR | FER | FR | FER | FR | FER | |
1 | 100 | 100 | 100 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
2 | 97.73 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98.45 | 98.45 |
3 | 100 | 100 | 99.49 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
4 | 100 | 100 | 100 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
5 | 100 | 100 | 100 | 100 | 100 | 100 | 99.80 | 99.90 | 100 | 100 |
Average | 99.55 | 100 | 99.90 | 100 | 100 | 100 | 99.84 | 99.92 | 99.69 | 99.69 |
Table 5. Comparison of accuracies on testing data.
Study (#folds) | Model size | JAFFE | CK+ | Oulu-CASIA | KDEF | HOUS22 | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
FR | FER | FR | FER | FR | FER | FR | FER | FR | FER | ||
Li and Deng [1] (best cases) | - / - | - | 95.80 | - | 99.60 | - | 91.67 | - | - | - | - |
Devaram et al. [29] (5 folds) | 55 / 1.6 | - | 80.09 | - | 84.27 | - | - | - | 99.90 | - | - |
Bhatti et al. [2] (0.3T) | - / 0.005 | - | 96.80 | - | 86.50 | - | - | - | - | - | - |
Ming et al. [30] | 60 / 39 | - | - | 99.00 | 99.50 | 99.14 | 89.60 | - | - | - | -
Long [13] (10 folds) | 49 / 23.5 | - | 96.20 | - | 99.68 | - | 98.47 | - | - | - | - |
Long et al. [4] (10 folds) | 34 / 2.4 | - | 99.08 | - | 99.90 | - | 100 | - | - | - | -
Yu and Xu [31] (10 folds) | - / - | - | - | - | 98.33 | - | - | - | - | - | - |
ResNet50V2 based model (5 folds) | 50 / 23.7 | 93.50 | 93.50 | 96.12 | 99.39 | 96.74 | 97.99 | 96.37 | 86.88 | 95.46 | |
MTFace model (5 folds) | 42 / 16.5 | 99.55 | 100 | 99.90 | 100 | 100 | 100 | 99.84 | 99.92 | 99.69 | 99.69
The bold font indicates the best performance in each test.