search for


Feature Engineering Method Using Double-Layer Hidden Markov Model for Insider Threat Detection
International Journal of Fuzzy Logic and Intelligent Systems 2020;20(1):17-25
Published online March 25, 2020
© 2020 Korean Institute of Intelligent Systems.

Xiaoyun Ye , Sung-Sam Hong , and Myung-Mook Han

Department of Computer Science, Gachon University, Seongnam, Korea
Correspondence to: Myung-Mook Han (
Received December 31, 2019; Revised February 24, 2020; Accepted February 28, 2020.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the past, most Hidden Markov models based on time series only used the original HMM model. The single-layer models (HMMs) structure has a big problem, and it isn’t straightforward to play its due role when it is necessary to make fine adjustments to the scene. So it was impossible to entirely and flexibly perform user behavior. This paper performs feature extraction and analysis of user behavior data of time series. The data labels should be added after the parameters obtained by statistical methods for clustering to obtain the first hidden state, and the layers are further layered according to working hours and outside working hours. The experimental results show that the method has strong applicability and flexibility, and can quickly detect abnormal behavior.

Keywords : Hidden Markov Model (HMM), User behavior, Insider threat, Feature engineering, Anomaly detection.
1. Introduction

The rapidly developing network technology has not only changed the human lifestyle and brought great development opportunities, but also brought more significant challenges in terms of security. After the Edward Snowden incident, people are paying more and more attention to the security of privacy. Cyber-attacks have always existed, but in recent years, the losses caused by internal security risks have become increasingly dangerous. According to the IT Security Risks Survey conducted by Kaspersky Lab and B2B International, 73% of companies have been affected by both intentional and unintentional internal information security incidents. Out of those, a fifth (21%) of companies also lost valuable data that subsequently affected their business [1]. If we do not stop the insider threats, that high-risk situation should be more and more. But the insider threats can’t easily be considered as a data-driven problem. We can’t use the methods just like an external attack because the complexity of internal attacks is much higher than the outside’s attacks. Every employee’s daily work is different. The challenge of insider threat detection is to set up the profile of each employee’s behavior. Insider threat detection needs to focus on differentiating suspicious users from the other, but we can’t directly specify an employee is an attacker. So, we need as many parameters as possible to define the attack’s behavior. The hidden Markov model (HMM) is a statistical model used for representing probability distributions over sequences of observations [2]. This model has been used in several domains, such as speech recognition [3], text understanding [4], image image identification [5], and microbiology [6]. HMM has also been utilized in [7, 8] to train models using observed network traffic under normal network conditions and to detect diverting sequences of traffic observations. According to the paper by Kabir et al. [9], it shows that double-layer HMM can be used to build models in behavioral data. However, in the face of more important and comprehensive behavioral data, small-scale data models may not achieve better results. For multi-scenario situations, there are disadvantages that cannot be handled. The article only uses all the behaviors in the three rooms for statistics, but if the model is placed in a community, the prediction results will be produced unpredictable because of different human habits. The first layer of the double-layer HMM proposed in this paper is obtained by clustering using an unsupervised algorithm. Assuming that the user’s behavior is regular and can be analyzed, density-based spatial clustering of applications with noise (DBSCAN) is a useful division method. The Markov model needs to satisfy two points: continuous-time and discrete events [1013]. So, Carnegie Mellon University posted an intrusion detection dataset called “CERT” dataset [14]. This dataset satisfies the necessary conditions of the Markov Model. Then we can see the result in [15].

However, the disadvantage of this thesis [15, 16] is that the distinction between normal behavior and abnormal behavior is not clear enough.

So, our contribution in this paper is: (1) a new double-layer HMM model for identifying abnormal behavior, and (2) new feature extraction and modeling method using the hour system. The remainder of this paper is structured as follows: Section 2 presents the description of our related works. We will introduce the feature extraction model and the detection model in detail in Section 3. Section 4 presents the experimental results. Finally, Section 5 gives the concludes and our future work.

2. Related Works

2.1 Hidden Markov Model

An HMM is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states [2]. A Markov model can simple to understand as defining the probability of each behavior occurring and then identifying the product of the possibilities of all actions arising over a period. A simple Markov model’s structure is shown in Figure 1. By comparison, when the sequence is relatively long, it isn’t straightforward for the Markov chain to distinguish several behaviors. Because HMM has a hidden layer, it is easier to distinguish between normal and abnormal behavior than the Markov chain. The structure of HMM is shown in Figure 2.

An HMM can be considered as the most straightforward dynamic Bayesian network. The probabilistic parameters of an HMM are shown in Figure 3. We use an example to explain the process of HMM, just like the structure in Figure 4. There are three boxes: blue, orange, and green. And there are five types of balls: A, B, C, D, E. We can get the parameters like the follows:

  1. States = (Blue, Orange, Green)

  2. Observations = (A, B, C, D, E)

  3. Start probability (π) =

  4. Blue: 0.2, Orange: 0.3, Green: 0.5

  5. Transition probability (A) =

  6. Blue: (Blue: 0.3, Orange: 0.6, Green: 0.1),

  7. Orange: (Blue: 0.3, Orange: 0.4, Green: 0.3),

  8. Green: (Blue: 0.2, Orange: 0.7, Green: 0.1) )

  9. Observation probability (B) =

  10. Blue: (A: 0.1, B: 0.4, C: 0.5),

  11. Orange: (A: 0.2, B: 0.1, C: 0.7),

  12. Green: (B: 0.5, E: 0.5)

According to the timing, we randomly picked the box and randomly picked the ball inside the box like Figure 5. The first time, we got the blue box and the ‘C’ ball in the box. And next, we got a blue, orange, green box and the ‘A,’ ‘B,’ ‘E’ balls in their boxes.

So, this condition’s probability is (0.2 × 0.2) × (0.3 × 0.2) × (0.6 × 0.1) × (0.3 × 0.5). A complete HMM model requires a set of parameters: [A, B, π]. A is a transition probability matrix, B is the observation probability matrix, π is the initial state probability. When the feature set is extracted from the dataset, the feature space is obtained.

2.2 Double-Layer HMM

According to the paper by Kabir et al., The feasibility of using double-layer HMM in behavioral data can be shown, but in the face of more extensive and more comprehensive behavioral data, small-scale data models may not be able to obtain better results. For multi-scenario situations, There are obvious disadvantages of insufficient processing. The article only uses all the behaviors in the three rooms for statistics, but if the model is placed in a block or a community, the prediction results will be produced because of different human habits. There is a significant deviation. The first layer of the double-layer HMM proposed in this paper is obtained by clustering using an unsupervised algorithm. Assuming that the user’s behavior is regular and can be analyzed, clustering is a useful division method. Double-layer HMM is shown in Figure 5.

2.3 Viterbi Algorithm

The Viterbi algorithm is named after Andrew Viterbi, which was proposed in 1967. It is a decoding algorithm for convolutional codes on noisy digital communication links [17]. However, it has a history of multiple inventions with at least seven independent discoveries, including those by Viterbi, Nederman, and Winsch, and Wagner and Fischer [18]. “Viterbi path” and “Viterbi algorithm” have become standard terms for applying dynamic programming algorithms to problems involving probability maximization [17].

For example, we got an action shown in Figure 6: B to E. Depending on the situation, we can get three results.

  1. (1) B from blue box.


  2. (2) B from orange box.


  3. (3) B from green box.


So, we can calculate the probability as:


The Viterbi can choose the biggest probability for the model. So, the algorithm should choose as the probability of the action B to E.


DBSCAN is a data clustering algorithm proposed by Ester, et al. [19] in 1996. The method used by DBSCAN is straightforward. It arbitrarily selects a core object without a category as a seed and then finds all the sample sets whose core objects can reach a density, which is a cluster. Then continue to select another core object without a category to find a sample set with a dense density to obtain another clustering cluster. Run until all core objects have categories.

It can be easily seen from Figure 7. that the above definition is understood. In Figure 7, MinPts = 5, the red points are the core objects because their ε-neighborhood has at least 5 samples. The black samples are non-core objects. All samples with linear core object density are in the hypersphere with the red core object as the center. If they are not in the hypersphere, they cannot be directly dense. The core objects connected by the green arrows in Figure 7. form a sample sequence with a density. All samples in the ε-neighborhood of these dense sample sequences are densely connected. The main advantages of DBSCAN are:

  1. 1) You can cluster dense datasets of any shape. In contrast, clustering algorithms such as K-means are generally only suitable for convex datasets.

  2. 2) Anomaly points can be found during the clustering process, and they are not sensitive to abnormal points in the dataset.

  3. 3) There is no bias in the clustering results. In contrast, the initial value of clustering algorithms, such as K-means, has a significant influence on the clustering results.

DBSCAN’s main disadvantages are:

  1. 1) If the density of the sample set is not uniform and the clustering distances are very different, the clustering quality is poor. At this time, DBSCAN clustering is generally not suitable.

  2. 2) If the sample set is large, the cluster convergence time is longer. At this time, the size of the KD tree or the ball tree established when searching the nearest neighbor can be improved by limiting the size.

  3. 3) Tuning parameters is slightly more complicated than traditional clustering algorithms, such as K-means. It mainly needs to adjust the distance threshold ε and the neighborhood sample number threshold MinPts. Different parameter combinations have a more significant effect on the final clustering.

3. Method and Approach

3.1 Dataset

For the insider threat detection, there are not many suitable datasets. So, we have used the insider threat dataset published by CERT (Carnegie Mellon University) for our research [14]. We used the “” dataset for our analysis. This dataset consists of six types of data records: HTTP, logon, device, file, email, and psychometric. This dataset contains all the activities of 1,000 employees within 17 months. The HTTP records contain user, PC, URL, and content from the website with time stamps. “Logon.csv” consists of the user’s “Logon” and “Logoff” activities with their PC with time stamps. “Logon” and “Logoff” are the two kinds of activities that can be found in the data records. “Logon” activity corresponds to the event of each user’s login or an event of one user unlock their screen while the “Logoff” event corresponds to the user’s logoff event. But, screen locks do not appear in this dataset. The third part of the dataset “device.csv” is a collection of data records of removable media device usage. It indicates connection and disconnection actions with the relevant user, PC, and time stamps. “file.csv” contains the details of file copies with date, user, PC, filename, and content. For figure out the relationship between the employee, the CERT dataset provides email communication records including from, to, cc and bcc fields. In the last part of the dataset, “psychometric.csv” provides psychometric scores based on the Big 5 personality traits for the definition of pers[onality. Anomaly detection has been the main focus of many researchers’ due to its potential in detecting novel attacks. However, its adoption to real-world applications has been hampered due to system complexity as these systems require a substantial amount of testing, evaluation, and tuning prior to deployment. Running these systems over real labeled network traces with a comprehensive and extensive set of intrusions and abnormal behavior is the most idealistic methodology for testing in the 2013 IEEE Security and Privacy Workshops, a new dataset for internal intrusion detection was proposed in a paper published by researchers at Carnegie Mellon University [20]. The dataset has a complete time series and labels, and defines the scenarios for the red team’s abnormal behavior.

  1. 1. Users who don’t use mobile devices very often and get used to work normally suddenly start using mobile devices and start uploading data to WikiLeaks and leave in a short time.

  2. 2. Users start frequent visits to job search sites, and when they leave, they start to use mobile devices frequently (more frequently than usual) to transfer data.

  3. 3. The system administrator uses the mobile device to copy the keylogger to his supervisor’s computer due to negative emotions for the company or other reasons. The next day, through the collected information, log in as a supervisor and leave the company after sending a large number of panicked emails within the company.

3.2 Rule-based Feature Extraction

Eldardiry et al. [15] proposed a feature extraction method. Too many features may make the user probability unable to distinguish between normal behavior and abnormal behavior with insufficient data in [16]. This feature extraction method only takes the maximum and average values of some values and ignores the characteristics of the time series of the dataset. The right approach is to divide the behavior with as few features as possible clearly. Our feature extraction method is shown in Table 1.

For example, if the user logs in to the computer during the period from 8 to 9 o’clock, the user’s Feature is marked as 0. Because it is normal working hours, its Hidden Feature is 0. If the total activity of this user from 8 to 9 is N events, we can get data for N groups (Feature, Hidden Feature), and we calculate the proportion of each Feature in N events. Then, the probability of occurrence of event 0 is N0/N. If the probability is 0, we will use the following methods to fill in the probability; if there are 16 feature classes with features 0–15, and the feature 4’s probability is 0, then the probability of feature 4 is 1/(N +16). Similarly, the probability of feature 5 that is not 0 is calculated as (N5+16)/(N +16). We extracted his probability data from 8 o’clock to 9 o’clock every day for more than one year, as shown in Figure 8, and used it for further cluster analysis.

3.3 Proposed Modeling Method

In view of the characteristics of the double-layer HMM, our modeling process is as follows:

  1. 1. The five parts of the dataset are classified according to users and arranged in chronological order.

  2. 2. According to the feature map, we use rules to convert features to numbers from 0 to 15.

  3. 3. Calculate the B parameters required by HMM. If the probability is 0, then Laplace’s theorem is used for processing.

  4. 4. Cluster analysis was performed after adding hidden state work on and work off to the B parameter.

  5. 5. Model the calculated parameters A, B, π.

  6. 6. Test the model with attack data and get the results.

The flowchart of our system is shown in Figure 9.

4. Experiments

Through the process described in Section 3 of the article, we conducted experiments. The experimental environment used Intel’s 6-core CPU and 16 GB of memory, and the operating system used Ubuntu 18.04. Use Python as the language to write programs, and the IDE uses pycharm. The biggest advantage of Python is that it has a rich interface for scientific settlement. The event statistics of the dataset are shown in Table 2.

There are a total of 50 sets of attack data scenarios in the dataset. We extract the features in the form of a feature map and use it to identify each user after modeling.

The dataset contains a total of 1,000 users, and 50 of them have had abnormal behaviors for nearly one and a half years. Our model is to create a model for everyone to distinguish between users and other normal users, and between users and abnormal users. The experimental method is to test each other in a loop, so in order to show the results, we first find one of the representative users to explain. Our experimental goal is only one point, so that each user’s model is sufficiently stable to distinguish between its own behavior, abnormal behavior, and normal behavior mixed with abnormal behavior.

First, we select the user HSB0196 (user‘id), group the dataset, and calculate the hourly behavioral data model of this user from January 4, 2010 to May 17, 2011 based on hourly intervals. In other words, each user has 24 double-layer HMMs. As shown in Figure 8, all time points are integrated at 8 o’clock. The results after clustering using the DBSCAN algorithm are shown in Table 3. The epsilon is 0.9 and minPoints is 6.

This user’s 8 o’clock has a total of 346 times. There are 8 data that cannot be classified. See Figure 10 for details.

According to the ROC curve shown in Figure 11, the results are highly credible.

We will use 50 behavioral data to attack users, 50 normal data, and 50 normal data mixed with abnormal data for testing. All of the 1,000 users, except the users with severe data loss, showed the effect shown in Figure 12. The upper part of the way is normal data, the middle is normal data mixed with abnormal data, and the following is the probability fluctuation of abnormal data. After testing, the detection accuracy of all users is shown in Table 4.

5. Conclusions

This paper provides a new double-layer HMM structure to model user behavior. Our method belongs to anomaly detection. Compared with the Markov chain, our model is more efficient regarding probability. The experimental results show that when the amount of data becomes more significant, the result is more accurate. Compared with other papers, the algorithm is more accurate in distinguishing single users from other users. But a single clustering algorithm may not satisfy a wide variety of users. But the system has some disadvantages; when we face the malicious behavior of users without any data accumulation, we can do nothing about the attack. So, we need more experimental data to confirm our assumptions. In the future, we should consider using adaptive algorithms to model all users more accurately and solve the problem of new users or data-less users.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2018R1C1B3008718).

Fig. 1.

Markov model.

Fig. 2.

Hidden Markov model.

Fig. 3.

Probabilistic parameters of an HMM.

Fig. 4.

HMM simple.

Fig. 5.

Double-layer HMM.

Fig. 6.

HMM simple B to E.

Fig. 7.

DBSCAN simple.

Fig. 8.

Probability dataset for all events at the same time each day by a single user.

Fig. 9.

The flowchart of proposed model.

Fig. 10.

Visualization of clustering results.

Fig. 11.

ROC curve.

Fig. 12.

Experiment results.


Table 1

Feature map

Feature numberHidden Feature numberFeature nameDescription
0Work on: 0Logon (self pc)User logon user’s own computer
1Logoff (self pc)User logoff user’s own computer
2Logon (other pc)User logon other’s own computer
3Logoff (other pc)User logoff other’s own computer
4Connect (self pc)Device connected to user’s own computer
5Disconnect (self pc)Device disconnected to user’s own computer
6Connect (other pc)Device connected to other user’s own computer
7Disconnect (other pc)Device disconnected to other user’s own computer

8Work off: 1Email_InSend email in company
9Email_OutSend email in outside
10Normal-websiteDevice disconnected to user’s own computer
11Danger-websiteWebsites with a tendency to leak or provide high-risk software downloads
12Job-websiteWebsite for posting job information
13File_selfUser modifies files on their own computer
14File_otherUser modifies files on their other computer
15Empty Use15 instead when the number of actions in a fixed time is insufficient

Table 2

Data set event statistics

Partthe number of event

Table 3

The Result of DBSCAN

ClassNumber (%)
086 (25)
1125 (37)
2121 (36)
36 (2)

Table 4

Result of all users (n = 1, 000)

Normal user95094%
Abnormal user5099%

  1. Seals, T. (2015) . 75% of companies are insider threat victims. Available
  2. Ghahramani, Z (2001). An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence. 15, 9-42.
  3. Huang, XD, Ariki, Y, and Jack, MA (1990). Hidden Markov Models for Speech Recognition. Edinburgh, UK: Edinburgh University Press
  4. Wen, Y 2001. Text mining using HMM and PMM. Ph.D. dissertation. University of Waikato. Hamilton, New Zealand.
  5. Bunke, H, and Caelli, T (2001). Hidden Markov Models: Applications in Computer Vision. Singapore: World Scientific
  6. Boys, RJ, Henderson, DA, and Wilkinson, DJ (2000). Detecting homogeneous segments in DNA sequences by using hidden Markov models. Journal of the Royal Statistical Society: Series C (Applied Statistics). 49, 269-285.
  7. Anming, Z, and Chunfu, J 2004. Study on the applications of hidden Markov models to computer intrusion detection., Proceedings of the 5th World Congress on Intelligent Control and Automation (IEEE Cat No 04EX788), Hangzhou, China, Array, pp.4352-4356.
  8. Shin, S, Lee, S, Kim, H, and Kim, S (2013). Advanced probabilistic approach for network intrusion forecasting and detection. Expert Systems with Applications. 40, 315-322.
  9. Kabir, MH, Hoque, MR, Thapa, K, and Yang, SH (2016). Two-layer hidden Markov model for human activity recognition in home environments. International Journal of Distributed Sensor Networks.
  10. Parzen, E (1999). Stochastic Processes. Philadelphia, PA: Society for Industrial and Applied Mathematics
  11. Karlin, S, and Taylor, HE (2014). A First Course in Stochastic Processes. St. Louis, MO: Elsevier
  12. Lamperti, J (2012). Markov transition function. Stochastic Processes: A Survey of the Mathematical Theory. New York, NY: Springer Science & Business Media, pp. 106-131
  13. Baum, LE, and Petrie, T (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics. 37, 1554-1563.
  14. CERT Insider Threat Center. (2011) . The CERT Insider Threat Database. Available
  15. Eldardiry, H, Bart, E, Liu, J, Hanley, J, Price, B, and Brdiczka, O 2013. Multi-domain information fusion for insider threat detection., Proceedings of 2013 IEEE Security and Privacy Workshops, San Francisco, CA, Array, pp.45-51.
  16. Gamachchi, A, Sun, L, and Boztas, S 2017. Graph based framework for malicious insider threat detection., Proceedings of the 50th Hawaii International Conference on System Sciences (HICSS), Hilton Waikoloa Village, HI.
  17. Forney, GD. (2005) . The Viterbi algorithm: a personal history. Available
  18. Jurafsky, D, and Martin, JH (2014). Speech and Language Processing. Harlow, UK: Pearson
  19. Ester, M, Kriegel, HP, Sander, J, and Xu, X 1996. A density-based algorithm for discovering clusters in large spatial databases with noise., Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, pp.226-231.
  20. Glasser, J, and Lindauer, B 2013. Bridging the gap: a pragmatic approach to generating insider threat data., Proceedings of 2013 IEEE Security and Privacy Workshops, San Francisco, CA, Array, pp.98-104.

Xiaoyun Ye was born in Yantai, China, in 1988. He received the master’s degree in Computer Science from Gachon University, pursuing a PhD at Gachon University. His research interests include machine learning, cyber security, intelligent security network, data mining, and pattern recognition.


Sung-Sam Hong was born in Seoul, Korea, in 1983. He received the master’s degree in Computer Science from Gachon University, Korea in 2011 and the Ph.D. degree in Computer Science from Gachon University, Korea in 2016. He is a researcher professor in the Department of Computer Engineering in Gachon University, Korea. His research interests include multimedia security, intelligent information security, machine learning, cryptology, data mining, and big data.


Myung-Mook Han received M.S. degree in computer science from New York Institute of Technology in 1987 and Ph.D. degree in information engineering from Osaka City University in 1997, respectively. From 2004 to 2005, he was a visiting professor at Georgia Tech Information Security Center (GTISC), Georgia Institute of Technology. Currently, He is a professor in the Department of Software, Gachon University, Korea. His research interests include information security, intelligent system, data mining, and big data.