An Enhanced Deep Neural Network-Based Architecture for Joint Extraction of Entity Mentions and Relations
International Journal of Fuzzy Logic and Intelligent Systems 2020;20(1):69-76
Published online March 25, 2020
© 2020 Korean Institute of Intelligent Systems.

Elham Parsaeimehr, Mehdi Fartash, and Javad Akbari Torkestani

Department of Computer Engineering, Islamic Azad University - Arak Branch, Arak, Iran
Correspondence to: Mehdi Fartash
Received February 7, 2020; Revised February 29, 2020; Accepted March 5, 2020.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Named entity recognition and relation extraction are two principal tasks in most natural language processing systems. The majority of methods used in the field address these two tasks independently, which leads to problems such as error propagation from one component (entity detection) to the other (relation extraction). To solve such problems, we propose a new architecture for joint identification of entity mentions and their relations by employing a deep neural network framework. The model not only overcomes the error propagation challenge but also improves the detection results of both tasks because the two tasks cooperate with each other. Experiments on publicly available datasets demonstrate that our joint model surpasses competitors in terms of accuracy. The results highlight the improvement achieved by the proposed deep neural network framework for the entity mention and relation classification tasks. Furthermore, we tested the effect of increasing the sentence length and demonstrated its negative influence on the performance.

Keywords: Entity classification, Relation extraction, Joint model, LSTM-RNN, CNN, NLP
1. Introduction

With the growing amount of data stored on the internet, substantial efforts have been devoted to the field of information extraction, which concerns the extraction of knowledge from such data warehouses. In this field, two processes, named entity recognition (NER) and relation extraction (RE), have attracted particular attention in text mining research. These two processes are mainly employed in natural language processing (NLP) applications such as knowledge base construction (KBC) and question answering (QA) systems in various domains and languages [1].

Traditionally, NER and RE work together in a pipelined model. However, this strategy has some major disadvantages: (i) errors propagate from one process (NER) to the other (RE), and (ii) the helpful information acquired from one process cannot be applied to the other (e.g., recognizing a born in relation between two entity mentions may be conducive for the NER process to extract the types of the two entities, person and country of birth, and vice versa). Alternatively, a few researchers have introduced models in which these two processes contribute jointly by interacting and sharing their parameters. Such joint models can overcome the aforementioned problems and consequently achieve a high performance.

In this paper, we propose a novel architecture for entity and relation classification using a deep neural network framework. Our architecture incorporates the joint modeling of the NER and RE components in a single model by utilizing two different structures (sequential and tree) of LSTM-RNN (long short-term memory recurrent neural network) together with a CNN (convolutional neural network). More precisely, the novelty of our architecture is the combination of two different neural network models (CNN and LSTM) in the NER component. This model enables the simultaneous classification of the entity mentions and their relation type with higher accuracy. In this neural network-based framework, the parameters are updated via classification feedback obtained from both processes. Specifically, the model employs bidirectional LSTM-RNN (Bi-LSTM-RNN) and CNN models to automatically learn the features of an input sentence and then detects the relation between an entity pair by using a bidirectional tree-structured LSTM. Experiments on the ACE05 and KBP37 datasets show that our architecture surpasses competitor models in performance.

2. Related Works

Multiple studies have been performed in recent years on entity pair detection and relation extraction. Based on their representative strategies, the studies can be divided into two main groups: where the NER and RE processes perform in a pipeline, and where the processes collaborate jointly.

The following subsections discuss previous works on both the NER and RE tasks as well as their joint models.

2.1 Named Entity Recognition

In the recent decade, several efforts have been devoted to studying various methods for entity identification in unstructured documents across different fields. Most traditional methods, such as the maximum margin Markov network [2], conditional random field (CRF) [3], and support vector machine [4], require extensive effort for engineering and extracting the features. Recently, some architectures with RNN and CNN approaches have been combined with the CRF algorithm for entity detection tasks [5, 6]. Such methods, which do not employ manual feature extraction techniques, have reported a noticeably good performance.

2.2 Relation Extraction

Relation extraction is another important process in NLP that has attracted significant attention. Two approaches have been used to perform this task: the manual feature-based approach and the deep neural network approach. With regard to the first approach, supervised machine learning algorithms such as kernel methods have been employed for detection [7]. These types of algorithms use the syntactic and semantic relations of the parse tree, in addition to the lexical features, to capture the structure of the input sentence [8]. To alleviate the dependence on feature engineering, deep neural network-based models have been developed. Researchers have also been able to reach a better semantic understanding of natural language through the automatic extraction of lexical and sentence-level features by employing CNN-based [9] and RNN-based models [10].

2.3 Associative Identification of Entity Type and Relation Class

Most recent research on associative (joint) identification of entity mentions and relation types has been feature based, such as the probabilistic graphical model [11], card-pyramid parsing model [12], and history-based structured model [13]. These systems depend on the extraction of handcrafted features by NLP methods (e.g., POS tagging); however, manual feature engineering is a time-consuming process, and using NLP tools may increase computation and error propagation.

In [14–16], deep neural network frameworks have been introduced that employ RNNs and CNNs to handle the problem in a joint structure. Specifically, Miwa and Bansal [14] employed bidirectional tree-structured LSTM-RNNs to model the dependency structure between a word pair. The researchers in [16] utilized the idea given in [14] for extraction of biomedical entities (drug and disease) and relations, reporting noticeable accuracy. The authors in [17] proposed a table-filling multitasking RNN that can model several relation instances. The study in [18] presented an attention-based LSTM that identifies the semantic relations between entity pairs without using a dependency tree.

Similar to Miwa and Bansal [14], we use a bidirectional tree LSTM for describing the dependency structure between the target entity pairs.

3. Proposed Architecture

In this section we describe our end-to-end (joint) architecture (Figure 1), which enables the simultaneous detection of the entity type and the relation class. Our model comprises four components: embedding layer, Bi-LSTM layer, entity type detection module, and relation extraction module. In the following sub-sections, these components are described in detail.

3.1 Embedding Layer

In this layer, each word in one-hot format is transformed into an embedded representation; that is, each word $w_i$ is converted into a real-valued vector $x_i$. For this purpose, we look up the Glove embedding matrix for each word in sentence $s$, where $\mathrm{Glove} \in \mathbb{R}^{d \times |v|}$, $v$ is a fixed-size vocabulary, and $d$ is a hyper parameter that determines the dimension of the word embedding. A word is transformed into its embedding $x_i$ with a matrix-vector product, as shown in Eq. (1):

$$x_i = \mathrm{Glove} \cdot w_i \tag{1}$$

where $w_i$ represents the one-hot format of the word. Eventually, the input sentence in the embedded form, $emb_s = \{x_1, x_2, \ldots, x_l\} \in \mathbb{R}^{l \times d}$, is fed to the next layer.
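The matrix-vector product in Eq. (1) amounts to selecting one column of the embedding matrix. The following minimal sketch illustrates this with a toy vocabulary and made-up 3-dimensional vectors (both are assumptions for illustration, not real Glove values); the helper names are hypothetical.

```python
# Toy illustration: multiplying the embedding matrix by a one-hot vector
# selects one column, so the product can be replaced by a table lookup.
# The vocabulary and vector values are invented for this example.

vocab = ["anthony", "minghella", "was", "born", "in", "england"]
d = 3  # embedding dimension (Table 2 of the paper lists d = 300)

# Embedding matrix with shape d x |v|, stored as d rows of |v| values.
glove = [[0.1 * (row + 1) * (col + 1) for col in range(len(vocab))]
         for row in range(d)]

def one_hot(index, size):
    """One-hot vector w_i with a single 1 at the given index."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def embed_by_product(word):
    """x_i = Glove . w_i, computed as an explicit matrix-vector product."""
    w = one_hot(vocab.index(word), len(vocab))
    return [sum(g * wk for g, wk in zip(row, w)) for row in glove]

def embed_by_lookup(word):
    """Equivalent column lookup, which is what is done in practice."""
    col = vocab.index(word)
    return [row[col] for row in glove]

sentence = ["anthony", "minghella", "was", "born", "in", "england"]
emb_s = [embed_by_lookup(w) for w in sentence]  # l x d
```

Both functions return the same vector for any word in the vocabulary, which is why embedding layers are usually implemented as lookups rather than products.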

3.2 Bi-LSTM

This layer models the sentential context information of the input sentence, as shown at the bottom of Figure 1.

The Bi-LSTM component [19] comprises two parallel layers: a forward LSTM layer and a backward LSTM layer. At time-step $t$, the LSTM unit consists of a collection of vectors of dimension $n_{lt}$: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $c_t$. The new vectors are calculated using the equations shown in Eq. (2):


Here, $\sigma$ is the logistic function and $\odot$ denotes element-wise multiplication. $W$, $U$, and $b$ denote the weight matrices and bias vector, respectively.

We utilize the Bi-LSTM to capture the past and the future information. In particular, for a current word $x_t$, the forward LSTM layer encodes the input information from the past ($x_1$) to the current time frame ($x_t$) and presents it as $h_t^f$. Simultaneously, the backward LSTM layer encodes the input information from the future ($x_l$) to the current time frame ($x_t$) and presents it as $h_t^b$. We then concatenate the two hidden state vectors of the current word $x_t$, $h_t = [h_t^f; h_t^b]$, and pass the result to the next module.
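As a rough illustration of this layer, the sketch below runs a forward and a backward LSTM over a toy sequence and concatenates the two hidden states per word. It assumes the standard LSTM formulation (sigmoid gates $i_t$, $f_t$, $o_t$, a tanh candidate, and $h_t = o_t \odot \tanh(c_t)$), which matches the symbols described above but is not taken verbatim from the paper; all function names, sizes, and random weights are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Compute W x + U h + b, with W and U given as lists of rows."""
    return [sum(wk * xk for wk, xk in zip(W[r], x))
            + sum(uk * hk for uk, hk in zip(U[r], h)) + b[r]
            for r in range(len(b))]

def lstm_step(params, x, h_prev, c_prev):
    """One standard LSTM step (assumed form of the paper's Eq. (2))."""
    pre = {name: affine(*params[name], x, h_prev) for name in "ifou"}
    i = [sigmoid(z) for z in pre["i"]]      # input gate
    f = [sigmoid(z) for z in pre["f"]]      # forget gate
    o = [sigmoid(z) for z in pre["o"]]      # output gate
    u = [math.tanh(z) for z in pre["u"]]    # candidate values
    c = [ik * uk + fk * ck for ik, uk, fk, ck in zip(i, u, f, c_prev)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def init_params(d, n, rng):
    """Random (W, U, b) triples for the three gates and the candidate."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {name: (mat(n, d), mat(n, n), [0.0] * n) for name in "ifou"}

def run_lstm(params, xs, n):
    h, c, states = [0.0] * n, [0.0] * n, []
    for x in xs:
        h, c = lstm_step(params, x, h, c)
        states.append(h)
    return states

def bi_lstm(fwd, bwd, xs, n):
    """Concatenate forward and backward states: h_t = [h_t^f; h_t^b]."""
    forward = run_lstm(fwd, xs, n)
    backward = run_lstm(bwd, xs[::-1], n)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

rng = random.Random(0)
d, n = 3, 4  # toy sizes; Table 2 lists d = 300 and 100 hidden units
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]
hidden = bi_lstm(init_params(d, n, rng), init_params(d, n, rng), xs, n)
```

Each element of `hidden` has length `2 * n`, because the forward and backward states are concatenated per word.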

3.3 Named Entity Recognition Module

The novelty of our architecture lies in this module, which concatenates the outputs of the LSTM and CNN. The working of this module is explained in more detail below.

We use the BILOU encoding scheme for the entity detection process, where each entity tag indicates the class of the entity as well as the position of the word within the detected entity boundary (Begin, Inside, Last, Outside, Unit).

For example, in the phrase Anthony Minghella, which is detected as the entity type “Person”, as shown in Figure 1, we use a B-PER tag to mark the first word and an L-PER tag to mark the last word of the identified entity expression. Our NER module comprises two phases: a convolutional phase and an output phase, which are explained in detail in the following subsections.
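The BILOU scheme can be illustrated with a small helper that converts entity spans into per-word tags. This is a minimal sketch: the function name, the span format, and the example sentence (assembled from the words mentioned in the text) are assumptions for illustration.

```python
def bilou_tags(num_tokens, entities):
    """Build BILOU tags from entity spans.

    entities: list of (start, end, label) with token indices, end inclusive.
    """
    tags = ["O"] * num_tokens
    for start, end, label in entities:
        if start == end:
            tags[start] = "U-" + label          # single-word entity
        else:
            tags[start] = "B-" + label          # first word
            for k in range(start + 1, end):
                tags[k] = "I-" + label          # inner words
            tags[end] = "L-" + label            # last word
    return tags

tokens = ["Anthony", "Minghella", "was", "born", "in", "England"]
tags = bilou_tags(len(tokens), [(0, 1, "PER"), (5, 5, "COUNTRY")])
# tags == ['B-PER', 'L-PER', 'O', 'O', 'O', 'U-COUNTRY']
```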

Convolutional phase

We extract two feature vectors for the two entities in the input sentence using a mixture CNN (Mix-CNN) [20]. As described in Figure 2, this phase identifies the semantic properties of the target entity pair from the textual words; more precisely, an entity’s properties may be reflected by its surrounding words (the previous and the next words). As shown in Figure 2, CNN$_{\pm 1}$ identifies the textual semantic features based on the words in the range Anthony to was in the given sentence. Similarly, CNN$_{\pm j}$ extracts the semantic information of the entity based on the $2j$ words surrounding the target word Minghella (i.e., $j$ words both before and after Minghella). $W^{E1}_{ji}$ represents the $i$-th filter of CNN$_{\pm j}$ in the Mix-CNN for entity $e_1$, and $W^{E2}_{ji}$ indicates the $i$-th filter of CNN$_{\pm j}$ for entity $e_2$. The feature extracted by $W^{E1}_{ji}$ for entity $e_1$ is represented as $Z^e_{ji}$. Thus, the $j$-th contextual information of entity $e_1$ is $E1_j = [Z^e_{j1}, Z^e_{j2}, \ldots, Z^e_{jn_e}]$, where $n_e$ represents the number of filters in the Mix-CNN. Because different entities depend on the textual words to different degrees, a max-pooling process is applied to combine the features obtained with CNN$_{\pm(1, 2, \ldots, j)}$, as shown in Eq. (3), for use in the subsequent computations.

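$$E1_s = \begin{pmatrix} \max(Z^e_{11}, Z^e_{21}, \ldots, Z^e_{j1}) \\ \vdots \\ \max(Z^e_{1n_e}, Z^e_{2n_e}, \ldots, Z^e_{jn_e}) \end{pmatrix} \tag{3}$$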
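The two operations in this phase, selecting a symmetric context window around an entity word and max-pooling the filter outputs across window sizes as in Eq. (3), can be sketched as follows. The filter outputs are made-up numbers and the helper names and padding token are hypothetical; the convolution itself is omitted.

```python
def context_window(tokens, target, j, pad="<pad>"):
    """Return the target word with j words on each side, padded at the edges."""
    return [tokens[k] if 0 <= k < len(tokens) else pad
            for k in range(target - j, target + j + 1)]

def max_pool_over_windows(features_per_window):
    """Element-wise maximum over window sizes, as in Eq. (3).

    features_per_window[j][i] is the output of filter i for the (j+1)-th window size.
    """
    return [max(column) for column in zip(*features_per_window)]

tokens = ["Anthony", "Minghella", "was", "born", "in", "England"]
window_1 = context_window(tokens, 1, 1)  # context seen by CNN±1 for "Minghella"
window_3 = context_window(tokens, 1, 3)  # context seen by CNN±3, padded on the left

# Invented filter outputs for three window sizes and n_e = 3 filters.
E1 = [
    [0.2, -0.5, 0.9],  # E1_1
    [0.4, -0.1, 0.3],  # E1_2
    [0.1,  0.0, 0.7],  # E1_3
]
E1_s = max_pool_over_windows(E1)  # [0.4, 0.0, 0.9]
```

Here `window_1` covers the range Anthony to was, matching the CNN±1 example in the text.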
Output phase

After obtaining the sentential context information from the Bi-LSTM layer and the semantic features of the two entities from the Mix-CNN, the collected information is concatenated to obtain $f = [h_t, E1_s, E2_s]$. In this phase, a softmax classifier with dropout is utilized, as shown in Eq. (4).


Here, $W \in \mathbb{R}^{m \times (2 \times l + 2 \times n_e)}$ is the weight matrix between the concatenated vector $f$ and the layer of labels; $m$ is the total number of entity type classes; and $r \in \mathbb{R}^{(2 \times l + 2 \times n_e)}$ is a binary mask vector drawn from a Bernoulli distribution with probability $\rho$. Dropout prevents overfitting and leads to a more robust model. In Eq. (4b), $p_i$ represents the detection probability of entity class $i$.
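A softmax classifier with a dropout mask of this kind might look like the sketch below. It assumes the scores take the form $W(f \odot r) + b$ followed by a softmax, and it treats $\rho$ as the drop ratio (Table 2 calls it the ratio of dropout); both are assumptions, as are the toy sizes, random weights, and function names.

```python
import math
import random

def bernoulli_mask(size, drop_ratio, rng):
    """Binary mask r: each entry is 0 with probability drop_ratio, else 1."""
    return [0.0 if rng.random() < drop_ratio else 1.0 for _ in range(size)]

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entity_probabilities(f, W, b, mask):
    """Apply the mask to f, compute linear scores, and normalize."""
    masked = [fk * rk for fk, rk in zip(f, mask)]
    scores = [sum(wk * xk for wk, xk in zip(row, masked)) + bk
              for row, bk in zip(W, b)]
    return softmax(scores)

rng = random.Random(0)
feature_size, num_classes = 6, 3  # toy sizes, not the paper's dimensions
f = [rng.uniform(-1, 1) for _ in range(feature_size)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(feature_size)]
     for _ in range(num_classes)]
b = [0.0] * num_classes
mask = bernoulli_mask(feature_size, 0.3, rng)
p = entity_probabilities(f, W, b, mask)
predicted_class = max(range(num_classes), key=lambda i: p[i])
```

The mask is only sampled during training; at prediction time dropout is normally switched off, which corresponds to a mask of all ones in this sketch.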

3.4 Relation Classification Module

This module recognizes a relation between the detected entities from the dependency tree. The shortest path between the two entities is employed because it summarizes the significant words that express their relation. For example, as Figure 2 depicts, the shortest path for the two entities Minghella and England contains the key phrase born in. We utilize the bidirectional tree LSTM [14] to extract a relation type by modeling the dependency structure of the entity pair and the words surrounding them in the input sentence. As explained in [14], this bidirectional structure propagates information to each node both from the leaves upward and from the root downward. The bidirectional tree-structured LSTM shares the weight vectors among children of the same type and permits a variable number of children.
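The shortest path between two entity words in a dependency tree can be found with a breadth-first search over the undirected tree. The sketch below uses a hand-written head array for the example sentence; that parse is a hypothetical illustration rather than the output of the parser used in the paper, and the function name is an assumption.

```python
from collections import deque

def shortest_dependency_path(heads, source, target):
    """Return the token indices on the path from source to target.

    heads[k] is the index of the head of token k, or -1 for the root.
    The input is assumed to be a single connected tree.
    """
    neighbors = {k: set() for k in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].add(head)
            neighbors[head].add(child)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt in sorted(neighbors[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    path, node = [], target
    while node is not None:       # walk back from target to source
        path.append(node)
        node = previous[node]
    return path[::-1]

tokens = ["Anthony", "Minghella", "was", "born", "in", "England"]
# Hypothetical parse: "born" is the root; "Minghella", "was", and "in"
# attach to "born"; "Anthony" attaches to "Minghella"; "England" to "in".
heads = [1, 3, 3, -1, 3, 4]
path = shortest_dependency_path(heads, 1, 5)
path_words = [tokens[k] for k in path]
# path_words == ['Minghella', 'born', 'in', 'England']
```

Under this assumed parse, the path contains the key phrase born in, as described in the text.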


At time-step t, the vectors in the LSTM unit with C(t) children are calculated according to Eq. (5):


Here, $m(\cdot)$ is a type mapping function. The LSTM unit in the tree structure receives $x_t^d = [h_t; e_t; x_t]$ as input, that is, the concatenation of the hidden state vector $h_t$ from the Bi-LSTM layer, the predicted entity label $e_t$, and the embedded form of the word $x_t$.

Relation classification

We use the last word of each detected entity phrase for relation classification; in other words, only the words with L or U tags in the BILOU scheme are considered. For instance, as shown in Figure 1, we perform a relation classification using Minghella with the L-PER (Person) tag and England with the U-COUNTRY tag.
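Selecting the relation candidates from BILOU tags can be sketched as below. The sketch pairs every two L- or U-tagged words in sentence order; whether both orderings of a pair are also scored is not stated in the text, so that choice and the function names are assumptions.

```python
from itertools import combinations

def entity_last_words(tags):
    """Return (index, entity_type) for every word tagged L-* or U-*."""
    return [(k, tag[2:]) for k, tag in enumerate(tags)
            if tag.startswith(("L-", "U-"))]

def relation_candidates(tags):
    """Pair up the last words of all detected entities."""
    return list(combinations(entity_last_words(tags), 2))

tags = ["B-PER", "L-PER", "O", "O", "O", "U-COUNTRY"]
candidates = relation_candidates(tags)
# candidates == [((1, 'PER'), (5, 'COUNTRY'))]
```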

The relation candidate vector is formed as the concatenation dp = [↑ hp; ↓ hp1; ↓ hp2], where ↑ hp is the hidden state vector of the top LSTM unit in the bottom-up LSTM-RNN (representing the lowest common ancestor of the two target entities) and ↓ hp1, ↓ hp2 are the hidden state vectors of the two LSTM units representing the first and the second entities in the top-down LSTM-RNN. ↑ hp, ↓ hp1, and ↓ hp2 appear as arrows in Figure 1. According to Eq. (6), the output phase is a softmax classifier with weight matrices W and bias vectors b.


Here, $w_{rh}$ and $b_{rh}$ are the weight matrices and bias vectors, respectively.

4. Experiments

In this section, the datasets we used in our experiments are introduced, and the values of the hyper parameters and evaluation metrics are presented.

4.1 Dataset

We conducted experiments on two public datasets: ACE05 and KBP37.


There are seven entity types and seven relation classes in the ACE05 dataset. The entity types are Organization, Location, Person, Vehicle, Geo-Political Entities, Facility, and Weapon. The relation classes are Org-Affiliation, Physical, Artifact, Part-Whole, Person-Social, Gen-Affiliation, and Metonymy.


The KBP37 corpus defines eight coarse-grained entity types and 37 relation classes. The entity types are Person, Organization, Country, City, State or Province, Affiliation, Nationality, and Number. We divided the data source into three groups: training (15,917), development (1,724), and test (3,405). The relation types of the dataset are shown in Table 1. As shown there, the corpus has 18 directional relations (each counted in both directions) and one additional no-relation class, resulting in 18 × 2 + 1 = 37 relation classes.

4.2 Metrics

The metrics we used to evaluate the performance of our architecture are Precision, Recall, and F-Measure (F1). We considered a relation for a word pair as correct only when the types of both entities and the relation type were all classified correctly.
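This strict matching criterion can be expressed by comparing full relation tuples, so that a prediction with a wrong entity type does not count as correct even if the relation label matches. The tuple layout, label strings, and toy data below are assumptions for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Micro precision, recall, and F1 over sets of relation tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Tuple layout: (sentence_id, (e1_index, e1_type), (e2_index, e2_type), relation)
gold = {
    (0, (1, "PER"), (5, "COUNTRY"), "country_of_birth"),
    (1, (0, "PER"), (3, "ORG"), "employee_of"),
}
predicted = {
    (0, (1, "PER"), (5, "COUNTRY"), "country_of_birth"),  # fully correct
    (1, (0, "PER"), (3, "LOC"), "employee_of"),           # wrong entity type
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
# precision == 0.5, recall == 0.5, f1 == 0.5
```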

4.3 Hyper Parameters

We implemented our end-to-end architecture in Python using the Keras library. The hyper parameters used in the experiments are presented in Table 2.

4.4 Results


We compared the performance of our model with that of other systems on the ACE05 and KBP37 datasets. The compared systems included: (i) NER without CNN, which is the proposed NER model without the CNN phase and concatenation operation, as given by Miwa and Bansal [14], and (ii) a pipeline NER model and a pipeline RE model, with no interaction.

The comparison results show that our model works better than that of Miwa and Bansal [14] on the ACE05 dataset (Table 3). Adding the CNN to the NER module is effective, and this enhancement propagates to the next module: according to Table 3, the F1-score improves by 1.3 percentage points for entity classification (83.1% vs. 81.8%) and by 0.6 percentage points for relation identification (52.4% vs. 51.8%). This suggests that concatenating the feature vectors of the Bi-LSTM-RNN and CNN provides more effective information for entity extraction.

Table 4 displays the evaluation results of two different variations of our system on the KBP37 dataset: (i) the system without CNN and (ii) the system with a pipeline approach. The results show that adding the CNN to the proposed architecture increases the F1-score of the two processes by approximately 1% on this corpus. They also demonstrate that although false entity tag assignments in the NER module have a negative effect on the performance of the RE module in an end-to-end model, the RE module in our architecture provides better results than in the pipeline model. The reason is that the two modules (entity detection and relation extraction) in a pipeline model are trained independently, without any interaction between them such as sharing the underlying Bi-LSTM-RNN layer. Conversely, in our model, the NER and RE modules learn the entities and their relations interactively and simultaneously.

We can observe that better experimental results were obtained on ACE05 than on KBP37. This may be because the average sentence length in ACE05 is shorter, which causes fewer complications and enables higher accuracy. Another reason may be the number of relation types in the two corpora: there are 37 relation classes in KBP37, significantly more than in ACE05. Therefore, the performance decreases as the complexity increases.

Effect of sentence length on performance

Figure 3 depicts the relation extraction performance of the models on sentences of different lengths on ACE05 and KBP37. The x-axis represents the sentence length and the y-axis indicates the F-measure values. No sentence in either dataset exceeds 60 words. The F-measure is calculated over the sentences whose length falls in the range [m – 9, m], where m ∈ {10, 20, ..., 60}.
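Grouping sentences into the length ranges [m – 9, m] and computing one F-measure per range could be done as in the following sketch. The per-sentence counts are made up, and pooling the counts within each range before computing F1 is an assumption about how the scores were aggregated.

```python
from collections import defaultdict

def length_bucket(num_words, width=10):
    """Upper bound m of the range [m - 9, m] that contains num_words."""
    return ((num_words + width - 1) // width) * width

def f1_by_length(records):
    """records: iterable of (num_words, n_correct, n_predicted, n_gold)."""
    totals = defaultdict(lambda: [0, 0, 0])
    for num_words, correct, predicted, gold in records:
        bucket = totals[length_bucket(num_words)]
        bucket[0] += correct
        bucket[1] += predicted
        bucket[2] += gold
    scores = {}
    for m, (correct, predicted, gold) in sorted(totals.items()):
        p = correct / predicted if predicted else 0.0
        r = correct / gold if gold else 0.0
        scores[m] = 2 * p * r / (p + r) if p + r else 0.0
    return scores

# Invented per-sentence counts of correct, predicted, and gold relations.
records = [(8, 2, 2, 2), (10, 1, 2, 1), (11, 1, 2, 2), (35, 0, 1, 2)]
scores = f1_by_length(records)  # keys are the bucket upper bounds 10, 20, 40
```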

Compared with the pipeline approach, our model displays higher accuracy, and this improvement suggests that the interaction of the two modules is beneficial (e.g., identifying a born in relation may help the NER module in detecting the types of the two entities, i.e., Person and Countries of birth, and vice versa).

In addition, Figure 3 demonstrates that increasing the sentence length has a negative impact on the F-measure: the longer the sentence, the lower the F-measure.

5. Conclusion

In this research, we proposed a novel architecture for entity and relation classification using a deep neural network framework. Our architecture introduces joint modeling of the NER and RE components in a single model by combining two different structures (sequential and tree) of LSTM-RNN with a CNN. This model classifies entity mentions and their relation type simultaneously via a cooperative deep neural network. In this neural network-based framework, the parameters are updated via classification feedback received from both processes. In particular, the model employs bidirectional LSTM-RNN and CNN models to automatically learn the features of the input sentence and then detects the relation between the entity pair by using a bidirectional tree-structured LSTM. The results of experiments on the ACE05 and KBP37 datasets show that our architecture surpasses competitor models in performance. The experimental results also show that the performance of the method deteriorates with increasing sentence length.

In the future, we aim to work on Persian language texts and implement the NER and RE tasks on them. Furthermore, we would like to investigate sentences with more than two entity mentions and more relation types.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

Fig. 1. The proposed architecture for joint entity and relation extraction.

Fig. 2. The mixture CNN (Mix-CNN) module.

Fig. 3. Results vs. sentence length.


Table 1

Relation types in KBP37 dataset

| Relation type | Relation type |
| --- | --- |
| Person, alternate names | Organization, alternate names |
| Person, origin | Organization, subsidiaries |
| Person, spouse | Organization, top members/employees |
| Person, title | Organization, founded |
| Person, employee of | Organization, founded by |
| Person, countries of residence | Organization, country of headquarters |
| Person, state, or provinces of residence | Organization, state, or province of headquarters |
| Person, cities of residence | Organization, city of headquarters |
| Person, country of birth | Organization, members |
| No relation | - |

Table 2

Hyper parameter list used in the proposed architecture

| Hyper parameter | Explanation | Value |
| --- | --- | --- |
| $d$ | Dimension of word embedding | 300 |
| $n_{lt}$ | The number of hidden units of Bi-LSTM | 100 |
| $n_e$ | The number of CNN filters in NER module | 100 |
| $j$ | The number of CNNs in Mix-CNN layer | 3 |
| $\rho$ | The ratio of dropout | 0.3 |

Table 3

Comparison of proposed architecture on ACE05 dataset


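| Model | Entity Precision (%) | Entity Recall (%) | Entity F1 (%) | Relation Precision (%) | Relation Recall (%) | Relation F1 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Our architecture | 82.5 | 83.8 | 83.1 | 51.1 | 53.8 | 52.4 |
| Miwa and Bansal [14] | 81.5 | 82.1 | 81.8 | 50.6 | 52.9 | 51.8 |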

Table 4

Comparison of proposed architecture on KBP37 dataset

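| Model | Entity Precision (%) | Entity Recall (%) | Entity F1 (%) | Relation Precision (%) | Relation Recall (%) | Relation F1 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Our architecture | 80.1 | 81.7 | 80.9 | 47.3 | 48.6 | 47.9 |
| Our architecture - CNN | 79.0 | 80.2 | 79.6 | 46.1 | 47.8 | 46.9 |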

References

  1. Kang, BY, and Kim, DW (2012). A multi-resolution approach to restaurant named entity recognition in Korean web. International Journal of Fuzzy Logic and Intelligent Systems. 12, 277-284.
  2. Taskar, B, Guestrin, C, and Koller, D (2003). Max-margin Markov networks. Advances in Neural Information Processing Systems. 16, 25-32.
  3. Lafferty, J, McCallum, A, and Pereira, FC (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning, San Francisco, CA, pp. 282-289.
  4. Tsochantaridis, I, Hofmann, T, Joachims, T, and Altun, Y (2004). Support vector machine learning for interdependent and structured output spaces. Proceedings of the 21st International Conference on Machine Learning, Helsinki, Finland, pp. 104-112.
  5. Collobert, R, Weston, J, Bottou, L, Karlen, M, Kavukcuoglu, K, and Kuksa, P (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research. 12, 2493-2537.
  6. Ma, X, and Hovy, E (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, pp. 1064-1074.
  7. Culotta, A, and Sorensen, J (2004). Dependency tree kernels for relation extraction. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain, pp. 423-429.
  8. Rink, B, and Harabagiu, S (2010). UTD: classifying semantic relations by combining lexical and semantic resources. Proceedings of the 5th International Workshop on Semantic Evaluation, Los Angeles, CA, pp. 256-259.
  9. Zeng, D, Liu, K, Lai, S, Zhou, G, and Zhao, J (2014). Relation classification via convolutional deep neural network. Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 2335-2344.
  10. Xu, Y, Mou, L, Li, G, Chen, Y, Peng, H, and Jin, Z (2015). Classifying relations via long short term memory networks along shortest dependency paths. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1785-1794.
  11. Yang, B, and Cardie, C (2013). Joint inference for fine-grained opinion extraction. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp. 1640-1649.
  12. Kate, RJ, and Mooney, RJ (2010). Joint entity and relation extraction using card-pyramid parsing. Proceedings of the 14th Conference on Computational Natural Language Learning, Uppsala, Sweden, pp. 203-212.
  13. Singh, S, Riedel, S, Martin, B, Zheng, J, and McCallum, A (2013). Joint inference of entities, relations, and coreference. Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, San Francisco, CA, pp. 1-6.
  14. Miwa, M, and Bansal, M (2016). End-to-end relation extraction using LSTMs on sequences and tree structures. Available
  15. Zheng, S, Hao, Y, Lu, D, Bao, H, Xu, J, Hao, H, and Xu, B (2017). Joint entity and relation extraction based on a hybrid neural network. Neurocomputing. 257, 59-66.
  16. Li, F, Zhang, M, Fu, G, and Ji, D (2017). A neural joint model for entity and relation extraction from biomedical text. BMC Bioinformatics. 18, article no. 198.
  17. Gupta, P, Schutze, H, and Andrassy, B (2016). Table filling multitask recurrent neural network for joint entity and relation extraction. Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 2537-2547.
  18. Katiyar, A, and Cardie, C (2017). Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 917-928.
  19. Schuster, M, and Paliwal, KK (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing. 45, 2673-2681.
  20. Zheng, S, Xu, J, Zhou, P, Bao, H, Qi, Z, and Xu, B (2016). A neural network framework for relation extraction: learning entity semantic and relation pattern. Knowledge-Based Systems. 114, 12-23.

Elham Parsaeimehr is a Ph.D. student in the Department of Computer Engineering, Arak Branch, Islamic Azad University. She received a master’s degree in Computer Engineering from the Science and Research Branch, Islamic Azad University, Iran. Her research interests are data mining, machine learning, and text mining.

Mehdi Fartash received his Ph.D. from the Science and Research Branch, Islamic Azad University, Tehran, Iran. He is currently an Assistant Professor in the Department of Computer Engineering, Arak Branch, Islamic Azad University. His research interests are data mining, bio-inspired speech and acoustic processing, computational intelligence, and distributed computing.

Javad Akbari Torkestani received the Ph.D. degree in Computer Engineering from Science and Research University, Iran, in 2009. Currently, he is an Associate Professor with the Department of Computer Engineering, Arak Branch, Islamic Azad University, Arak, Iran. Prior to his current position, he was an Assistant Professor from 2010 to 2013. He joined the faculty of the Computer Engineering Department at Arak Azad University as a lecturer in 2005. His research interests include wireless networks, Web engineering, fault tolerant systems, grid computing, learning systems, parallel algorithms, soft computing, and data mining.