## Original Article

Int. J. Fuzzy Log. Intell. Syst. 2017; 17(1): 26-34

Published online March 31, 2017

https://doi.org/10.5391/IJFIS.2017.17.1.26

© The Korean Institute of Intelligent Systems

## Plant Leaf Recognition Using a Convolution Neural Network

Wang-Su Jeon1, and Sang-Yong Rhee2

1Department of IT Convergence Engineering, Kyungnam University, Changwon, Korea, 2Department of Computer Engineering, Kyungnam University, Changwon, Korea

Correspondence to :
Sang-Yong Rhee (syrhee@kyungnam.ac.kr)

Received: February 1, 2017; Revised: February 23, 2017; Accepted: March 24, 2017

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

### Abstract

There are hundreds of kinds of trees in the natural ecosystem, and distinguishing between them can be very difficult. Botanists and other plant specialists, however, can often identify the type of tree at a glance from the characteristics of its leaves. Machine learning can be used to classify leaf types automatically. Since 2012, the field has grown rapidly on the basis of deep learning, a technique that learns representations directly from large amounts of data; recent developments in hardware and big data have made the technique practical. We propose a method for classifying leaves using a convolutional neural network (CNN), a model widely used when applying deep learning to image processing.

Keywords: Leaf, Classification, Visual system, CNN, GoogleNet

### 1. Introduction

When leaving a town and entering the suburbs, we encounter many kinds of trees. We may be able to identify the trees that commonly grow along urban streets, but most of the trees and plants found outside cities are unknown to the majority of us. There are approximately 100,000 species of trees on earth, accounting for about 25% of all plants. Many grow in tropical regions where only limited botanical research has been carried out, so it is believed that many species remain undiscovered [1]. Identifying large numbers of such trees is clearly a complex task.

An example of the complexity of tree identification can be seen with plums and apricots. These are very similar in leaf shape, the shape of the tree, and even the shape of the young fruit. The flower shape is also very similar, and the tree type can only be identified by determining whether the calyx is attached to, or inverted relative to, the petal. Additionally, some trees are not easily distinguishable except at particular times; for example, when they bloom or bear fruit. To identify trees like these, considerable information is required, including leaf shape, the shape of the leaves directly attached to branches, branch shape, the shape of the whole tree, tree size, flower shape, flowering time, and fruit.

When using branches of biology such as cell biology, molecular biology, phytochemistry, or morphologic anatomy, it may be possible to distinguish plants without time constraints. However, it is unrealistic for the general public to identify the names of trees or plants using these methods when, for example, they are walking in a woodland.

Since the public usually distinguishes trees by their appearance, studies have been carried out to recognize trees using appearance alone. Du et al. [5] converted a color input image into a binarized image to extract the outline, from which two-dimensional features were then extracted. These features were classified using the Move Median Centers (MMC) classifier. This study showed faster execution speeds than previous studies and generated accurate results by combining the characteristics. However, the recognition rate was only approximately 90%.

Nam and his colleagues [2, 3] used a shape-based search method to distinguish plants. They tried to improve accuracy by using not only the outline but also the vein data of the leaves. Outline recognition was improved using the minimum perimeter polygons (MPP) algorithm, and the vein data were represented using extended curvature scale space (CSS) to extract the midrib, intersections, and endpoints. A weighted graph was composed from the relationships between the feature points and their distance values, and similarity was calculated from this graph. A total of 1,032 plant leaf images from botanical illustrations were used; however, the exact recognition rate was not given, because the research concerned database searches rather than plant identification. Nevertheless, a result graph showed better search performance than the existing studies.

Because recognition is a considerable challenge in the field of computer vision, development has been driven by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC); however, until 2011, the top-5 error rate of the best recognition technique was 25.8%. With the emergence of AlexNet [4] in 2012, the error rate dropped sharply to 16.4%, and it fell further to 2.99% in 2016. This improvement over traditional machine learning methods, which classify data only after hand-crafted feature extraction or preprocessing, motivates our approach. In this paper, we study a method for learning and recognizing types of leaves using a convolutional neural network (CNN), a deep learning technology.

The system proposed in this paper is constructed as shown in Figure 1. The proposed method improves classification performance by using a CNN that extracts and learns feature points.

In Section 2, we examine existing leaf recognition research. In Section 3, we describe GoogleNet, a CNN that imitates the human visual system. Section 4 explains the leaf recognition system, Section 5 describes the experiment and analyzes the results, and Section 6 concludes the paper.

### 2.1 Feature Extraction

In previous studies, leaf color, contour, texture, and shape were used to classify plants. As shown in Figure 2, the color image is transformed into a grayscale image by applying Eq. (1), the grayscale image is then converted to a binary one through thresholding, and the contour is extracted. Features are computed from the characteristics of the contour line [5]. Using these features, the recognition rate was 90% when classified through machine learning. Because the shapes of leaf outlines are similar to one another, these features alone make it difficult to classify plants.

Gray = 0.299 × I_R + 0.587 × I_G + 0.114 × I_B. (1)
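Eq. (1) is the standard luma weighting for converting RGB to grayscale. A minimal sketch, assuming an H × W × 3 RGB array:

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an H x W x 3 RGB array to grayscale using Eq. (1)."""
    weights = np.array([0.299, 0.587, 0.114])  # I_R, I_G, I_B coefficients
    return rgb[..., :3] @ weights

# A single pure-red pixel maps to 0.299 * 255.
pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
print(round(float(rgb_to_gray(pixel)[0, 0]), 3))  # 76.245
```

The resulting grayscale image can then be thresholded to produce the binary image of Figure 2(c).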

In addition, brightness or shape transformation features may be used with cumulative histogram operations. Typical methods are Histogram of Oriented Gradients (HOG) [6] and Scale-Invariant Feature Transform (SIFT) [7].
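The core of HOG, binning gradient orientations weighted by gradient magnitude, can be sketched as follows; cell tiling and block normalization, which full HOG adds on top, are omitted here:

```python
import numpy as np

def orientation_histogram(gray, bins=9):
    """Bin gradient orientations weighted by magnitude (the heart of HOG)."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientations in [0, 180) degrees, as in standard HOG.
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return hist

img = np.tile(np.arange(8.0), (8, 1))   # horizontal ramp: gradient along x
h = orientation_histogram(img)
print(int(np.argmax(h)))  # 0: all gradient energy falls in the first bin
```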

Disadvantages of these feature extraction algorithms are firstly that computation levels are high, and secondly that generalization is difficult due to the dependency on specific data.

### 2.2 Machine Learning

Machine learning classifies sample data after learning from extracted feature points, and good generalization of the data is required for good performance. Typical machine learning models are AdaBoost [8] and the support vector machine (SVM) [9], and the performance of these methods depends on the input feature points. The primary disadvantage of such methods is that they cannot extract optimized feature points, because feature extraction and classification are performed independently.
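To make the "hand-crafted features in, labels out" pipeline concrete, here is a minimal linear SVM trained by sub-gradient descent on the hinge loss. This is a toy stand-in for the SVM of [9], with hand-picked two-dimensional features; it is not the classifier used in any of the cited studies:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimal linear SVM: sub-gradient descent on the hinge loss.
    Labels y must be in {-1, +1}; features X are assumed precomputed."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:       # point violates the margin
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                               # only shrink (regularize) w
                w -= lr * lam * w
    return w, b

# Linearly separable toy "leaf features".
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print((pred == y).all())
```

The key limitation the text describes is visible here: the classifier can only be as good as the fixed features it is given.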

### 3.1 Visual System in Humans

Neural networks mimic the human visual processing structure, shown in Figure 3 [10]. In the retina, the portions of an object with the strongest differences in reflected light intensity are recognized as edges, and the result is sent to the lateral geniculate nucleus (LGN). The LGN neurons compress the overall shape around the corners of the object and send it to the primary visual cortex (V1). The V1 neurons recognize the corners, contours, and direction of motion of the object; they also interpret the disparity between the images on the left and right retinas as distance, and the result is sent to the secondary visual cortex (V2). The V2 neurons recognize the overall shape of the object and the color differences between its parts, and send this to the tertiary visual cortex (V3). The V3 neurons recognize the color of the whole object, and the overall shape and color are finally recognized in the lateral occipital cortex (LOC). As shown in Figure 3, the CNN is the neural network model whose functions are closest to this visual structure. The first CNN model, LeNet [11], was designed by Yann LeCun in 1998. It was inspired by experiments on the visual cortex of cats, which showed that not all neurons react at once when a picture is displayed; only some neurons respond.

In a CNN, the convolution and pooling layers replicate the LGN to V3 paths in the visual system structure, and extract feature points from the image. The fully connected layer acts in the same way as the LOC in a human visual system to recognize the image.

As shown in Figure 4, the CNN structure extracts features by performing the convolution operation on the input image, extracts the maximum or average feature values on the pooling layer, and then classifies them in the fully connected layer.
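The convolution and pooling steps of Figure 4 can be sketched in numpy for a single channel (stride 1, no padding); a real CNN repeats these operations over many channels and layers:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid convolution (strictly, cross-correlation, as CNNs use)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Take the maximum over non-overlapping size x size windows."""
    oh, ow = x.shape[0] // size, x.shape[1] // size
    return x[:oh*size, :ow*size].reshape(oh, size, ow, size).max(axis=(1, 3))

img = np.arange(16.0).reshape(4, 4)
feat = conv2d(img, np.ones((3, 3)) / 9)   # 3x3 averaging filter -> 2x2 map
print(max_pool(feat, 2).shape)  # (1, 1)
```

The fully connected layer would then flatten such feature maps and feed them to a classifier.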

The CNN model used in this study is GoogleNet. With the advent of this model, researchers have been able to deepen network structures without increasing computational complexity. GoogleNet uses inception modules, which apply multiple convolutions in parallel to extract various feature points. As shown in Figure 5, within the inception module a 1 × 1 convolution is placed in front of the larger filters as a dimension-reduction stage. By combining a 1 × 1 with a 3 × 3 convolution, or a 1 × 1 with a 5 × 5 convolution, the number of parameters can be reduced and the network deepened [12].
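As a rough illustration of why the 1 × 1 convolution saves parameters, compare the weight count of a direct 5 × 5 convolution with that of a 1 × 1 bottleneck followed by the same 5 × 5 convolution. The channel counts below are hypothetical, chosen only for illustration:

```python
# Hypothetical channel counts (not taken from the paper).
c_in, c_mid, c_out = 256, 32, 64

# Direct 5x5 convolution: every input channel feeds every output channel.
direct = c_in * c_out * 5 * 5

# 1x1 bottleneck to c_mid channels, then 5x5 up to c_out channels.
bottleneck = c_in * c_mid * 1 * 1 + c_mid * c_out * 5 * 5

print(direct, bottleneck)  # 409600 59392
```

With these (illustrative) numbers the bottleneck uses roughly 7× fewer weights, which is what lets the network grow deeper at the same computational cost.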

To reduce the number of parameters, a team at Oxford University conducted a deep-network study and developed the VGGNet model [13]. This model factorizes the convolutional filter, meaning that a deep network is constructed from several small layers.

Factorizing convolution can reduce the number of parameters by about 30% by dividing a 5 × 5 filter into two 3 × 3 filters, as shown in Figure 6; it can also extract feature points effectively.
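The roughly 30% figure can be checked by counting weights per (input, output) channel pair, ignoring biases:

```python
# Weights per channel pair, ignoring biases.
direct_5x5 = 5 * 5                # one 5x5 filter: 25 weights
factorized = 3 * 3 + 3 * 3        # two stacked 3x3 filters: 18 weights
saving = 1 - factorized / direct_5x5
print(f"{saving:.0%}")  # 28%
```

The exact figure is 28%; the stacked 3 × 3 filters also see the same 5 × 5 receptive field while adding an extra nonlinearity between them.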

As shown in Figure 7, the GoogleNet model is a 22-layer deep network built from inception modules, with a softmax classifier at the end. Deepening a network causes the vanishing gradient problem, which can lead to slow learning or overfitting. To counter this, GoogleNet introduced auxiliary classifiers, a form of deep supervision [14, 15], as shown in Figure 7. During backpropagation, the losses of the auxiliary classifiers are added to the main loss, which keeps useful gradients flowing to the lower layers and stabilizes learning. The auxiliary classifiers are used only during training and are discarded at the test stage.

GoogleNet uses batch normalization instead of dropout. As shown in Figure 8, a batch normalization layer is inserted before the activation function, so that the activation receives normalized inputs. In each layer, Eqs. (2) and (3) compute the mean and variance of the mini-batch [16], and the input is then normalized as in Eq. (4) by subtracting the mean and dividing by the square root of the variance plus a small constant ε.

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \tag{2}$$

$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2, \tag{3}$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \tag{4}$$

$$\mathrm{BN}(x_i) = \gamma \hat{x}_i + \beta. \tag{5}$$

As shown in Eq. (5), the normalized value is then multiplied by a learned scale factor γ and shifted by a learned factor β, which restores the representational power of the layer. By normalizing the inputs to each layer, batch normalization speeds up learning and acts as a regularizer, mitigating overfitting.
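Eqs. (2)-(5) can be written directly in numpy. This is the training-time, batch-statistics version only; at inference, running averages of the mean and variance are used instead, which is omitted here:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Eqs. (2)-(5): normalize a mini-batch per feature, then scale and shift."""
    mu = x.mean(axis=0)                    # Eq. (2): mini-batch mean
    var = x.var(axis=0)                    # Eq. (3): mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. (4): normalize
    return gamma * x_hat + beta            # Eq. (5): scale and shift

batch = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = batch_norm(batch)
print(np.allclose(out.mean(axis=0), 0.0),
      np.allclose(out.std(axis=0), 1.0, atol=1e-2))
# True True
```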

### 4.1 Image Cropping

Image cropping removes unnecessary background in order to reduce the amount of GPU computation. Figure 9(a) shows the input image used for learning, Figure 9(b) shows the result of cropping the input image, and Figure 9(c) shows the cropped image resized to 229 × 229 pixels. The adjusted images were used as the experimental images.
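The paper does not specify its exact cropping rule, so the sketch below makes an assumption: crop to the bounding box of the nonzero foreground, then resize by nearest-neighbor sampling:

```python
import numpy as np

def crop_and_resize(img, size=229):
    """Crop to the bounding box of nonzero pixels, then resize by
    nearest-neighbor sampling. A sketch; the paper's exact rule is unknown."""
    ys, xs = np.nonzero(img)
    crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    row_idx = np.arange(size) * crop.shape[0] // size
    col_idx = np.arange(size) * crop.shape[1] // size
    return crop[np.ix_(row_idx, col_idx)]

img = np.zeros((1200, 1600))
img[100:500, 200:700] = 1.0          # synthetic "leaf" region
print(crop_and_resize(img).shape)    # (229, 229)
```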

### 4.2 Multi-scale Technique

Multi-scale learning randomly resizes the training images to several sizes between a minimum and a maximum, as shown in Figure 10. This method helps prevent the overfitting that arises from having too little training data.
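A minimal sketch of this augmentation: each call resizes the image to a randomly chosen side length. The size range below is illustrative only; the paper does not state the range it used:

```python
import numpy as np

def random_scale(img, min_size=160, max_size=320, rng=None):
    """Resize to a random side length via nearest-neighbor sampling.
    The [min_size, max_size] range is an assumption, not from the paper."""
    rng = rng if rng is not None else np.random.default_rng()
    s = int(rng.integers(min_size, max_size + 1))
    idx_y = np.arange(s) * img.shape[0] // s
    idx_x = np.arange(s) * img.shape[1] // s
    return img[np.ix_(idx_y, idx_x)]

img = np.ones((229, 229))
out = random_scale(img, rng=np.random.default_rng(0))
print(160 <= out.shape[0] <= 320)  # True
```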

### 4.3 Learning Using the CNN Model

For leaf recognition, the basic structure of the GoogleNet model and a modified version are used. The basic structure is shown in Table 1, and the structure of the inception modules used is shown in Figure 11. The inception modules in Figure 11 adopt the factorizing convolution method, so that the conventional linear convolution filter is replaced by a deeper, nonlinear structure. The modified GoogleNet model has the structure shown in Table 2, with two additional modules of the type shown in Figure 11(a).

The model in Table 1 does not begin with inception modules. First, the input image is resized to 229 × 229, and 3 × 3 stride-2 and 3 × 3 stride-1 convolutions are performed. A padded 3 × 3 convolution is then applied to reduce data loss before pooling. After pooling, 3 × 3 stride-1, 3 × 3 stride-2, and 3 × 3 stride-1 convolutions are executed. The data then pass through the three inception modules of Figure 11(a), the five of Figure 11(b), and the two of Figure 11(c), after which an 8 × 8 pooling operation is applied and a linear layer reduces the result to the logits. The softmax classifier is used in the last stage. The model in Table 2 is the Table 1 model with two additional inception modules of the Figure 11(a) type.

### 5.1 Experimental Environment

This paper uses leaf sample data from the Flavia dataset [17] and the common leaf shapes shown in Figure 12. As listed in Table 3, the eight leaf types are lanceolate, oval, acicular, linear, oblong, reniform, cordate, and palmate; an example of each type is shown in Figure 13. The training set consisted of 3,767 images across the eight types, and 100 images such as those in Figure 12(b) were randomly selected for testing.

We constructed the following experimental environment for learning and testing. The operating system was Linux CentOS 7.0, and the CPU an Intel i7-6770k. The main memory size was 32 GB, and two NVIDIA Maxwell TITAN graphics cards were used for parallel processing. The deep learning framework used was TensorFlow r0.10.

### 5.2 Experiment Method

Two CNN models were selected and tested: GoogleNet and a variant of GoogleNet, so that the change in performance from adding layers could be checked. Each image used in the experiment was resized from 1600 × 1200 to 229 × 229 to fit the models. We also tested discoloration and deformation by creating leaf images that were randomly cut or pitted, as commonly occurs in nature. The leaf images used in these tests are shown in Figures 14 and 15: Figure 14 shows increasing discoloration ratios of the input leaf images, and Figure 15 shows images of damaged leaves.
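The damaged test images can be simulated by zeroing random circular patches. This sketch is in the spirit of the cut and pitted images of Figure 15; the paper's exact damage procedure is not specified, so the hole count and radius here are assumptions:

```python
import numpy as np

def punch_holes(img, n_holes=50, radius=3, rng=None):
    """Simulate leaf damage by zeroing random circular patches.
    n_holes and radius are illustrative, not taken from the paper."""
    rng = rng if rng is not None else np.random.default_rng()
    out = img.copy()
    h, w = out.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    for _ in range(n_holes):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        out[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 0
    return out

img = np.ones((64, 64))
damaged = punch_holes(img, rng=np.random.default_rng(1))
print(damaged.sum() < img.sum())  # True
```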

The images in the Flavia dataset are aligned vertically, horizontally, or at 45°, orientations that will not necessarily occur in nature. We therefore covered all possible leaf directions by rotating the images in 90° steps. Using the methods described above, 10,000 training iterations were conducted and the performance of the two models was compared.
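Generating the four 90° orientations of a leaf image is a one-liner with `np.rot90`:

```python
import numpy as np

img = np.arange(9).reshape(3, 3)
# All four 90-degree orientations of one image (k quarter-turns each).
rotations = [np.rot90(img, k) for k in range(4)]
print(len(rotations), rotations[1][0, 0])  # 4 2
```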

### 5.3 Experiment Results

The two models described above were tested, and Model 2 showed an advantage over Model 1: increasing the number of inception modules, as in Table 2, slightly improves performance. However, as shown in Table 4, the difference between Model 1 and Model 2 is small. Further experimental images were obtained by discoloring leaves as in Figure 14 and damaging them as in Figure 15, at various angles.

One hundred discolored images were prepared and tested, as shown in Figure 14. Testing shows that the recognition rate degrades as the discoloration ratio of the leaves increases; however, the degradation is not severe. Table 5 shows that Model 2 is slightly better than Model 1.

Table 6 shows that the recognition rate of Model 2 remains slightly better than that of Model 1, even when the leaf image contains 50 holes. Overall, the recognition rate of our CNN-based system was above 94%, even when 30% of the leaf was damaged. Our system therefore improves upon previous studies, which achieved recognition rates of approximately 90%.

In this paper, we proposed a new method to classify leaves using the CNN model, and created two models by adjusting the network depth using GoogleNet. We evaluated the performance of each model according to the discoloration of, or damage to, leaves. The recognition rate achieved was greater than 94%, even when 30% of the leaf was damaged.

In future research we will attempt to recognize leaves attached to branches, in order to develop a visual system that can replicate the method used by humans to identify plant types.

This work was supported by the Ministry of Education (MOE) and the National Research Foundation of Korea (NRF), through the Human Resource Training Project for Regional Innovation (No. 2015-HICLA1035526).

Fig. 1.

System composition.

Fig. 2.

Example of leaf contour extraction.

(a) Input image, (b) gray scale image,

(c) binary image, and (d) contour extraction.

Fig. 3.

Human visual system structure.

Fig. 4.

Basic structure of a convolution neural network.

Fig. 5.

Inception module structure.

Fig. 6.

Factorizing convolution used in the VGGNet model.

Fig. 7.

GoogleNet structure and auxiliary classifier units.

Fig. 8.

Batch normalization method.

Fig. 9.

Leaf image cropping and resizing example: (a) input image, (b) cropped image, (c) 229 × 229 image.

Fig. 10.

Multi-scale image.

Fig. 11.

Factorizing convolution applied in the inception module.

Fig. 12.

(a) Flavia image dataset and (b) natural leaves.

Fig. 13.

Leaf shapes: (a) lanceolate, (b) oval, (c) acicular, (d) linear, (e) oblong, (f) reniform (kidney-shaped), (g) cordate (heart-shaped), and (h) palmate leaf.

Fig. 14.

Color change: (a) input image, (b) discoloration 5%, (c) discoloration 10%, (d) discoloration 30%, (e) discoloration 50%, and (f) discoloration 60%.

Fig. 15.

Leaf damage: (a) damage 5%, (b) damage 10%, (c) damage 15%, and (d) damage 30%.

Table 1. GoogleNet basic structure [Model 1].

| Type | Filter size / stride | Input size |
|---|---|---|
| Conv | 3 × 3 / 2 | 229 × 229 × 3 |
| Conv | 3 × 3 / 1 | 149 × 149 × 32 |
| Conv padded | 3 × 3 / 1 | 147 × 147 × 32 |
| Pool | 3 × 3 / 2 | 147 × 147 × 64 |
| Conv | 3 × 3 / 1 | 73 × 73 × 64 |
| Conv | 3 × 3 / 2 | 71 × 71 × 80 |
| Conv | 3 × 3 / 1 | 35 × 35 × 192 |
| 3 × Inception | Figure 11(a) | 35 × 35 × 288 |
| 5 × Inception | Figure 11(b) | 17 × 17 × 768 |
| 2 × Inception | Figure 11(c) | 8 × 8 × 1280 |
| Pool | 8 × 8 | 8 × 8 × 2048 |
| Linear | Logits | 1 × 1 × 2048 |
| Softmax | Classifier | 1 × 1 × 1000 |

Conv: convolution.

Table 2. Modified GoogleNet structure [Model 2].

| Type | Filter size / stride | Input size |
|---|---|---|
| Conv | 3 × 3 / 2 | 229 × 229 × 3 |
| Conv | 3 × 3 / 1 | 149 × 149 × 32 |
| Conv padded | 3 × 3 / 1 | 147 × 147 × 32 |
| Pool | 3 × 3 / 2 | 147 × 147 × 64 |
| Conv | 3 × 3 / 1 | 73 × 73 × 64 |
| Conv | 3 × 3 / 2 | 71 × 71 × 80 |
| Conv | 3 × 3 / 1 | 35 × 35 × 192 |
| 5 × Inception | Figure 11(a) | 35 × 35 × 288 |
| 5 × Inception | Figure 11(b) | 17 × 17 × 768 |
| 2 × Inception | Figure 11(c) | 8 × 8 × 1280 |
| Pool | 8 × 8 | 8 × 8 × 2048 |
| Linear | Logits | 1 × 1 × 2048 |
| Softmax | Classifier | 1 × 1 × 1000 |

Conv: convolution.

Table 3. Type and number of leaves.

| Leaf type | Number of images |
|---|---|
| Lanceolate | 568 |
| Oval | 554 |
| Acicular | 612 |
| Linear | 439 |
| Oblong | 374 |
| Reniform (kidney-shaped) | 580 |
| Cordate (heart-shaped) | 379 |
| Palmate leaf | 361 |

Table 4. Model performance evaluation.

| | Model 1 | Model 2 |
|---|---|---|
| Image size | 229 × 229 | 229 × 229 |
| Training time | 8 h 43 m | 9 h 18 m |
| Accuracy | 99.6% | 99.8% |

Table 5. Accuracy rate (%) in relation to discoloration.

| | Model 1 | Model 2 |
|---|---|---|
| Image in Figure 14(b) | 99.5 | 99.65 |
| Image in Figure 14(c) | 99.2 | 99.3 |
| Image in Figure 14(d) | 98.8 | 99.1 |
| Image in Figure 14(e) | 98.5 | 98.9 |
| Image in Figure 14(f) | 98.2 | 98.6 |

Table 6. Accuracy rate (%) in relation to damage.

| | Model 1 | Model 2 |
|---|---|---|
| Image in Figure 15(a) | 97.4 | 98.4 |
| Image in Figure 15(b) | 96.8 | 98.0 |
| Image in Figure 15(c) | 96.2 | 97.6 |
| Image in Figure 15(d) | 94.4 | 95.0 |

1. Friis, I, and Balslev, H 2003. Plant diversity and complexity patterns: local, regional, and global dimensions., Proceedings of an International Symposium, Held at the Royal Danish Academy of Sciences and Letters, Copenhagen, Denmark, pp.25-28.
2. Nam, Y, and Hwang, E (2005). A representation and matching method for shape-based leaf image retrieval. Journal of KIISE: Software and Applications. 32, 1013-1021.
3. Nam, Y, Park, J, Hwang, E, and Kim, D (2006). Shape-based leaf image retrieval using venation feature. Proceedings of 2006 Korea Computer Congress. 33, 346-348.
4. Krizhevsky, A, Sutskever, I, and Hinton, GE 2012. ImageNet classification with deep convolutional neural networks., Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, pp.1097-1105.
5. Du, JX, Wang, XF, and Zhang, GJ (2007). Leaf shape based plant species recognition. Applied Mathematics and Computation. 185, 883-893.
6. Dalal, N, and Triggs, B 2005. Histograms of oriented gradients for human detection., Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, Array, pp.886-893.
7. Lowe, DG (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 60, 91-110.
8. Freund, Y, and Schapire, RE (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the 13th International Conference. 96, 148-156.
9. Cortes, C, and Vapnik, V (1995). Support-vector networks. Machine Learning. 20, 273-297.
10. Chum, E. Drones and artificial intelligence–AI. Available http://www.focus.kr/view.php?key=2016041200101856440
11. LeCun, Y, Bottou, L, Bengio, Y, and Haffner, P (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE. 86, 2278-2324.
12. Szegedy, C, Liu, W, Jia, Y, Sermanet, P, Reed, S, Anguelov, D, Erhan, D, Vanhoucke, V, and Rabinovich, A 2015. Going deeper with convolutions., Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, Array.
13. Simonyan, K, and Zisserman, A. (2015) . Very deep convolutional networks for large-scale image recognition. Available https://arxiv.org/cs/1409.1556
14. Wang, L, Lee, CY, Tu, Z, and Lazebnik, S. (2015) . Training deeper convolutional networks with deep supervision. Available https://arxiv.org/cs/1505.02496
15. Lee, CY, Xie, S, Gallagher, P, Zhang, Z, and Tu, Z. (2014) . Deeply-supervised nets. Available https://arxiv.org/stat/1409.5185
16. Ioffe, S, and Szegedy, C 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift., Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp.448-456.
17. Leaf recognition algorithm for plant classification using probabilistic neural network. Available http://flavia.sourceforge.net/

Wang-Su Jeon received his B.S. degree in Computer Engineering from Kyungnam University, Changwon, Korea, in 2016, and is currently pursuing an M.S. degree in IT Convergence Engineering at Kyungnam University, Changwon, Korea. His present interests include computer vision, pattern recognition and machine learning.

E-mail: jws2218@naver.com

Sang-Yong Rhee received his B.S. and M.S. degrees in Industrial Engineering from Korea University, Seoul, Korea, in 1982 and 1984, respectively, and his Ph.D. degree in Industrial Engineering from Pohang University of Science and Technology, Pohang, Korea. He is currently a professor in the Department of Computer Engineering, Kyungnam University, Changwon, Korea. His research interests include computer vision, augmented reality, neuro-fuzzy, and human-robot interfaces.

E-mail: syrhee@kyungnam.ac.kr

### Article

#### Original Article

Int. J. Fuzzy Log. Intell. Syst. 2017; 17(1): 26-34

Published online March 31, 2017 https://doi.org/10.5391/IJFIS.2017.17.1.26

## Plant Leaf Recognition Using a Convolution Neural Network

Wang-Su Jeon1, and Sang-Yong Rhee2

1Department of IT Convergence Engineering, Kyungnam University, Changwon, Korea, 2Department of Computer Engineering, Kyungnam University, Changwon, Korea

Correspondence to: Sang-Yong Rhee (syrhee@kyungnam.ac.kr)

Received: February 1, 2017; Revised: February 23, 2017; Accepted: March 24, 2017

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

### Abstract

There are hundreds of kinds of trees in the natural ecosystem, and it can be very difficult to distinguish between them. Botanists and those who study plants however, are able to identify the type of tree at a glance by using the characteristics of the leaf. Machine learning is used to automatically classify leaf types. Studied extensively in 2012, this is a rapidly growing field based on deep learning. Deep learning is itself a self-learning technique used on large amounts of data, and recent developments in hardware and big data have made this technique more practical. We propose a method to classify leaves using the CNN model, which is often used when applying deep learning to image processing.

Keywords: Leaf, Classification, Visual system, CNN, GoogleNet

### 1. Introduction

When leaving a town and entering the suburbs, we may encounter many kinds of trees. We may be able to identify those trees that often grow on urban streets, however most of the trees and plants found in city suburbs will be unknown to the majority of us. There are approximately 100,000 species of trees on earth, which account for about 25% of all plants. Many of the trees are in tropical regions, and because only limited botanical research has been carried out in these areas, it is believed that there are many undiscovered species [1]. It is clear that identifying large numbers of such trees is a complex process.

An example of the complexity of tree identification can be seen with plums and apricots. These are very similar in leaf shape, the shape of the tree, and even in the shape of the young fruit. The flower shape is also very similar, and the tree type can only be identified by determining whether the calyx is attached, or inverted relative to the petal. Additionally, some trees are not easily distinguishable except at particular times; for example, when they bloom or bear fruit. To identify trees like these, considerable information is required, including leaf shape, shape of the leaves that are directly attached to branches, branch shape, shape of the whole tree, tree size, flower shape, flowering time, and fruit.

When using branches of biology such as cell biology, molecular biology, phytochemistry, or morphologic anatomy, it may be possible to distinguish plants without time constraints. However, it is unrealistic for the general public to identify the names of trees or plants using these methods when, for example, they are walking in a woodland.

Since the public will usually distinguish trees by their appearance, studies have been carried out to recognize trees using this method alone. Du et al. [1] have converted a color input image into a binarized image to extract the outline, and the two dimensional features were then extracted using the outline image. These features were grouped using the Move Median Centers (MMC) classifier. This study showed faster execution speeds than those of previous studies, and generated accurate results using the combination of the characteristics. However, the recognition rate was only approximately 90%.

Nam and his colleague [2, 3] used a shape-based search method to distinguish plants. They tried to improve the accuracy by using not only the outline, but also the vein data of the leaves. The outline recognition was improved using the minimum perimeter polygons (MPP) algorithm, and the vein data were represented using extended curvature scale space (CSS) to extract the midrib, intersections and endpoints. A weighted graph was composed using the relationship between the feature points and the distance value, and the similarity was calculated using the value. A total of 1,032 plant leaf images were used from plant inscriptions, however, the exact recognition rate was not given because the research related to database searches rather than to plant identification. However, a result graph showed better search results than for the existing study.

Because recognition is a considerable challenge in the field of computer vision, further development using ImageNet Large Scale Visual Recognition Competition (ILSVRC), has been attempted; however, until 2011, the top five error rates of the most effective recognition technique were 25.8%. With the emergence of AlexNet [4] in 2012, the error rate has dropped sharply to 16.4%, and this dropped further to 2.99% at 2016. This is the result of improved performance when compared to traditional machine learning methods, which classify data after extracting features or preprocessing. In this paper, we study a method for learning and recognizing types of leaves using the convolution neural network (CNN) model, which is a deep learning technology.

The system proposed in this paper is constructed as shown in Figure 1. The method proposes to improve classification performance by using a CNN that extracts and learns feature points.

In Section 2, we examine existing leaf recognition research. In Section 3, we describe GoogleNet, a CNN that imitates human visual systems. Section 4 explains the leaf recognition system, while Section 5 describes the experiment and analyzes the results. Section 6 concludes the paper.

### 2.1 Feature Extraction

In previous studies, the leaf color, contour, texture, and shape were used to classify plants. As shown in Figure 2, the color image was transformed into a grayscale image by applying Eq. (1), the grayscale image was then converted to a binary one through binarization, and the contour then extracted. The features are extracted using the characteristics of the contour line [5]. Using these features, the recognition rate was 90% when classified through machine learning. Because the shape of the leaf outlines are similar to each other, the features alone make it difficult to classify the plant.

$Grau=0.299×Ir+0.587×I0+0.114×I.$

In addition, brightness or shape transformation features may be used with cumulative histogram operations. Typical methods are Histogram of Oriented Gradients (HOG) [6] and Scale-Invariant Feature Transform (SIFT) [7].

Disadvantages of these feature extraction algorithms are firstly that computation levels are high, and secondly that generalization is difficult due to the dependency on specific data.

### 2.2 Machine Learning

Machine learning is a method that classifies sample data after learning to use feature points. For better performance, generalization of the data is required. Typical machine learning models are AdaBoost [8] and support vector machine (SVM) [9], and the performance of these methods depends on the input feature points. The primary disadvantage of existing machine learning methods is that they cannot extract the optimized feature points, because the learning and classification processes are performed independently.

### 3.1 Visual System in Humans

Neural networks mimic the human visual processing neural structure, as shown in Figure 3 [10]. In the retina, the portions of an object that have the strongest difference in the intensity of reflected light are recognized as the edges of the object, and the result is sent to the lateral geniculate nucleus (LGN). The LGN neurons compress the entire shape around the corners of the object and send it to the primary visual cortex (V1). The V1 neurons then recognize the corners, contour, and direction of motion of the object. They also interpret the difference between the images reflected in the retinas of the left and right eyes as distance, and the result is sent to the secondary visual cortex (V2). The V2 neurons recognize the overall shape of the object and the color difference between its parts, and send this to the tertiary visual cortex (V3). The V3 neurons recognize the color of the whole object, and the overall shape and color of the object are recognized at the lateral occipital cortex (LOC). As shown in Figure 3, the CNN is the neural network model that implements functions closest to the human visual structure. The first CNN model, LeNet [11], was designed by Yann LeCun in 1998. It was inspired by an experiment on the optic nerve of a cat's brain, which showed that the neurons did not all react at the same time when a picture was displayed; instead, only some neurons responded.

In a CNN, the convolution and pooling layers replicate the LGN to V3 paths in the visual system structure, and extract feature points from the image. The fully connected layer acts in the same way as the LOC in a human visual system to recognize the image.

As shown in Figure 4, the CNN structure extracts features by performing the convolution operation on the input image, extracts the maximum or average feature values on the pooling layer, and then classifies them in the fully connected layer.
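The pooling step described above can be illustrated with a toy 2 × 2 max pooling over a small feature map (the values are hypothetical):

```python
def max_pool(x, k=2):
    """k x k max pooling with stride k over a 2-D feature map (list of lists).
    Assumes the map's height and width are divisible by k."""
    h, w = len(x), len(x[0])
    return [[max(x[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, w, k)]
            for i in range(0, h, k)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool(fmap))  # [[4, 2], [2, 8]]
```

Average pooling is the same loop with `sum(...) / k**2` in place of `max`.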

The CNN model used in this study is GoogleNet. With the advent of this model, researchers have developed deeper network structures that do not increase computational complexity. GoogleNet uses inception modules that apply multiple convolutions in parallel to extract various feature points. As shown in Figure 5, in the inception module a 1 × 1 convolution is cascaded in front of the larger filters. By pairing a 1 × 1 with a 3 × 3 convolution, or a 1 × 1 with a 5 × 5 convolution, it is possible to reduce the number of parameters and deepen the network [12].
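The parameter saving from the 1 × 1 reduction can be checked by counting weights. The channel counts below (a 5 × 5 convolution from 192 to 32 channels, with a 1 × 1 reduction to 16 channels in front) are illustrative assumptions, not values given in this paper:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution between channel depths (biases ignored)."""
    return c_in * c_out * k * k

# Direct 5 x 5 convolution, 192 -> 32 channels:
direct = conv_params(192, 32, 5)
# 1 x 1 reduction to 16 channels, then 5 x 5 to 32 channels:
reduced = conv_params(192, 16, 1) + conv_params(16, 32, 5)
print(direct, reduced)  # 153600 15872
```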

To reduce the number of parameters, a team at Oxford University conducted a deep network study and developed the VGGNet model [13]. This model factorizes the convolutional filter, which means that a deep network is constructed from several small convolutional layers.

Factorizing convolution can reduce the parameters by about 30%, by dividing the 5×5 filter into 3×3 and 3×3, as shown in Figure 6; this can also effectively extract feature points.
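The roughly 30% figure follows from counting weights per input/output channel pair, since two stacked 3 × 3 filters cover the same 5 × 5 receptive field:

```python
# Per input/output channel pair: one 5 x 5 filter vs. two stacked 3 x 3 filters.
w_5x5 = 5 * 5         # 25 weights
w_two_3x3 = 2 * 3 * 3  # 18 weights
saving = 1 - w_two_3x3 / w_5x5
print(w_5x5, w_two_3x3, saving)  # 25 18, saving of about 0.28
```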

As shown in Figure 7, the GoogleNet model consists of a deep network of 22 layers built from inception modules, with a softmax function at the end. The vanishing gradient problem is an issue caused by the deepening of the network, and may lead to slow learning or overfitting. To avoid it, GoogleNet introduced auxiliary classifiers, named 'super vision' [14, 15], as shown in Figure 7. The vanishing gradient problem is alleviated by adding the results of the auxiliary classifiers into the backpropagation algorithm while the optimal values are stored, which yields stable learning. At the end of the learning process, the auxiliary classifiers are removed and are not used at the test stage.

GoogleNet uses batch normalization instead of dropout. As shown in Figure 8, batch normalization inserts an additional layer so that the result produced before it is normalized and then used as the input to the activation function. In the batch normalization layer, Eqs. (2) and (3) are used to obtain the mean and variance of each mini-batch [16]. Using the obtained mean and variance, the input is normalized as shown in Eq. (4): the input minus the mini-batch mean is divided by the square root of the variance plus a small constant ε.

$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i,$
$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2,$
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$
$BN(x_i) = \gamma \hat{x}_i + \beta.$

Nonlinearity is preserved by multiplying the normalized value by a scale factor γ and adding a shift factor β, as shown in Eq. (5). Batch normalization mitigates overfitting by normalizing the inputs to each layer, which speeds up learning and provides a regularization effect.
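Eqs. (2)-(5) translate directly into code; a one-dimensional sketch over a single mini-batch of activations:

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Eqs. (2)-(5): normalize a mini-batch of activations, then scale and shift."""
    m = len(xs)
    mu = sum(xs) / m                                       # Eq. (2): mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m               # Eq. (3): mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # Eq. (4): normalize
    return [gamma * xh + beta for xh in x_hat]             # Eq. (5): scale and shift

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(sum(out) / len(out))  # the output mean equals beta (here 0.5)
```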

### 4.1 Image Cropping

Image cropping minimizes the background portion of the image and thereby reduces the amount of GPU computation. Figure 9(a) shows the input image used for learning, Figure 9(b) shows the result of cropping the input image, and Figure 9(c) shows an image obtained by resizing the cropped image to 229 × 229 pixels. The adjusted images were used as experimental images.
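The crop-and-resize step can be sketched as follows; the bounding box is supplied by hand here, and the nearest-neighbour loop stands in for a library resize call:

```python
def crop_and_resize(img, box, size=(229, 229)):
    """Crop a 2-D image (list of rows) to box = (left, top, right, bottom),
    then nearest-neighbour resize it to `size` = (height, width)."""
    left, top, right, bottom = box
    crop = [row[left:right] for row in img[top:bottom]]
    h, w = len(crop), len(crop[0])
    H, W = size
    return [[crop[i * h // H][j * w // W] for j in range(W)] for i in range(H)]

# Synthetic 300 x 400 "image"; crop a 220 x 300 region around the leaf.
leaf = [[(x + y) % 255 for x in range(400)] for y in range(300)]
small = crop_and_resize(leaf, box=(50, 40, 350, 260))
print(len(small), len(small[0]))  # 229 229
```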

### 4.2 Multi-scale Technique

Multi-scale training randomly resizes the input to several sizes between a minimum and a maximum size, as shown in Figure 10. Using this method, it is possible to prevent the overfitting that arises from limited training data.
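The procedure can be sketched as below; the scale range (256-480) and the nearest-neighbour resize are illustrative assumptions, not the paper's exact settings:

```python
import random

def resize_nn(img, s):
    """Nearest-neighbour resize of a square 2-D grid to s x s."""
    n = len(img)
    return [[img[i * n // s][j * n // s] for j in range(s)] for i in range(s)]

def multi_scale_sample(img, out=229, s_min=256, s_max=480):
    """One augmented sample: resize to a random scale, then take a random out x out crop."""
    s = random.randint(s_min, s_max)
    big = resize_nn(img, s)
    top = random.randint(0, s - out)
    left = random.randint(0, s - out)
    return [row[left:left + out] for row in big[top:top + out]]

patch = multi_scale_sample([[0] * 64 for _ in range(64)])
print(len(patch), len(patch[0]))  # 229 229
```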

### 4.3 Learning Using the CNN Model

For leaf recognition, a basic and a modified structure of the GoogleNet model are used. The basic structure is shown in Table 1, and the structure of the inception modules used is shown in Figure 11. The inception modules in Figure 11 adopt the factorizing convolution method, so the conventional linear convolution filter gains a nonlinear structure. The modified GoogleNet model follows the structure in Table 2, with the two additional modules of the type shown in Figure 11(a).

The model in Table 1 does not begin with inception modules. First, the input image is resized to 229 × 229, and 3 × 3 stride-2 and 3 × 3 stride-1 convolutions are performed. A padded 3 × 3 convolution is also applied to reduce data loss before pooling. After pooling, 3 × 3 stride-1, 3 × 3 stride-2, and 3 × 3 stride-1 convolutions are executed. After passing through the three inception modules shown in Figure 11(a), the five shown in Figure 11(b), and the two shown in Figure 11(c), an 8 × 8 pooling operation is applied, followed by a linear layer with linear activation that reduces the output to the logits. The softmax classifier is used in the last stage. The model in Table 2 is the model in Table 1 with the two additional inception modules of Figure 11(a).
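The spatial sizes in Table 1 follow the usual convolution output-size arithmetic, which a small checker can trace. (Note that the 149 × 149 output after the first stride-2 convolution is consistent with a 299 × 299 input rather than 229 × 229; the 299 assumption below is ours.)

```python
def out_size(n, k, s, p=0):
    """Spatial output size of a k x k convolution/pool with stride s and padding p."""
    return (n + 2 * p - k) // s + 1

# Trace the first rows of Table 1: conv /2, conv /1, padded conv /1, pool /2.
sizes, n = [], 299
for k, s, p in [(3, 2, 0), (3, 1, 0), (3, 1, 1), (3, 2, 0)]:
    n = out_size(n, k, s, p)
    sizes.append(n)
print(sizes)  # [149, 147, 147, 73]
```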

### 5.1 Experimental Environment

This paper uses the leaf sample data from the Flavia dataset [17], together with the common leaf types shown in Figure 12. As shown in Table 3, the eight leaf types are lanceolate, oval, acicular, linear, oblong, reniform (kidney-shaped), cordate (heart-shaped), and palmate. The details of each type are shown in Figure 13. The 3,767 training images were divided into the eight types, and 100 images like those in Figure 12(b) were randomly selected as test images.
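The random hold-out of 100 test images can be sketched as follows; the file names and the fixed seed are hypothetical:

```python
import random

def split_dataset(paths, n_test=100, seed=0):
    """Randomly hold out n_test images for testing; the rest are used for training."""
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(paths)), n_test))
    train = [p for i, p in enumerate(paths) if i not in test_idx]
    test = [paths[i] for i in sorted(test_idx)]
    return train, test

paths = [f"leaf_{i}.jpg" for i in range(3767)]  # hypothetical file names
train, test = split_dataset(paths)
print(len(train), len(test))  # 3667 100
```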

We constructed the following experimental environment for learning and testing. The operating system used was Linux CentOS 7.0, and the CPU an Intel i7-6770k. The main memory size was 32 GB, and two parallel processing boards were used with an NVIDIA Maxwell TITAN graphics card. The deep learning framework used was TensorFlow r0.10.

### 5.2 Experiment Method

Two CNN models were selected and tested: GoogleNet and a variant of GoogleNet, so that changes in performance could be checked when layers were added. The size of each image used in the experiment was adjusted from 1600 × 1200 to 229 × 229 to fit the model. We also tested discolored and deformed leaves by randomly cutting or punching holes in the leaf images, as occurs in nature. The leaf images used in the test are shown in Figures 14 and 15. Figure 14 shows the discoloration ratios of the input leaf images, and Figure 15 shows images of damaged leaves.
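Damage of the kind in Figure 15 can be simulated by overwriting random patches with the background value; the hole size, count, and background intensity below are assumptions for illustration:

```python
import random

def punch_holes(img, n_holes=50, r=3, bg=255, seed=0):
    """Simulate leaf damage by overwriting n_holes random r x r patches with background."""
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # leave the original image untouched
    for _ in range(n_holes):
        y, x = rng.randrange(h - r), rng.randrange(w - r)
        for dy in range(r):
            for dx in range(r):
                out[y + dy][x + dx] = bg
    return out

leaf = [[0] * 100 for _ in range(100)]  # synthetic all-dark "leaf"
damaged = punch_holes(leaf)
holes = sum(v == 255 for row in damaged for v in row)
print(holes)  # at most 50 * 9 = 450 background pixels (fewer if patches overlap)
```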

The images in the Flavia dataset are displayed vertically, horizontally, or at a 45° angle, orientations that are not necessarily found in nature. We therefore covered all possible leaf directions by rotating the images by 90°. Using the methods described above, 10,000 training sessions were conducted and the performance of the two models was compared.
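The 90° rotations can be generated with plain list manipulation; each image yields four orientation variants:

```python
def rot90(grid):
    """Rotate a 2-D list 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def orientations(img):
    """The four 90-degree rotations used to cover all leaf directions."""
    out = [img]
    for _ in range(3):
        out.append(rot90(out[-1]))
    return out

views = orientations([[1, 2], [3, 4]])
print(views[1])  # [[3, 1], [4, 2]]
```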

### 5.3 Experiment Results

The two models described above were tested, and Model 2 demonstrated a slight advantage over Model 1: increasing the number of inception modules in Model 2 (Table 2) slightly improved performance. However, as shown in Table 4, the difference between the two models is small. Experimental images were obtained by using the discolored images in Figure 14 and the damaged images in Figure 15, at different angles.

One hundred discolored images were prepared and tested, as shown in Figure 14. Testing of the discolored images shows that the recognition rate degrades as the discoloration ratio of the leaves increases; however, the degradation was not severe. Table 5 shows that Model 2 is slightly better than Model 1.

Table 6 shows that the recognition rate of Model 2 is slightly better than that of Model 1, even when the leaf image contained 50 holes. According to the above results, the recognition rate of our system was above 94% when using the CNN, even when 30% of the leaf was damaged. Our system therefore improves upon previous studies, which achieved a recognition rate of approximately 90%.

### 6. Conclusion

In this paper, we proposed a new method to classify leaves using the CNN model, and created two models by adjusting the network depth using GoogleNet. We evaluated the performance of each model according to the discoloration of, or damage to, leaves. The recognition rate achieved was greater than 94%, even when 30% of the leaf was damaged.

In future research we will attempt to recognize leaves attached to branches, in order to develop a visual system that can replicate the method used by humans to identify plant types.

### Acknowledgements

This work was supported by the Ministry of Education (MOE) and the National Research Foundation of Korea (NRF), through the Human Resource Training Project for Regional Innovation (No. 2015-HICLA1035526).

### Fig 1.

Figure 1. System composition.

The International Journal of Fuzzy Logic and Intelligent Systems 2017; 17: 26-34. https://doi.org/10.5391/IJFIS.2017.17.1.26

### Fig 2.

Figure 2. Example of leaf contour extraction: (a) input image, (b) grayscale image, (c) binary image, and (d) contour extraction.

### Fig 3.

Figure 3. Human visual system structure.

### Fig 4.

Figure 4. Basic structure of a convolution neural network.

### Fig 5.

Figure 5. Inception module structure.

### Fig 6.

Figure 6. Factorizing convolution used in the VGGNet model.

### Fig 7.

Figure 7. GoogleNet structure and auxiliary classifier units.

### Fig 8.

Figure 8. Batch normalization method.

### Fig 9.

Figure 9. Leaf image cropping and resize example: (a) input image, (b) cropped image, (c) 229 × 229 image.

### Fig 10.

Figure 10. Multi-scale image.

### Fig 11.

Figure 11. Factorizing convolution applied in the inception module.

### Fig 12.

Figure 12. (a) Flavia image dataset and (b) natural leaves.

### Fig 13.

Figure 13. Leaf shapes: (a) lanceolate, (b) oval, (c) acicular, (d) linear, (e) oblong, (f) reniform (kidney-shaped), (g) cordate (heart-shaped), and (h) palmate leaf.

### Fig 14.

Figure 14. Color change: (a) input image, (b) discoloration 5%, (c) discoloration 10%, (d) discoloration 30%, (e) discoloration 50%, and (f) discoloration 60%.

### Fig 15.

Figure 15. Leaf damage: (a) damage 5%, (b) damage 10%, (c) damage 15%, and (d) damage 30%.

Structure of the basic GoogleNet model (Model 1).

| Type | Filter size / stride | Input size |
|---|---|---|
| Conv | 3 × 3 / 2 | 229 × 229 |
| Conv | 3 × 3 / 1 | 149 × 149 × 32 |
| Conv padded | 3 × 3 / 1 | 147 × 147 × 32 |
| Pool | 3 × 3 / 2 | 147 × 147 × 64 |
| Conv | 3 × 3 / 1 | 73 × 73 × 64 |
| Conv | 3 × 3 / 2 | 71 × 71 × 80 |
| Conv | 3 × 3 / 1 | 35 × 35 × 192 |
| 3 × Inception | Figure 11(a) | 35 × 35 × 288 |
| 5 × Inception | Figure 11(b) | 17 × 17 × 768 |
| 2 × Inception | Figure 11(c) | 8 × 8 × 1280 |
| Pool | 8 × 8 | 8 × 8 × 2048 |
| Linear | Logits | 1 × 1 × 2048 |
| Softmax | Classifier | 1 × 1 × 1000 |

Conv: convolution.

Structure of the modified GoogleNet model (Model 2).

| Type | Filter size / stride | Input size |
|---|---|---|
| Conv | 3 × 3 / 2 | 229 × 229 |
| Conv | 3 × 3 / 1 | 149 × 149 × 32 |
| Conv padded | 3 × 3 / 1 | 147 × 147 × 32 |
| Pool | 3 × 3 / 2 | 147 × 147 × 64 |
| Conv | 3 × 3 / 1 | 73 × 73 × 64 |
| Conv | 3 × 3 / 2 | 71 × 71 × 80 |
| Conv | 3 × 3 / 1 | 35 × 35 × 192 |
| 5 × Inception | Figure 11(a) | 35 × 35 × 288 |
| 5 × Inception | Figure 11(b) | 17 × 17 × 768 |
| 2 × Inception | Figure 11(c) | 8 × 8 × 1280 |
| Pool | 8 × 8 | 8 × 8 × 2048 |
| Linear | Logits | 1 × 1 × 2048 |
| Softmax | Classifier | 1 × 1 × 1000 |

Conv: convolution.

Type and number of leaves.

| Leaf type | Number of images |
|---|---|
| Lanceolate | 568 |
| Oval | 554 |
| Acicular | 612 |
| Linear | 439 |
| Oblong | 374 |
| Reniform (kidney-shaped) | 580 |
| Cordate (heart-shaped) | 379 |
| Palmate leaf | 361 |

Model performance evaluation.

| | Model 1 | Model 2 |
|---|---|---|
| Image size | 229 × 229 | 229 × 229 |
| Training time | 8h 43m | 9h 18m |
| Accuracy | 99.6% | 99.8% |

Accuracy rate (%) in relation to discoloration.

| | Model 1 | Model 2 |
|---|---|---|
| Image in Figure 14(b) | 99.5 | 99.65 |
| Image in Figure 14(c) | 99.2 | 99.3 |
| Image in Figure 14(d) | 98.8 | 99.1 |
| Image in Figure 14(e) | 98.5 | 98.9 |
| Image in Figure 14(f) | 98.2 | 98.6 |

Accuracy rate (%) in relation to damage.

| | Model 1 | Model 2 |
|---|---|---|
| Image in Figure 15(a) | 97.4 | 98.4 |
| Image in Figure 15(b) | 96.8 | 98 |
| Image in Figure 15(c) | 96.2 | 97.6 |
| Image in Figure 15(d) | 94.4 | 95 |

### References

1. Friis, I, and Balslev, H 2003. Plant diversity and complexity patterns: local, regional, and global dimensions., Proceedings of an International Symposium, Held at the Royal Danish Academy of Sciences and Letters, Copenhagen, Denmark, pp.25-28.
2. Nam, Y, and Hwang, E (2005). A representation and matching method for shape-based leaf image retrieval. Journal of KIISE: Software and Applications. 32, 1013-1021.
3. Nam, Y, Park, J, Hwang, E, and Kim, D (2006). Shape-based leaf image retrieval using venation feature. Proceedings of 2006 Korea Computer Congress. 33, 346-348.
4. Krizhevsky, A, Sutskever, I, and Hinton, GE 2012. ImageNet classification with deep convolutional neural networks., Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, pp.1097-1105.
5. Du, JX, Wang, XF, and Zhang, GJ (2007). Leaf shape based plant species recognition. Applied Mathematics and Computation. 185, 883-893.
6. Dalal, N, and Triggs, B 2005. Histograms of oriented gradients for human detection., Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, Array, pp.886-893.
7. Lowe, DG (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 60, 91-110.
8. Freund, Y, and Schapire, RE (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the 13th International Conference. 96, 148-156.
9. Cortes, C, and Vapnik, V (1995). Support-vector networks. Machine Learning. 20, 273-297.
10. Chum, E. Drones and artificial intelligence–AI. Available http://www.focus.kr/view.php?key=2016041200101856440
11. LeCun, Y, Bottou, L, Bengio, Y, and Haffner, P (2002). Gradient-based learning applied to document recognition. Proceedings of the IEEE. 86, 2278-2324.
12. Szegedy, C, Liu, W, Jia, Y, Sermanet, P, Reed, S, Anguelov, D, Erhan, D, Vanhoucke, V, and Rabinovich, A 2015. Going deeper with convolutions., Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, Array.
13. Simonyan, K, and Zisserman, A (2015). Very deep convolutional networks for large-scale image recognition. Available https://arxiv.org/abs/1409.1556
14. Wang, L, Lee, CY, Tu, Z, and Lazebnik, S (2015). Training deeper convolutional networks with deep supervision. Available https://arxiv.org/abs/1505.02496
15. Lee, CY, Xie, S, Gallagher, P, Zhang, Z, and Tu, Z (2014). Deeply-supervised nets. Available https://arxiv.org/abs/1409.5185
16. Ioffe, S, and Szegedy, C 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift., Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp.448-456.
17. Wu, SG, Bao, FS, Xu, EY, Wang, YX, Chang, YF, and Xiang, QL (2007). A leaf recognition algorithm for plant classification using probabilistic neural network. Available http://flavia.sourceforge.net/