International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 223-232
Published online September 25, 2022
https://doi.org/10.5391/IJFIS.2022.22.3.223
© The Korean Institute of Intelligent Systems
Dheo Prasetyo Nugroho, Sigit Widiyanto, and Dini Tri Wardani
Department of Information System Management, Gunadarma University, Depok, Indonesia
Correspondence to: Dheo Prasetyo Nugroho (dheoprasetyo.dp@gmail.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Examination of technological developments in agriculture reveals that few applications use cameras to detect tomato ripeness; consequently, tomato maturity is still determined manually. Technological advances and developments are currently occurring rapidly and are therefore also inseparable from the agricultural sector. Object detection can help determine tomato ripeness. In this research, faster region-based convolutional neural network (Faster R-CNN), single shot multibox detector (SSD), and you only look once (YOLO) models were tested to recognize or detect tomato ripeness from input images. The model training process required 5 hours and produced a total loss value below 0.5; as the total loss became smaller, the predicted results improved. Tests were conducted on a training dataset, and average accuracy values of 99.55%, 89.3%, and 94.6% were achieved using the Faster R-CNN, SSD, and YOLO models, respectively.
Keywords: SSD, Faster R-CNN, YOLO, Object detection
Currently, technological advances and developments are occurring rapidly, making work easier and providing many benefits in various fields. Technological progress is also inseparable from the agricultural sector and is having a positive impact on it. One technological development that can be applied in agriculture is the detection of tomato maturity, which makes it easier to detect and select the tomatoes to be harvested. In a greenhouse in Citayam, tomato ripeness is still assessed manually, and the weakness of this method is its varying level of accuracy. Digital images are crucial to determining tomato maturity, because tomato ripeness can be estimated from color. Without such a system, checking remains manual and time-consuming; tomatoes may be missed, the effectiveness is low, and missed tomatoes subsequently rot.
To obtain the best results, a detection process for tomato maturity should first classify ripe and unripe tomatoes and then conduct training, yielding an overview of the features that can serve as markers or indicators of tomato ripeness. Such a detection process can be developed by comparing tomatoes with images that have already been classified.
Image classification is closely related to object detection. The former is the categorization of an image into certain categories. Object detection is a technology for detecting an object in an image or a video; it can also identify two or more objects that appear similar to each other, and the accuracy of detecting similar objects can be very high if the detection process is supported by adequate training data. There are several methods for detecting objects in an image, one of which is the faster region-based convolutional neural network (Faster R-CNN) [1], which is used to detect and recognize objects in an image.
Considering the technological developments in agriculture, not many studies have used a camera to detect tomato maturity, particularly in the greenhouse in Citayam, where tomato maturity is currently determined manually. Against this background, and to detect tomato ripeness effectively and efficiently for the greenhouse in Citayam, this study was intended to design a program and analyze the comparative levels of accuracy obtained using the Faster R-CNN, SSD, and YOLO models.
A Faster R-CNN is used to detect objects in an image. It uses the Fast R-CNN and region proposal network (RPN) methods as its main architecture: a Faster R-CNN is essentially a Fast R-CNN in which an RPN replaces the selective search stage. The RPN is placed after the CNN layers [2], and its output is fed to the region of interest (ROI) pooling layer, followed by a classifier and a bounding box regressor. The Faster R-CNN architecture is shown in Figure 1.
An SSD recognizes or detects an object in an image using a single deep neural network. It is faster and significantly more accurate than the previous state of the art for single-shot detectors (YOLO), and it is as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN). During training, it needs only an input image and a ground truth box for each object.
Figure 2 shows that an SSD comprises two network parts: the VGG-16 base network and the extra feature layers. VGG-16 is used as the base network because it performs strongly on high-quality images. The SSD adds convolutional feature layers to the end of the base network to predict detections at different scales and aspect ratios [3].
YOLO uses a new approach for object detection: detection is framed as a regression problem, mapping an image directly to spatially separated bounding boxes and associated class probabilities. A single neural network predicts the bounding boxes and class probabilities from the entire image in one evaluation. Because the entire detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. The YOLO architecture consists of 24 convolutional layers followed by 2 fully connected layers (Figure 3); alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers [4].
Object detection is used to determine the existence of objects in an image or a video (object localization) and to categorize each object (object classification). Object detection can be further classified into soft and hard detection: soft detection only detects the presence of an object, whereas hard detection detects an object and its location [5]. Object detection identifies the location of an object in an image and draws a bounding box around it; it therefore typically involves two processes, object classification and drawing bounding boxes around objects.
Image classification assigns an image to a specific category. Image localization is the task of locating the class object in an image, with the resulting output generally shown as a bounding box. Object detection combines classification and localization to detect several objects in an image (Figure 4).
Several previous studies have used similar methods in agriculture to detect fruits. The first of five earlier studies concerned the real-time detection of fruit ripeness using the YOLOv4 algorithm [6]; it succeeded in detecting banana ripeness with an accuracy of 87.6% during testing.
The second study used the deep learning architecture of Faster R-CNN for the quality classification of nutmeg (Myristica fragrans) [7].
The third study used an SSD for robust cherry tomato detection in a greenhouse. The SSD was selected for its excellent anti-interference ability and capacity for self-learning from datasets. Analysis of the experimental results showed that the Inception-v2 network was a better feature extractor for the SSD network in cherry tomato detection than the VGG16 and MobileNet networks [5].
In a fourth study [8], the Faster R-CNN method and an intuitionistic fuzzy set were used for the automatic detection of single ripe tomatoes on a plant; with accurate classification, the trained Faster R-CNN rapidly localized candidate ripe tomato regions. In a recent fifth study, detecting tomatoes in greenhouses using SSD and YOLO models showed that the MobileNet-v2 SSD model outperforms YOLOv4; it also has the lowest false positive ratio and computes rapidly. Although its inference time is not as short, the YOLOv4 Tiny model also yielded excellent results and processed images in approximately 5 ms [9].
In general, the research method used in this study was divided into four stages (Figure 5). The first stage is pre-processing, which involves data collection followed by data annotation. The second stage is training, in which a model is constructed, the config file is configured, and the tomato ripeness classes are defined; the training process uses the TensorFlow object detection API framework and the Faster R-CNN, SSD, and YOLO methods. In the third stage, tomato ripeness recognition is tested using the training image data themselves. In the last stage, the testing results are analyzed.
Tomato ripeness detection requires a tomato dataset. In this study, the tomato fruit dataset was collected from a smartphone camera and Kaggle. The collected tomato fruit ripeness image data comprised 400 images (Table 1). The data were divided into two data groups: training and test data.
An example of the image dataset is shown in Figure 6.
The next stage was the pre-processing and selection of the tomato images. The collected pictures were placed into separate folders: ripe, half-ripe, half-unripe, and unripe tomatoes. The percentages of images used as training and test data were then determined: the data were divided in a 70:30 proportion into training and test sets of 280 and 120 images, respectively, as sketched below. The images in both sets were annotated using the LabelImg application. Image annotation is the labelling of the images of a dataset so that a model can learn and provide results based on the quality of the given data; here, the labels were ripe, half-ripe, half-unripe, and unripe tomatoes. The annotation process is presented in Figure 7.
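For illustration, the following minimal Python sketch reproduces the 70:30 split; the file names and random seed are placeholders, not the authors' actual files.

```python
# Minimal sketch of the 70:30 train/test split described above.
# The file names and seed are illustrative placeholders.
import random

images = [f"tomato_{i:03d}.jpg" for i in range(400)]  # 400 collected images
random.seed(0)           # fixed seed so the split is reproducible
random.shuffle(images)
train, test = images[:280], images[280:]
print(len(train), len(test))  # 280 120
```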
Image annotation was conducted in the LabelImg application by drawing a box around each object contained in an image (Figure 8). The boxed object was then defined and labelled according to its classification class (ripe, half-ripe, half-unripe, or unripe tomato).
The saved annotation file had an .xml extension in the well-known Pascal VOC format and contained the class name and the bounding box coordinates of each annotated object.
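For illustration, the following minimal Python sketch reads one such Pascal VOC .xml file using only the standard library; the file name is a hypothetical placeholder.

```python
# Minimal sketch: extract the class name and bounding box coordinates from a
# Pascal VOC annotation produced by LabelImg. "tomato_001.xml" is hypothetical.
import xml.etree.ElementTree as ET

def read_voc_annotation(path):
    """Return (class_name, (xmin, ymin, xmax, ymax)) tuples from a VOC file."""
    root = ET.parse(path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text  # e.g., "ripe" or "half-ripe"
        bb = obj.find("bndbox")
        coords = tuple(int(float(bb.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, coords))
    return boxes

print(read_voc_annotation("tomato_001.xml"))
```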
Figure 10 displays the flow of the training process for the Faster R-CNN method using a dataset input by the user. More details can be observed in Figure 11.
The flowchart shows that the training process begins by inputting an image dataset that is labelled or annotated with four classes: ripe, half-ripe, half-unripe, and unripe tomatoes. Subsequently, the training configuration is set by entering the batch size, steps, and path configuration: the batch size is 1, the number of steps in the training process is 200,000, and the paths are adjusted to match the files on the computer or laptop. The Faster R-CNN training pipeline is then initialized; the model used for transfer learning and fine-tuning is Faster R-CNN Inception-v2. After the training process on the dataset is completed using the compiled model and configuration, the results are saved in a .ckpt file, with checkpoints saved automatically every 10 minutes while training runs. Finally, the .ckpt file is converted into a frozen inference graph in ProtoBuf format, to be used for the detection and classification processes of the system.
The pictures in the dataset are annotated or labeled before entering the convolution process in the CNN model. This process involves rectified linear unit activation and max pooling with 3 × 3 and 2 × 2 grids in the first and second layers, respectively. The CNN output is input to the RPN process. The RPN works by sliding a window that uses nine anchors to search for regions that contain objects; it separates the background from objects to produce a set of areas predicted to contain the object. The ROI pooling process reduces the size of the areas on the feature maps from the RPN and CNN processes. The results of ROI pooling are subsequently classified, and a bounding box is created around the location of the tomato object along with the class of the object.
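For illustration, the following sketch generates the nine anchors used at one sliding-window position. The three scales and three aspect ratios follow common Faster R-CNN defaults and are assumptions here, not values reported for this study.

```python
# Nine anchors (3 scales x 3 aspect ratios) centred at one sliding-window
# position, as in the RPN described above. Scale/ratio values are the usual
# Faster R-CNN defaults, assumed for illustration.
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return nine (x1, y1, x2, y2) anchors centred at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)  # area stays ~s*s, w/h = r
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(400, 300).shape)  # (9, 4)
```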
The flowchart of the SSD training process on a dataset input by the user is the same as that of the Faster R-CNN training process shown in Figure 10, except the configuration is different.
As shown in Figure 12, the previously annotated images are convolved by the MobileNet network, which functions as the feature extractor for the SSD. From the input images, several feature maps with different scales are created, and convolutional filters then predict bounding boxes and classifications. The non-maximum suppression method eliminates redundant overlapping bounding boxes (by their Jaccard overlap), leaving only one bounding box per detected object.
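The suppression step can be sketched as follows: boxes are visited in descending score order, and any remaining box whose IoU (Jaccard index) with an already-kept box exceeds a threshold is discarded. This is a minimal NumPy sketch; the 0.5 threshold is illustrative.

```python
# Minimal non-maximum suppression: keep the highest-scoring box, drop boxes
# that overlap it too strongly (IoU above the threshold), then repeat.
import numpy as np

def iou(box, boxes):
    """IoU (Jaccard index) between one box and an array of [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]  # indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```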
Figure 13 displays the flow of the training process of YOLO on a dataset input by a user. More details can be observed in Figure 14.
The flowchart shows that the training process begins by inputting an image dataset already labeled or annotated with four classes: ripe, half-ripe, half-unripe, and unripe tomatoes. Subsequently, the training configuration is set by entering the batch size, steps, and path configuration: the batch size is 64, the subdivision is 16, the width and height are 416, and the maximum number of batches is 8,000. The YOLOv4 model is used for transfer learning and fine-tuning. After the training process on the dataset is completed using the compiled model and configuration, the results are saved in a .weights file, which is saved automatically every 100 iterations while training runs. Finally, the .weights file is used for the detection and classification processes of the system.
As shown in Figure 14, the previously annotated images are resized to 416 × 416, after which the YOLO model divides them into an S × S grid. Each grid cell predicts bounding boxes and confidence scores. The YOLO model uses a single neural network to predict bounding boxes and class probabilities directly in one evaluation. The detection results produce many bounding boxes; to obtain the correct bounding box, non-maximum suppression removes the bounding boxes with low confidence values.
Figure 15 shows the flow process of image prediction. More details can be observed in Figure 16.
Based on Figure 15, the image and video prediction process begins by inputting the images used to recognize or detect tomato ripeness. The system reads the input images for further processing to identify the objects they contain. The Faster R-CNN and SSD models use the results of the training process converted into a ProtoBuf file and use a label map to obtain the label definitions for object prediction. YOLO requires the results of the training process in the form of trained weights, a YOLO config file, and an obj.data file. The system then reads the frames of the input images and displays the prediction results for the detected objects.
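The Faster R-CNN/SSD prediction path can be sketched as follows. This is an illustrative reconstruction, not the authors' exact script: it assumes the standard TensorFlow Object Detection API output tensor names and a placeholder path to the exported frozen graph.

```python
# Hedged sketch of running inference with a frozen ProtoBuf graph exported by
# the TensorFlow Object Detection API. Paths and the dummy image are
# placeholders; tensor names are the API's standard ones.
import numpy as np
import tensorflow as tf  # TF1-style graph execution via tf.compat.v1

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile("inference_graph/frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

with tf.compat.v1.Session(graph=graph) as sess:
    image = np.zeros((1, 300, 300, 3), dtype=np.uint8)  # stand-in for a frame
    boxes, scores, classes = sess.run(
        ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
        feed_dict={"image_tensor:0": image},
    )
# The numeric class indices are mapped to the labels (ripe, half-ripe, ...)
# through the label map, as described above.
```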
This section discusses the configuration conducted before the training process. For the Faster R-CNN, the model resizes a training image to a maximum size of 1024 pixels, and the objects are classified into four classes. The training config is used to set the model training parameters. The fine-tuning path is adjusted to the faster_rcnn_inception_v2_coco directory, and the number of training steps, denoted num_steps, is set to 200,000.
For the SSD, the model resizes a training image to a maximum size of 300 pixels, and the objects are likewise classified into four classes. The training config sets the training parameters. The fine-tuning path is adjusted to the ssd_mobilenet_v2_coco directory, and num_steps is again set to 200,000.
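These settings correspond to fragments of the pipeline.config files of the TensorFlow Object Detection API (protobuf text format). A hedged sketch is shown below: the checkpoint paths are placeholders, and the 600-pixel minimum dimension follows the stock faster_rcnn_inception_v2_coco configuration rather than a value reported by the authors.

```
# Faster R-CNN fragment (illustrative)
model {
  faster_rcnn {
    num_classes: 4
    image_resizer {
      keep_aspect_ratio_resizer { min_dimension: 600 max_dimension: 1024 }
    }
  }
}
train_config {
  batch_size: 1
  num_steps: 200000
  fine_tune_checkpoint: "faster_rcnn_inception_v2_coco/model.ckpt"
}

# SSD fragment (illustrative; batch size not stated in the text)
model {
  ssd {
    num_classes: 4
    image_resizer {
      fixed_shape_resizer { height: 300 width: 300 }
    }
  }
}
train_config {
  num_steps: 200000
  fine_tune_checkpoint: "ssd_mobilenet_v2_coco/model.ckpt"
}
```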
For YOLO, several lines of the config file were adjusted according to the data to be trained. The batch was changed to 64, the subdivisions to 16, the width and height to 416, and the max batches value to 8,000. This number was chosen because there are four classes: the maximum batches value is calculated as classes × 2,000, and it must be no less than the number of training images and no less than 6,000; the training therefore iterates over the data 8,000 times. The steps were set to 80% and 90% of the max batches value, that is, 6,400 and 7,200. Subsequently, the convolutional layers were set and the classes were entered. Four classes were used, meaning that the dataset was trained on four classifications, which also affects the filter value: filters = (classes + 5) × 3, giving a filter value of 27.
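This arithmetic can be checked directly; the short Python sketch below reproduces it, following the standard Darknet/YOLOv4 training guidance.

```python
# Reproduce the YOLOv4 cfg arithmetic described above.
num_classes = 4
max_batches = max(num_classes * 2000, 6000)               # -> 8000
steps = (int(0.8 * max_batches), int(0.9 * max_batches))  # -> (6400, 7200)
filters = (num_classes + 5) * 3                           # per pre-YOLO layer -> 27
print(max_batches, steps, filters)
```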
Data training was conducted on Google Colaboratory, Google's cloud service, which is accessible with a 12-hour time limit. For the Faster R-CNN and SSD training, checkpoint files were saved automatically every 10 minutes, whereas for YOLO they were saved automatically every 100 iterations. Using a GPU can increase the training speed compared with using the CPU of a computer. The training process is shown in Figure 17.
The training produced loss values for each step and yielded the average precision (AP) and mean AP (mAP) values. The loss indicates the inaccuracy of the model predictions; the optimal loss value should be below 0.5. A graph of the loss was obtained by accessing the TensorBoard data via the command prompt, as shown in Figure 18, and the mAP results are summarized in Table 2.
The training process required 5 hours and involved 51,036, 14,624, and 1,500 iterations for the Faster R-CNN, SSD, and YOLO methods, respectively. The models trained with Faster R-CNN and SSD were subsequently converted into graph files and stored in the inference graph directory; the conversion used the last checkpoint file from the training process.
At this stage, the success or accuracy of the classification results obtained from the program was tested. Accuracy testing was conducted using images from the training data taken with a smartphone camera. Objects detected by the program were marked with bounding boxes around the visible objects. An example detection result is shown in Figure 19.
Using the Faster R-CNN, SSD, and YOLO methods, accuracy was tested on five images for each of the four annotation classes taken from the training data. The results are summarized in Table 3.
It can be seen that the identification accuracies for half-ripe, half-unripe, ripe, and unripe tomatoes are high. The average accuracy values of the Faster R-CNN, SSD, and YOLO models are 99.55%, 89.3%, and 94.6%, respectively.
In addition to testing with images, real-time testing was conducted using a webcam to determine the speed and applicability of the models in daily life. The tests were conducted at two different webcam resolutions; the difference in frames per second (fps) at each resolution can affect the detection process. An example test result obtained using the camera is shown in Figure 20.
The results of this test are the fps values at the two resolutions for each model. The overall test results are shown in Table 4.
It can be concluded that the Faster R-CNN method requires a higher-specification device than the SSD and YOLO methods. Moreover, the use of a GPU is recommended to obtain improved results. For real-time use, the YOLO and SSD methods are recommended; however, their results are not as accurate as those of the Faster R-CNN method.
This research identified objects in images using deep learning methods. Specifically, the Faster R-CNN, SSD, and YOLO models were implemented in Jupyter Notebook. The program was written in Python using TensorFlow libraries.
Training on a dataset of 400 images required 5 hours and produced a graph with a total loss of less than 2; as the total loss value decreased, the prediction results improved. Testing on several images from the training data yielded relatively good average accuracies of 99.55%, 89.3%, and 94.6% for the Faster R-CNN, SSD, and YOLO models, respectively.
For further research, using a more powerful GPU is recommended to improve the process, effectiveness, and efficiency of the training and identification stages. For the models to be applied to real-time cases (for example, in a greenhouse), the deep learning algorithms would need to be embedded in IoT devices.
No potential conflict of interest relevant to this article was reported.
Table 1. Image data collection.
Image | Number of images |
---|---|
Ripe tomatoes | 100 |
Half-ripe tomatoes | 100 |
Half-unripe tomatoes | 100 |
Unripe tomatoes | 100 |
Table 2. mAP results.
Model | mAP |
---|---|
Faster R-CNN | 0.908 |
SSD | 0.858 |
YOLO | 0.778 |
Table 3. Results of training.
Image | Faster R-CNN | SSD | YOLO |
---|---|---|---|
Half-ripe1 | 100% | 78% | 100% |
Half-ripe2 | 100% | 97% | 100% |
Half-ripe3 | 100% | 100% | 99% |
Half-ripe4 | 99% | 65% | 100% |
Half-ripe5 | 100% | 99% | 99% |
Half-unripe1 | 98% | 62% | 63% |
Half-unripe2 | 100% | 97% | 93% |
Half-unripe3 | 100% | 80% | 53% |
Half-unripe4 | 98% | 98% | 99% |
Half-unripe5 | 100% | 98% | 95% |
Ripe1 | 100% | 100% | 100% |
Ripe2 | 100% | 97% | 99% |
Ripe3 | 100% | 98% | 97% |
Ripe4 | 100% | 98% | 100% |
Ripe5 | 99% | 51% | 98% |
Unripe1 | 99% | 83% | 100% |
Unripe2 | 100% | 100% | 98% |
Unripe3 | 99% | 100% | 100% |
Unripe4 | 99% | 87% | 100% |
Unripe5 | 100% | 98% | 99% |
Table 4. Detection speed.
Resolution | Faster R-CNN | SSD | YOLO |
---|---|---|---|
1024 × 720 | 0.58 fps | 4.53 fps | 2.23 fps |
720 × 480 | 0.63 fps | 5.24 fps | 2.67 fps |