Novel green-fruit detection algorithm based on D2D framework

Abstract: In complex orchard environments, efficient and accurate detection of the target fruit is a basic requirement for orchard yield measurement and automatic harvesting. Green fruits are hard to distinguish from the background because of their similar color, and detection is further complicated by ambient light and the camera angle at which the images are taken. In this study, a two-stage dense-to-detection framework (D2D) was proposed to detect green fruits in orchard environments. The proposed model extracted multi-scale features of the target fruit with a MobileNetV2 + feature pyramid network (FPN) structure and generated region proposals of the target fruit with a Region Proposal Network (RPN). In the regression branch, the offset of each local feature was calculated, and the positive and negative samples of the region proposals were predicted by a binary mask prediction to reduce the interference of the background with the prediction box. In the classification branch, features were extracted from each sub-region of the region proposal, and features with distinguishing information were obtained through adaptive weighted pooling to achieve accurate classification. The proposed model adopted an anchor-free design, which improves the generalization ability, makes the model more robust, and reduces the storage requirements. The experimental results on the persimmon and green apple datasets show that the new model has the best detection performance, which can provide a theoretical reference for the detection of other green objects.


Introduction
At present, picking is still largely done by hand in fruit and vegetable production, which makes it the most time-consuming and laborious link in the whole production cycle [1,2]. Fruit and vegetable picking robots can effectively alleviate the high labor cost and low efficiency of manual picking. As an important part of a fruit-picking robot, the visual recognition system directly affects picking quality through the accuracy, efficiency, and robustness of its fruit detection [3,4]. However, in a complex orchard environment, the visual recognition system is influenced by many factors, such as light intensity, shooting angle, leaf occlusion, fruit color, and background, all of which make the target fruit harder to detect. Therefore, a stable visual recognition system is the key for fruit-harvesting robots to detect the target fruit efficiently and to support the intelligent management of the orchard.
Traditional machine learning has laid the foundation for computer vision research, and its methods are now mature. Its simple workflow is favored by researchers. It plays an important role in the field of object detection [5][6][7] and has achieved good results in green fruit detection.
Arefi [8] first removed the background in Red-Green-Blue (RGB) space, then extracted the ripe tomato area by combining the RGB, Hue-Saturation-Intensity (HSI), and YIQ color spaces. Finally, shape features were used to locate the fruit area, and the overall accuracy of the algorithm reached 96.36%. Linker [9] proposed a "four-step" strategy for estimating orchard yield. The method was mainly based on fruit color, texture, edge shape, and other feature information, and the recognition rate of green apples reached 95% under natural light conditions. Working in RGB color space, Liao [10] used the Otsu threshold segmentation algorithm to remove the influence of branches in green apple images, extracted the gray-scale and texture features of leaves and apples, and established a random forest recognition model; the recognition accuracy of green apples reached 88%. Tian [11] proposed a target fruit localization method based on depth images, which located the center and radius of the apple circle from the depth image and its corresponding RGB information so as to fit the target area. Li [12] combined saliency detection with a Gaussian curve fitting algorithm to detect green apples in natural scenes, and the experimental results indicated that the approach was effective and feasible. These traditional machine learning methods rely mostly on the texture features and color of the target fruit. When the fruit is affected by light intensity, occlusion, or color similarity between the fruit and the leaves, its texture features become indistinct and its shape incomplete. As a result, these methods struggle to meet the accuracy and speed requirements of practical deployment.
In recent years, with the development of deep learning and convolutional neural networks (CNN), the accuracy of image recognition has been greatly improved. End-to-end automatic detection and deep extraction of image features eliminate many complex operations of traditional visual algorithms, which has attracted many researchers to apply deep learning to target fruit localization and recognition [13,14]. Bargoti et al. [15] first used a multi-scale multi-layer perceptron and a CNN to segment apple images and extract apple targets, then used the watershed algorithm and the circular Hough transform to identify and count them. Kang [16] obtained DASNet-V2 by improving DASNet: on top of class-level segmentation of the target fruit, it further realized instance-level segmentation of the target fruit and class-level segmentation of branches and leaves, and solved the problem in DASNet of a cluster of overlapping fruits being identified as a single region. Jia [17] improved the instance segmentation model Mask RCNN to adapt it to the detection of apple targets. By combining the residual network (ResNet) and densely connected convolutional networks (DenseNet) as the feature extraction network of the original model, the detection accuracy of apple targets under overlap and foliage occlusion was greatly improved. Wang [18] proposed an apple recognition model based on R-FCN, in which the target fruits were detected in two stages by means of ResNet, RPN, and ROI sub-net modules. The recognition accuracy of this method was 95.1% on a test set containing occluded, blurred, and overlapping apple targets. The accuracy and applicability of these deep learning vision models are better than those of traditional machine learning methods. However, they need large amounts of computing and storage resources, so their efficiency cannot meet the real-time needs of picking robots. In addition, the power consumption and stability of the picking robot should be considered when it is deployed in a real environment.
In order to improve the accuracy of target fruit recognition and enable the robot to meet real-time operation requirements in the complex orchard environment, an object detection model optimized by the dense-to-detection framework (D2D) was proposed. The new model uses the lightweight network MobileNetV2 [19] as the backbone for feature extraction, which reduces the demand for computing and storage resources. The backbone is connected to a feature pyramid network (FPN) [20] to realize multi-scale feature fusion, and an RPN [21] is added to extract regions of interest (ROI). Together these enhance the feature information and improve the robustness of the model to the complex orchard environment. Moreover, the multiple local regressions and discriminative ROI pooling of the new model make target regression and classification more accurate and improve the real-time efficiency of intelligent picking. The new model addresses the low accuracy of fruit recognition in earlier picking robots. Model comparison experiments show that the new model has an advantage in accuracy, which can further promote the development of intelligent picking technology and cross-domain target detection.

Datasets collection and labeling
In this study, immature persimmons and green apples were selected as research objects; both fruits are green and roughly spherical, which met the research requirements.
Image collection location: The mountain behind Shandong Normal University (Changqing campus) and the southern mountainous area of Jinan City, China.
Image collection equipment: Canon EOS 80D SLR camera with a CMOS (complementary metal-oxide-semiconductor) image sensor. The image resolution was 6000×4000 pixels, and the images were saved in JPG format as 24-bit color images.
Image collection conditions and setup: The images were taken in real, natural orchard scenes from multiple angles and under varied conditions, including distant view, close view, side view, front view, overlap, and occlusion. Images were collected in the early morning, at noon, and at night. In the early morning, fruit images were collected under soft light, when frost or dew might appear on the fruit. At noon, fruit images were collected under strong light (both front-lit and backlit), and at night, fruit images were collected under LED artificial auxiliary light. The collected images therefore reflect the complexity of the orchard environment and are random and representative, which helps the resulting model meet the real-time operation requirements of agricultural equipment.
A total of 568 images of persimmons and 1361 images of apples were collected in the experiment, which were used as datasets after post-processing. As shown in Figures 1a and 1b, there are many complex situations, such as overlap, backlight, occlusion, direct light, distant view, night view, and so on. More representative and convincing results can be obtained by using data from many complex environments.
To meet the requirements of real-time orchard object detection and reduce the subsequent experiment time, the images were downsampled from 6000×4000 pixels to 600×400 pixels. Before the datasets were built, the images were preprocessed, including normalization, clipping, flipping, smoothing, and other operations.
As shown in Figures 1a and 1b, when the fruit is shaded by leaves and branches or overlaps with other fruit, its outline tends to be unclear and incomplete. At night or in rainy weather, detection accuracy tends to decrease because of changes in light and raindrops on the fruit surface.
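The resizing and preprocessing steps can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: it assumes NumPy is available, downsamples by block averaging (the resampling method is not stated in the text), and the function names are hypothetical.

```python
import numpy as np

def downsample(image: np.ndarray, factor: int = 10) -> np.ndarray:
    """Shrink an H x W x C image by averaging non-overlapping factor x factor blocks."""
    h, w, c = image.shape
    h_out, w_out = h // factor, w // factor
    # Crop so that both sides are exact multiples of the factor.
    cropped = image[:h_out * factor, :w_out * factor]
    blocks = cropped.reshape(h_out, factor, w_out, factor, c)
    return blocks.mean(axis=(1, 3))

def normalize(image: np.ndarray) -> np.ndarray:
    """Scale 8-bit pixel values to the range [0, 1]."""
    return image.astype(np.float32) / 255.0

def horizontal_flip(image: np.ndarray) -> np.ndarray:
    """Mirror the image left to right."""
    return image[:, ::-1]

# A small random array stands in for a photograph (assumed data, not from the study).
rng = np.random.default_rng(0)
photo = rng.integers(0, 256, size=(400, 600, 3), dtype=np.uint8)
small = normalize(downsample(photo, factor=10))
flipped = horizontal_flip(small)
```

A factor of 10 matches the reduction from 6000×4000 to 600×400 pixels; when flipping is used for augmentation, the annotations would need to be mirrored in the same way.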
First, LabelMe software was used to annotate the green fruit images: the contour of each fruit was marked as a connected area, and the category of the fruit was labeled. The coordinates of the marking points and the labeled category were saved in a corresponding JSON file. Then, the JSON files were converted into COCO-format datasets with LabelMe. The persimmon dataset contains 568 images, with 412 in the training set and 156 in the test set. The apple dataset contains 1361 images, with 953 in the training set and 408 in the test set.
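To make the annotation conversion concrete, the sketch below turns LabelMe-style polygon records into a COCO-style dictionary. It is a hand-written stand-in, not the LabelMe converter used in the study; the function names, the example record, and the file name are hypothetical, and only the commonly used LabelMe keys (`shapes`, `label`, `points`, `imagePath`, `imageHeight`, `imageWidth`) are assumed.

```python
import json

def polygon_to_bbox(points):
    """Return the COCO-style [x, y, width, height] box enclosing a polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]

def polygon_area(points):
    """Area of a simple polygon by the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def labelme_to_coco(records, categories):
    """Merge LabelMe-style records (one dict per image) into one COCO-style dict."""
    cat_ids = {name: i + 1 for i, name in enumerate(categories)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    for image_id, rec in enumerate(records, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": rec["imagePath"],
            "height": rec["imageHeight"],
            "width": rec["imageWidth"],
        })
        for shape in rec["shapes"]:
            pts = [list(p) for p in shape["points"]]
            coco["annotations"].append({
                "id": len(coco["annotations"]) + 1,
                "image_id": image_id,
                "category_id": cat_ids[shape["label"]],
                "segmentation": [[v for p in pts for v in p]],
                "bbox": polygon_to_bbox(pts),
                "area": polygon_area(pts),
                "iscrowd": 0,
            })
    return coco

# Hypothetical record with one square "persimmon" contour.
record = {
    "imagePath": "example.jpg",
    "imageHeight": 400,
    "imageWidth": 600,
    "shapes": [{"label": "persimmon",
                "points": [[10, 20], [50, 20], [50, 60], [10, 60]]}],
}
coco = labelme_to_coco([record], ["persimmon"])
print(json.dumps(coco["annotations"][0]["bbox"]))  # [10, 20, 40, 40]
```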

D2D object detection network for fruit detection
Inspired by the D2D model [22] and considering the complex orchard environment, a two-stage anchor-free D2D detection model was proposed to realize efficient and accurate detection of green targets. The two-stage design improves detection accuracy: the model generates the target bounding box in the first stage and identifies the category of that bounding box in the second stage. Figure 2 shows the overall structure of the new model. The lightweight MobileNetV2 structure was selected as the backbone network. Using MobileNetV3 [23] instead did not greatly improve the accuracy and reduced the robustness of the model, so MobileNetV2, which gave the best overall result, was chosen. Image features were extracted by first increasing and then decreasing the channel dimension. Then, the RPN structure was used to generate region proposals of green target fruits. Because classification and regression are sensitive to different regions of the feature space, they were divided into two independent parallel branches in the D2D network structure. The regression branch accurately located the target in the input ROI feature and generated a detection box containing the green fruit; the classification branch classified the input proposal and generated the classification label and classification confidence. Finally, the new model integrated the outputs of the classification and regression branches and output detection boxes with classification labels and confidence scores.
The regression branch calculated the offsets of the k×k local features of the ROI feature to the ground truth (GT) box and determined whether each local feature belonged to the positive sample (the local features of ROI>0.5 belonged to the positive sample). Finally, the average of all positive sample offsets was taken as the global offset. Because the regression is computed over dense local features, the target location is not tied to the coordinates of a single central point (as in, for example, Faster RCNN), which makes localization more accurate, reduces the dependence on any one point, and improves the robot's detection accuracy for green target fruit in the complex orchard environment.
The ROI Align [24] of k/2×k/2 size was selected for the classification branch, followed by fully connected layers that realize lightweight offset prediction and yield an ROI feature of 2k×2k size. The four sampling points of each sub-region of the ROI feature were assigned different weights adaptively through a convolution operation. Sampling points with discriminative features were assigned higher weights to obtain more effective feature information and improve classification accuracy.
Note: RPN: region proposal network; Conv: convolution; ROI: regions of interest; Fc layers: fully connected layers; l_i, r_i, t_i, and b_i respectively represent the offsets from the i-th local feature to the left, right, top, and bottom of the GT box; n_i takes only the values 1 and 0, which respectively indicate that the i-th local feature belongs to the ground truth bounding box or to the background.
Figure 2 Overall structure of the flow chart of D2D
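The idea of averaging over positive local features can be illustrated with a small sketch. It assumes that each positive local feature votes for a full box through its four predicted distances and that the offsets are given in pixels; the model itself may normalize offsets, and `decode_box` and the toy values are hypothetical.

```python
def decode_box(points, offsets, positive):
    """Average the boxes voted by positive local features.

    points   : (x, y) centre of each local feature
    offsets  : (l, t, r, b) predicted distances to the box sides
    positive : 1 if the local feature lies on the object, else 0
    """
    boxes = [
        (x - l, y - t, x + r, y + b)
        for (x, y), (l, t, r, b), keep in zip(points, offsets, positive)
        if keep == 1
    ]
    if not boxes:
        return None  # no positive local feature, so no box can be formed
    n = len(boxes)
    return tuple(sum(box[i] for box in boxes) / n for i in range(4))

# Toy 2 x 2 grid (k = 2) of local features for a box spanning (10, 10) to (30, 50).
points = [(15, 20), (25, 20), (15, 40), (25, 40)]
offsets = [(5, 10, 15, 30), (15, 10, 5, 30), (5, 30, 15, 10), (40, 40, 40, 40)]
positive = [1, 1, 1, 0]  # the last feature is treated as background
print(decode_box(points, offsets, positive))  # (10.0, 10.0, 30.0, 50.0)
```

The background feature carries a poor prediction, but because it is masked out it does not disturb the averaged box, which is the effect the binary positive/negative prediction is meant to achieve.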

Backbone network
In automatic orchard harvesting and yield measurement, traditional network models have used ResNet-101 as the backbone network. To improve the operating efficiency of the equipment, the new model instead uses MobileNetV2, a lightweight convolutional neural network with fewer parameters, less computation, and faster operation. The network is suitable for mobile devices (such as harvesting robots), and its implementation reduces the need for embedded hardware to access main memory, as shown in Figure 3. In contrast to the traditional Residual Network [25] structure, MobileNetV2 uses an inverted residual structure with a linear bottleneck, which extracts image features by first increasing and then decreasing the channel dimension to obtain more channel information.
Note: ReLU: Rectified Linear Units.
In the first part, the expansion layer uses a 1×1 convolutional layer to map the low-dimensional space to a high-dimensional space. The second part is a depthwise separable convolution, which is used to extract features. In the third part, the projection layer also uses a 1×1 convolutional layer to map the high-dimensional space back to a low-dimensional space. MobileNetV2 is a lightweight convolutional neural network with low dimension, a small amount of computation, and high speed in the convolutional layers. However, the accuracy of lightweight networks is relatively low. To balance efficiency and accuracy, MobileNetV2 also performs high-dimensional feature extraction on top of its speed optimizations, and it can improve recognition accuracy by inserting a linear bottleneck after the convolution module to capture the features of the ROI. This lightweight high-dimensional feature extraction reduces the memory requirements of the model. MobileNetV2 is followed by an FPN to achieve multi-scale feature fusion. The D2D model thus uses a MobileNetV2+FPN backbone to extract high-dimensional image features, which improves the accuracy of the model.
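A rough weight count shows why the depthwise separable step keeps the expanded, high-dimensional part of the block cheap. The sketch below is only arithmetic under stated assumptions: a 3×3 kernel and an expansion factor of 6 (the common MobileNetV2 defaults), hypothetical channel widths, and no bias or batch normalization terms.

```python
def standard_conv_params(c_in, c_out, kernel=3):
    """Weights in a standard kernel x kernel convolution (bias omitted)."""
    return c_in * c_out * kernel * kernel

def depthwise_separable_params(c_in, c_out, kernel=3):
    """Weights in a depthwise kernel x kernel step followed by a 1x1 pointwise step."""
    depthwise = c_in * kernel * kernel  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution mixes the channels
    return depthwise + pointwise

def inverted_residual_params(c_in, c_out, expansion=6, kernel=3):
    """Weights in an expand -> depthwise -> project block (bias and batch norm omitted)."""
    hidden = c_in * expansion
    expand = c_in * hidden                # 1x1 expansion convolution
    depthwise = hidden * kernel * kernel  # depthwise convolution on the expanded width
    project = hidden * c_out              # 1x1 linear projection
    return expand + depthwise + project

print(standard_conv_params(192, 192))        # 331776
print(depthwise_separable_params(192, 192))  # 38592
print(inverted_residual_params(32, 32))      # 14016
```

With these assumed widths, a whole inverted residual block that works at 192 hidden channels needs far fewer weights than a single standard 3×3 convolution at that width.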

Regression branch
For the multiple green target fruit region proposals generated by the RPN structure, ROI features with k×k adjacent local feature spaces are obtained by ROI Align mapping, and the offsets of the local features are regressed by a fully convolutional network. The calculation of local feature offsets in the regression branch is shown in Figure 4a. The GT (ground truth) box represents the real target box, and the ROI feature is divided into k×k local features, let p_i(x_i, y_i), i∈[
The regression branch improves both the detection accuracy and the efficiency of the new method. It calculates the offsets from every local feature of the region proposal of the target fruit. Compared with Faster RCNN [26], which takes the center point of the region proposal as the basis of calculation, the new model locates the target region more accurately. Compared with the Fully Convolutional One-Stage Object Detection (FCOS) [27] network, which regresses offsets at every pixel, the new method regresses offsets only at local features, which greatly reduces the amount of computation and improves training efficiency. The new method therefore balances detection accuracy and efficiency well, which benefits real-time operation in the orchard.

Classification branch
Inspired by Deformable ROI Pooling [28], the classification branch of the D2D model uses fully connected (fc) layers to predict the offsets of the four sampling points in each sub-region of the ROI, and then uses adaptive weighting to assign higher weights to discriminative sampling points so as to obtain more accurate features. The overall structure of the classification branch is shown in Figure 2 and can be divided into two parts: ROI Align and adaptive weight allocation.
The overall framework of ROI Align in the classification branch is similar to that of Deformable ROI Pooling, but the D2D model uses a lighter-weight design to predict the offsets of the sampling points. An ROI Align operation of size k/2×k/2 divides the ROI feature of each target fruit region proposal into k/2×k/2 bins, so the offset prediction needs only 1/4 of the parameters of the standard offset prediction. Unlike ROI Pooling, which rounds coordinates to integers, ROI Align keeps fractional coordinates in the feature mapping and computes the final value by bilinear interpolation. The two quantization steps are thereby removed, which avoids the regression error caused by quantization.
The fully connected layers learn the offsets, and the ROI feature is shifted horizontally and vertically according to the ratio of length to width. Adding offsets to the sampling points of the convolution kernel on the feature map changes the size of the receptive field and lets the kernel take a polygonal shape, so that more effective feature information can be obtained. Finally, ROI Align maps the feature maps corresponding to ROIs of different sizes to a fixed size of 2k×2k.
Note: "•" represents the sampling points; f_1 represents the sampling points after pooling.
The target fruit region proposal and the sampling points with offsets are used as the input of the convolutional layer, and ROI features with discriminative information are obtained by adaptive weighted pooling, as shown in Figure 5. In the ROI feature of size 2k×2k, four sampling points are set in each sub-region, and the value at each sampling point is calculated by bilinear interpolation. M represents the feature in the ROI, that is, M∈ROI^(2k×2k). The weight of each sampling point is predicted by a convolutional network and denoted W(M)∈ROI^(2k×2k). W(M) represents the discriminative ability of the sampling points in the 2k×2k sub-regions, and sampling points with discriminative features are given higher weights.
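The bilinear interpolation that lets ROI Align read a feature map at fractional coordinates can be sketched as below. This is a generic illustration on a single-channel map stored as a list of rows, not code from the model; `bilinear_sample` is a hypothetical helper.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at fractional location (x, y)."""
    height, width = len(feature), len(feature[0])
    # Clamp so that the four neighbours stay inside the map.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(feature, 0.5, 0.5))  # 1.5
```

Because the sampling location is never rounded, a learned offset of a fraction of a cell still changes the sampled value smoothly.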

M̃ represents the weighted ROI feature as follows:
$$\tilde{M} = W(M) \odot M$$
where ⊙ denotes the Hadamard product and the weight W(M) of each sampling point is obtained from M. M̃ is then average-pooled with a stride of 2, which maps the ROI feature back to size k×k. Finally, two fully connected layers serve as the classifier to obtain the classification score of the region proposal. In this way, the method adaptively assigns higher weights to the more discriminative sampling points and obtains highly discriminative features, which improves the accuracy and efficiency of binary classification of the target fruit.
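The weighting and pooling step can be written out for a single-channel ROI feature. In the sketch below the weights are supplied by hand, whereas in the model they are predicted by a convolutional network; `adaptive_weighted_pool` is a hypothetical name and the 2×2 example corresponds to k = 1.

```python
def adaptive_weighted_pool(m, w):
    """Weight a 2k x 2k ROI feature element-wise, then average-pool it to k x k."""
    size = len(m)
    assert size % 2 == 0 and all(len(row) == size for row in m)
    # Hadamard product: each sampling point is scaled by its own weight.
    weighted = [[m[i][j] * w[i][j] for j in range(size)] for i in range(size)]
    k = size // 2
    pooled = []
    for i in range(k):
        row = []
        for j in range(k):
            # Average pooling over a 2 x 2 window with a stride of 2.
            window = [weighted[2 * i + a][2 * j + b] for a in (0, 1) for b in (0, 1)]
            row.append(sum(window) / 4.0)
        pooled.append(row)
    return pooled

m = [[1.0, 2.0],
     [3.0, 4.0]]
w = [[1.0, 0.0],
     [0.0, 1.0]]  # only two of the four sampling points are treated as discriminative
print(adaptive_weighted_pool(m, w))  # [[1.25]]
```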

Loss function
The design of the loss function directly affects the performance of the model and plays an important role in its iterative optimization. When training the model, the loss function is defined first. A predicted value is then obtained by forward propagation, and the loss value is obtained by comparing the prediction with the real sample. Finally, back propagation is used to update the weights, and the loss is iteratively minimized to obtain the desired detection model. The overall loss function of the D2D model is composed of the loss functions of the regression and classification modules, namely the regression loss and the classification loss. When the predicted value is close to the true value, the loss is low; when the difference between the predicted value and the true value is close to 1, the loss is high. The D2D model uses the cross entropy loss function in the classification process and the binary cross entropy loss function in the regression process. In the total loss function of the new model, L_D2D represents the overall loss function of the D2D model; L_cls represents the loss function of the classification branch; L_reg represents the loss function of the regression branch; x_i represents the predicted probability that sample i is positive; and y_i represents the label of sample i, which is 1 for a positive sample and 0 for a negative sample.
In a binary classification model, the raw output of the neural network is neither exactly 1 or 0 nor a probability value. The sigmoid function is therefore used as the activation, mapping the output to the probability that the sample is positive. The sigmoid function normalizes the output to the range (0,1). An advantage of pairing the cross entropy loss with the sigmoid function is that it avoids the slowdown in learning that the Mean Square Error loss suffers during gradient descent, because the size of the update is controlled by the output error. The equation of the sigmoid function is as follows:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where z represents the input of the sigmoid function, z∈(−∞, +∞).
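One form consistent with the symbol definitions above is given below. It is a sketch under two assumptions that the text does not state: the two branch losses are summed without weighting, and the binary cross entropy is averaged over N samples.

```latex
L_{D2D} = L_{cls} + L_{reg},
\qquad
L_{reg} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[\, y_i \log x_i + (1 - y_i) \log (1 - x_i) \,\right]
```

With this form, a confident correct prediction (x_i close to y_i) contributes a loss near zero, while a confident wrong prediction contributes a large loss, which matches the behaviour described in the paragraph.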
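The link between the sigmoid activation and the cross entropy loss can be checked numerically. The sketch below is a generic illustration with made-up scores and labels, not the training code of the model; it uses the standard result that the gradient of the mean binary cross entropy with respect to each raw score is (x_i − y_i)/N.

```python
import math

def sigmoid(z):
    """Map a real-valued score to the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for x, y in zip(probs, labels):
        x = min(max(x, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(x) + (1 - y) * math.log(1.0 - x)
    return total / len(probs)

scores = [2.0, -1.0, 0.0]  # raw network outputs (assumed values)
labels = [1, 0, 1]
probs = [sigmoid(z) for z in scores]
loss = binary_cross_entropy(probs, labels)
# Gradient of the loss with respect to each raw score: proportional to the output error.
grads = [(x - y) / len(probs) for x, y in zip(probs, labels)]
```

The gradient contains no sigmoid derivative factor, so a badly wrong prediction still produces a large update, which is the advantage over the Mean Square Error loss mentioned above.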

Model training
The experiment consists of dataset processing, model training, model testing, and other main parts; the whole flow is shown in Figure 6. The setting of model parameters is also important because it directly affects how well the model can be optimized. The D2D model was initialized with weights pre-trained on the COCO dataset, which helps to stabilize the loss function and improve the training accuracy. During model training, the initial learning rate was set to 0.0025, the batch size to 24, the maximum number of iterations to 150, the weight decay to 0.0001, and the momentum to 0.9. Here, an iteration refers to one pass of all the data through the network, completing a forward calculation and back propagation (often called an epoch). A model usually needs multiple iterations to converge, but more iterations are not always better: too much training may lead to overfitting, so the trained model must be tested and evaluated to find the best fit.
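To show how the learning rate, momentum, and weight decay interact, the sketch below applies one parameter update. It assumes stochastic gradient descent with momentum, which the text does not name explicitly, and uses the formulation found in common deep learning libraries; the weights and gradients are made-up values.

```python
def sgd_step(weights, grads, velocity, lr=0.0025, momentum=0.9, weight_decay=0.0001):
    """One SGD update with momentum and L2 weight decay."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay adds a pull towards zero
        v = momentum * v + g       # momentum accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity

weights, velocity = [1.0, -2.0], [0.0, 0.0]
grads = [0.5, -0.5]
weights, velocity = sgd_step(weights, grads, velocity)
```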

Evaluation metrics
To quickly discover possible problems during training and iteratively optimize the model, the new model needs to be evaluated. In this paper, average precision (AP) and average recall (AR) were used to evaluate the object detection performance of the model. Precision is the ratio of correctly detected targets to all detected targets, and recall is the ratio of correctly detected targets to all targets that should have been detected. The equations for precision and recall are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP represents the number of detection boxes with Intersection over Union (IOU)>0.5; FP represents the number of detection boxes with IOU≤0.5; and FN represents the number of GT boxes that were not detected. The IOU measures the overlap between the real box and the prediction box (intersection area divided by union area), and the values of precision and recall change with the IOU threshold. Taking recall as the abscissa and precision as the ordinate, a Precision-Recall (P-R) curve can be drawn. The area under the P-R curve is the Average Precision (AP), which can be expressed as follows:
$$\mathrm{AP} = \int_0^1 P(r)\,\mathrm{d}r \tag{8}$$
In this integral, P(r) is the precision as a function of the recall r.
In practice, this integral is approximated by multiplying the precision at each threshold by the corresponding change in recall and summing the products over all thresholds. The formula is as follows:
$$\mathrm{AP} = \sum_{k=1}^{S} P(k)\,\Delta r(k)$$
where S represents the number of all images in the test set; P(k) represents the precision value when k images can be recognized; and ∆r(k) represents the change in recall when the number of identified images changes from k−1 to k.
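The definitions of IOU, TP, FP, and FN can be made concrete with a short sketch. It assumes boxes in (x1, y1, x2, y2) form, detections already sorted by confidence, and a simple greedy one-to-one matching; the matching rule and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truths, threshold=0.5):
    """Greedy one-to-one matching; detections are assumed sorted by confidence."""
    matched = set()
    tp = 0
    for det in detections:
        best, best_j = 0.0, None
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue
            overlap = iou(det, gt)
            if overlap > best:
                best, best_j = overlap, j
        if best_j is not None and best > threshold:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp      # detections without a matching GT box
    fn = len(ground_truths) - tp   # GT boxes that were never matched
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(1, 1, 10, 10), (50, 50, 60, 60)]
print(precision_recall(dets, gts))  # (0.5, 0.5)
```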
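The discrete approximation of the area under the P-R curve can be sketched as follows. The precision and recall values are a made-up ranking of four detections against four ground-truth fruits, used only to show the accumulation.

```python
def average_precision(precisions, recalls):
    """Approximate the area under the P-R curve by summing P(k) * delta r(k)."""
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - previous_recall)  # precision times the change in recall
        previous_recall = r
    return ap

# Precision and recall after each successive detection, ranked by confidence.
# The second detection is a false positive, so recall does not change there.
precisions = [1.0, 0.5, 2 / 3, 0.75]
recalls = [0.25, 0.25, 0.5, 0.75]
ap = average_precision(precisions, recalls)
```

Steps at which recall does not increase contribute nothing to the sum, so false positives lower AP only through the reduced precision of later true positives.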

Green fruit detection results
During training, the effect of real orchard scenes on fruit detection was fully considered. After training, the optimal D2D model was selected based on mAP and mAR and was tested on the two datasets of green persimmons and green apples. The test results were analyzed visually for images taken in complex environments such as occlusion, overlap, rainy days, backlight, and night. The detection results for persimmons and apples are shown in Figures 7a and 7b, respectively. The average precision and average recall of the model on persimmon and apple are listed in Table 2 for comprehensive evaluation.
The complex orchard environment and the similarity between leaf color and fruit color pose serious challenges to object detection. As shown in Figure 7, the proposed model detects the fruit accurately, with almost no false or missed detections. Even fruits whose contours were unclear because of heavy shading or overlap, and small target fruits at night, could be detected, and the color of the background leaves had little influence on the detection of green fruit. Table 2 shows the average precision and average recall of persimmons and apples under different IOU thresholds, sizes, and quantities. Figure 8 shows the P-R curves of target fruits of different sizes at IOU=0.5. To draw each curve, 101 evenly spaced recall values were taken along the abscissa, and the precision corresponding to each of them was calculated.
Combined with Table 2, it can be seen that close-range, unshaded fruits are detected better than distant, severely occluded, or overlapping fruits. Distant fruits are detected poorly because the targets are small, their outlines are unclear, and they overlap with leaves, which poses a great challenge to fruit recognition. The density of the fruit also influences detection: sparse fruits have clear and complete outlines, while dense fruits are prone to occlusion and overlap, which makes detection more difficult. On the whole, the model detects both kinds of fruit well. Although the complex orchard environment has some impact on recognition of the target fruit, the proposed model still shows high accuracy and strong robustness.
Note: small fruits are smaller than 32², medium-sized fruits are between 32² and 64², and large fruits are larger than 64².
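The 101-point sampling of the P-R curve can be sketched in the style of COCO evaluation. The envelope step (making precision non-increasing in recall) is an assumption borrowed from that convention, and the input values are made up.

```python
def ap_101_point(precisions, recalls):
    """Mean interpolated precision at recall 0.00, 0.01, ..., 1.00."""
    # Make precision non-increasing in recall (the usual envelope step).
    envelope = list(precisions)
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])
    total = 0.0
    for step in range(101):
        level = step / 100.0
        # Precision at the first operating point whose recall reaches this level;
        # it stays 0 if that recall level is never reached.
        value = 0.0
        for p, r in zip(envelope, recalls):
            if r >= level:
                value = p
                break
        total += value
    return total / 101.0

precisions = [1.0, 0.5, 2 / 3, 0.75]
recalls = [0.25, 0.25, 0.5, 0.75]
ap101 = ap_101_point(precisions, recalls)
```

Recall levels above the highest recall the detector reaches count as zero precision, so missed fruits lower the score even when every reported detection is correct.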
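The size grouping can be expressed as a small helper. The sketch assumes that size is measured as bounding-box area in pixels and that the boundary values fall into the medium group; neither detail is stated in the text, and `size_category` is a hypothetical name.

```python
def size_category(width, height, small_max=32 ** 2, medium_max=64 ** 2):
    """Classify a fruit by bounding-box area using the 32^2 and 64^2 thresholds."""
    area = width * height
    if area < small_max:
        return "small"
    if area <= medium_max:
        return "medium"
    return "large"

print(size_category(20, 20))    # small
print(size_category(40, 40))    # medium
print(size_category(100, 100))  # large
```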

Comparisons
To analyze the performance of the model objectively, the proposed model was compared with representative computer vision models: Faster RCNN, Mask RCNN [29], FCOS, and SSD [30]. All experiments used the same parameters and processing steps, and the persimmon dataset was used to evaluate and compare average precision and average recall. The evaluation results are listed in Table 3. The comparison models fall into two categories: anchor-based and anchor-free. Faster RCNN, Mask RCNN, and SSD are classic and stable anchor-based object detection algorithms (Mask RCNN can be regarded as a combination of object detection and semantic segmentation; only its object detection part was compared here). FCOS is a newer anchor-free algorithm, which reduces the dependence on anchor boxes in the detection process and improves the speed and accuracy of the model. As can be seen from Table 3, the average recall and average precision of the D2D model are 80.4% and 73.4%, respectively, higher than those of the other models. Compared with the anchor-free FCOS model, the average recall and average precision of the D2D model are 7.7% and 7.4% higher, respectively. Its average precision is similar to that of Mask RCNN, but its average recall is 2.5% higher, so the D2D model still has an advantage. These results show that the D2D model detects the target fruit well. Moreover, the model uses the lightweight backbone network MobileNetV2, which is expected to make it more efficient than the other models.
The persimmon and apple datasets used in this study come from a complex environment and contain a large number of overlapping and small target fruits. Nevertheless, the D2D model detects the target fruits well, which meets the real-time operation requirements of orchard fruit detection equipment.

Conclusions
This study aimed to develop an object detection model for complex orchard environments and put forward an efficient and accurate D2D-based detection model for green target fruit. The proposed model adopted an anchor-free structure, which avoided the dependence on anchors in the detection process, improved the detection accuracy of the model, and makes the model applicable to a wide range of agricultural settings. The lightweight MobileNetV2 backbone network was introduced to reduce the number of parameters and the amount of computation, which improved the operating efficiency of the model. In the regression process, regression was carried out on the local features of the ROI, and the local features were then judged as positive or negative samples, which improved the operating efficiency and detection accuracy of the model. In the classification process, a lighter-weight offset prediction was used to obtain highly discriminative features for the binary classification of green target fruit, which improved the detection accuracy and robustness of the model. The proposed model was trained and validated on the immature persimmon and apple datasets, and further analysis and comparison were made by ablation experiments.
The experimental results showed that the proposed model achieved higher average recall and average precision than classical object detection models. The model was validated well on the persimmon dataset but still has room for improvement, which can be summarized as follows: 1) the experimental dataset is relatively small, and a larger dataset can be used for training and verification in the future; 2) although the model is an object detection model, it could be extended to instance segmentation by introducing a mask when judging positive and negative samples of local features.