Impact of dataset on the study of crop disease image recognition

Abstract: Datasets are central to image recognition research based on machine learning. Advanced methods such as deep learning and transfer learning depend especially heavily on the datasets used to train models, and dataset quality directly affects their final performance. In crop disease image recognition, datasets are currently scarce because of the complexity of the agricultural environment and the variety of crops. A growing number of studies therefore adopt transfer learning, which compensates for the lack of data in the target domain with the help of other datasets. In these methods, the choice of auxiliary domain dataset has a great impact on the modeling effect in the target domain. To clarify the impact of datasets on crop disease image recognition, this study used different deep neural network frameworks to compare the effects of different datasets in crop disease image recognition based on transfer learning. The selected datasets are PlantVillage and the Image Database for Agricultural Diseases and Pests Research (IDADP), both widely used in recent studies. The selected deep neural network frameworks are ResNet50, InceptionV3, and EfficientNet. In the method of this study, the datasets were first preprocessed, including data enhancement. After the auxiliary domain and the target domain were divided, the selected frameworks were used to pre-train models on the auxiliary domain dataset. Finally, the parameter-based transfer learning method was used to construct the corresponding crop disease recognition model in the target domain. In the experiments, multiple datasets and models were tested and compared. The results show that when the test samples and training samples come from consistent scenarios, the recognition accuracy of the different network frameworks on multiple test sets is generally high. When the scenarios of the test samples and training samples are inconsistent, none of the network models obtains ideal recognition results on the test sets. For recognizing crop disease images collected from the actual cultivation environment, modeling with the IDADP dataset is better and has more practical value in real applications of crop disease image recognition.


Introduction 
Crop diseases and pests are among the main factors affecting agricultural yield [1]. Their scope and severity have caused significant losses to the national economy, especially to agricultural yield [2]. In 2020, State Council Order No. 725 of the People's Republic of China, "Regulations on the prevention and control of crop diseases and pests", noted that crop diseases and pests have occurred frequently with climate change and the rise of multiple cropping indexes.
Traditional methods can no longer meet the needs of disease and pest prevention and control in modern agriculture, which requires more intelligent technologies such as intelligent image recognition of crop diseases. This technology analyzes disease image information by combining machine learning, image processing, plant pathology, and other technical means to obtain disease identification features and models and to quickly identify the types of diseases. It can provide farmers with disease prevention information and improve agricultural production efficiency.
Since the 1980s, researchers at home and abroad have carried out extensive research on crop disease image recognition using various machine learning methods, including clustering methods [3,4], classifier methods [5,6], and shallow neural network methods [7,8]. However, crop disease image recognition based on these traditional machine learning methods has the following problems. First, these methods depend heavily on the quality of the original disease image samples; that is, they impose strict requirements on the image acquisition environment and methods. Second, the processing pipeline is complex, comprising preprocessing, image segmentation, feature extraction, and classifier construction, some of which still need further study to improve processing accuracy. Third, when the number of samples is large, it is difficult to build effective models with these traditional machine learning methods.
In recent years, artificial intelligence technology has developed rapidly. Advanced machine learning methods represented by deep learning can overcome the above problems in intelligent crop disease image recognition and have received increasing attention [9,10]. These methods extract image features automatically and reduce the dependence on professional and technical personnel. However, because of the complexity of the agricultural cultivation environment and the diversity of crops, the scale of crop disease image datasets often cannot meet the needs of deep learning modeling. Most studies have therefore introduced transfer learning to solve this problem [11], making up for the lack of data in the target domain with the help of other datasets. In these studies, a relevant auxiliary domain is selected first, a pre-training model is constructed on the existing dataset of the auxiliary domain, and the model is then fine-tuned on the data of the target domain to obtain the classification model for that domain. Edna et al. [12] used several deep convolutional neural networks (VGG16, InceptionV4, ResNet, and DenseNets) to construct pre-training models on the PlantVillage database and evaluated each on the classification and recognition of plant diseases; the DenseNets stacked model performed best. Yuan et al. [13] proposed a small-sample crop disease image recognition method based on parameter transfer. Their experimental results show that the method outperforms both deep learning trained from scratch and the traditional Support Vector Machine (SVM) method in small-sample disease image recognition. Barbedo et al. [14] used the GoogLeNet architecture for model pre-training to recognize more than 40 000 crop images captured by several different sensors, e.g., smartphones, pocket cameras, and digital SLR cameras, with accuracy ranging from 75% to 100%. Yang et al. [15] introduced a lesion detection mechanism based on a fast Region-Convolution Neural Network (R-CNN) into the small-sample transfer learning method, which improved the detection accuracy of the model.
Existing studies show that the datasets used for modeling are very important in crop disease image recognition based on advanced methods such as deep learning and transfer learning [16]. In transfer learning methods especially, the selection of the auxiliary domain dataset directly affects the final effect. Because of the particularity of agricultural disease images, transfer learning improves only when the auxiliary dataset closely resembles agricultural disease images [17]. However, few studies have examined the impact of datasets on the final effect of these methods. Arnal [18] discussed the impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification, mainly covering background removal in different situations, limitations of datasets, and practical applications. Unlike that work, we mainly discuss the impact of datasets with different image backgrounds on the construction of different pre-training models and, from a practical point of view, their impact on the recognition of crop disease images taken in the actual cultivation environment.
To clarify the impact of the dataset, especially the image background, on the study of crop disease image recognition, this paper selects two large agricultural disease image datasets with significantly different image backgrounds: PlantVillage [19] and IDADP [20]. On these two datasets, different pre-training models are built with several commonly used network frameworks, namely ResNet50, InceptionV3, and EfficientNet, and the model-based transfer learning method is used to carry out detailed experimental comparisons of the impact of datasets on crop disease image recognition. The results are expected to provide a reference for applying crop disease image recognition in the actual cultivation environment. The experimental results show that the recognition accuracy of crop disease images is greatly affected by the scene in which the training samples were captured. For the task of identifying crop disease images actually collected in the cultivation environment, the model constructed using the IDADP dataset has a better recognition effect.

Materials
In this study, the following two large crop disease image datasets were selected, mainly as training sets for machine learning methods to build crop disease recognition models. Some examples from these two datasets are shown in Figure 1.
PlantVillage [19]: This dataset contains 38 categories of diseases on 15 species of plants, with a total of 53 819 images. All images in this dataset were taken in a laboratory environment, with low resolution and simple backgrounds.
Image Database for Agricultural Diseases and Pests Research (IDADP) [20]: This dataset contains more than 50 000 high-resolution images with 52 categories of diseases on 11 species of crops. All images in this dataset were captured in fields or greenhouses, with complex backgrounds in the actual cultivation environment.
In the experiments, five categories contained in both datasets were chosen as the target datasets: maize common rust, maize northern leaf blight, maize healthy leaf, tomato early blight, and grape healthy leaf. The number of images of each category in the target datasets is listed in Table 1.

Data enhancement and preprocessing
Building a crop disease image recognition model with a deep neural network requires a large number of training samples to achieve high classification accuracy. Although the two datasets selected in this paper contain many samples, the training samples were further expanded through data enhancement to improve the quality of modeling. Specifically, the original images were first rotated by 90°, 180°, and 270°. Second, the images were transformed from random perspectives, including up-down flips and left-right flips. Finally, the images were scaled to 224×224 pixels by bilinear interpolation.
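The three enhancement steps can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in this study: the function names are hypothetical, images are treated as single-channel lists of rows (a color image would be processed per channel), and the rule for choosing a flip (one random flip per rotated variant) is an assumption, since the text does not specify how the random transforms were applied. In practice an image library would normally be used instead.

```python
import random


def rotate90(img):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_up_down(img):
    """Mirror the image vertically (reverse the row order)."""
    return [list(row) for row in img[::-1]]


def flip_left_right(img):
    """Mirror the image horizontally (reverse each row)."""
    return [list(row[::-1]) for row in img]


def resize_bilinear(img, out_h, out_w):
    """Resize with bilinear interpolation, aligning pixel centers."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        # Map the output pixel center back into input coordinates and clamp.
        sy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for x in range(out_w):
            sx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def augment(img, size=224, rng=None):
    """Return the original, its 90/180/270 degree rotations, and one randomly
    chosen flip of each (assumed rule), all resized to size x size."""
    rng = rng or random.Random(0)
    variants = [img]
    rotated = img
    for _ in range(3):
        rotated = rotate90(rotated)
        variants.append(rotated)
    flipped = [
        flip_up_down(v) if rng.random() < 0.5 else flip_left_right(v)
        for v in variants
    ]
    return [resize_bilinear(v, size, size) for v in variants + flipped]
```

Under these assumptions, calling `augment(image)` on one image yields eight 224×224 samples.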

Framework based on transfer learning
In this study, transfer learning is used to build the required classification and recognition models. As shown in Figure 2, traditional machine learning methods require an independent learning system for each task, whereas transfer learning transfers trained model parameters or learned knowledge to the target domain to help construct new models, so that better models can be learned in the target domain without large-scale annotated data [21]. At present, crop disease image datasets are very scarce, so building a crop disease image classification model from these limited datasets fits the application scenario of transfer learning well.
Figure 2 Different learning processes between traditional methods and transfer learning (a. Traditional machine learning; b. Transfer learning)
According to what is transferred, transfer learning methods can be divided into instance-based, feature-based, parameter-based, and relation-based categories. In crop disease image recognition, the parameter-based transfer method is currently studied more often. In this method, an appropriate deep network model is first used to construct a pre-training model on the auxiliary dataset, and the obtained model or parameters are then transferred to the target domain to improve the modeling quality. Therefore, in addition to the selection of datasets, building a good pre-training model is very important in this method.
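The idea of parameter-based transfer can be sketched with a toy model. This is only an illustration under stated assumptions: the "network" is a dictionary with a shared `backbone` (standing in for feature-extraction weights) and a task-specific `head`, and the names `init_model` and `transfer_parameters` are hypothetical. The text does not state which layers were transferred or whether any were frozen, so the sketch simply copies the backbone and re-initializes the head for the target classes.

```python
import copy
import random


def init_model(num_features, num_classes, rng):
    """Toy model: a shared backbone plus one weight row per class in the head."""
    return {
        "backbone": [rng.gauss(0.0, 0.1) for _ in range(num_features)],
        "head": [
            [rng.gauss(0.0, 0.1) for _ in range(num_features)]
            for _ in range(num_classes)
        ],
    }


def transfer_parameters(pretrained, num_target_classes, rng):
    """Build a target-domain model that reuses the pre-trained backbone.

    The head is freshly initialized because the target label set differs
    from the auxiliary one; it would then be fine-tuned on target data.
    """
    num_features = len(pretrained["backbone"])
    target = init_model(num_features, num_target_classes, rng)
    target["backbone"] = copy.deepcopy(pretrained["backbone"])
    return target


rng = random.Random(0)
# Hypothetical sizes: 38 auxiliary classes, 5 target classes.
aux_model = init_model(num_features=8, num_classes=38, rng=rng)
target_model = transfer_parameters(aux_model, num_target_classes=5, rng=rng)
```

Copying rather than sharing the backbone keeps the pre-trained parameters unchanged when the target model is later fine-tuned.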

Model pre-training
In this study, three mainstream deep learning network frameworks, i.e., ResNet50 [22], InceptionV3 [23], and EfficientNet [24], were selected to build the pre-training models. Each network framework is described as follows.
ResNet adds residual learning to the traditional convolutional neural network to address vanishing gradients in deep networks and the degradation of training accuracy. ResNet50 is a ResNet with 49 convolutional layers and one fully connected layer, and it performs well on image classification and recognition tasks.
InceptionV3 is an optimized version of GoogLeNet. It introduces the idea of factorization into small convolutions, splitting a larger two-dimensional convolution into two smaller one-dimensional convolutions, which greatly reduces the number of parameters, speeds up computation, and reduces overfitting. It also optimizes the structure of the Inception module, which increases the network scale more efficiently.
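The parameter saving from factorizing a convolution can be checked with simple arithmetic. The sketch below compares one n×n convolution with a 1×n convolution followed by an n×1 convolution; the kernel size (7) and channel count (192) are illustrative assumptions, not values taken from this study, and biases are ignored.

```python
def conv_params(kernel_h, kernel_w, in_ch, out_ch):
    """Number of weights in a convolution layer, ignoring biases."""
    return kernel_h * kernel_w * in_ch * out_ch


def factorized_params(n, in_ch, out_ch):
    """Weights of a 1 x n convolution followed by an n x 1 convolution,
    assuming out_ch channels between the two layers."""
    return conv_params(1, n, in_ch, out_ch) + conv_params(n, 1, out_ch, out_ch)


full = conv_params(7, 7, 192, 192)          # 7 * 7 * 192 * 192 = 1806336
factored = factorized_params(7, 192, 192)   # 2 * 7 * 192 * 192 = 516096
ratio = factored / full                     # 2 / 7, about 0.29
```

With equal input and output channels, the factorized pair needs 2n weights per channel pair instead of n², so the saving grows with the kernel size.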
EfficientNet uses the compound scaling method to balance the three dimensions of resolution, depth, and width, optimizing both the accuracy and the efficiency of the convolutional network. Compared with many earlier deep learning network frameworks, its main advantage is that it saves computing resources and improves computing efficiency as much as possible while maintaining high model accuracy.
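Compound scaling can be illustrated numerically. In the EfficientNet paper [24], depth, width, and resolution are scaled as α^φ, β^φ, and γ^φ for a single coefficient φ, with α·β²·γ² kept close to 2 so that the computational cost roughly doubles for each unit increase in φ. The sketch below uses the coefficients reported there for the baseline network (α = 1.2, β = 1.1, γ = 1.15); the text of this study does not state which EfficientNet variant or coefficients were used, so these values are an assumption for illustration only.

```python
# Scaling coefficients reported for the EfficientNet baseline (assumed here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Cost grows linearly with depth and quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Scaling all three dimensions together, rather than only one, is what the text refers to as balancing resolution, depth, and width.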
The above three network frameworks have their own characteristics. Therefore, this study selected these three network frameworks to conduct model training on the PlantVillage dataset and the IDADP dataset respectively, and then used these pre-training models for the classification and recognition of crop disease images.
Finally, through specific experiments, the classification accuracies of these pre-training models on different target datasets were tested and comparative analysis and discussion were given. The process of model pre-training and testing is shown in Figure 3.

Experimental setup
In the experiments, five crop disease categories in the PlantVillage and IDADP datasets were selected as the target datasets: maize common rust, maize northern leaf blight, maize healthy leaf, tomato early blight, and grape healthy leaf. To improve the classification accuracy, the data of each of these five categories were divided into two parts at a ratio of 9:1; the former part was used, together with the other disease data of the same dataset, to construct the pre-training model, and the latter was used as the target test set. The number of images of each category in the target datasets is given in Table 1 in Section 2.1.
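The 9:1 division can be sketched as follows. The function name `split_9_to_1` is hypothetical, and shuffling before the split (with a fixed seed for reproducibility) is an assumption; the text states only the ratio, not how images were assigned to each part.

```python
import random


def split_9_to_1(items, seed=0):
    """Shuffle the items and split them 9:1 into (pre-training part, test part)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids rounding surprises
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names for one disease category.
images = [f"maize_common_rust_{i:04d}.jpg" for i in range(100)]
pretrain_part, target_test = split_9_to_1(images)
```

The split is applied per category, so each disease keeps the same 9:1 proportion in the pre-training data and in the target test set.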
The average classification accuracy (ACC) of the experimental results was used as the evaluation standard, and the calculation formula is as follows:
$$\mathrm{ACC} = \frac{1}{t}\sum_{i=1}^{t}\frac{n_c}{n_i} \times 100\%$$
where $t$ is the number of model tests, $n_c$ is the number of correctly classified samples, and $n_i$ is the total number of test samples.
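The evaluation measure can be sketched in a few lines, assuming that ACC is the mean, over the t model tests, of the fraction of correctly classified samples in each test, expressed as a percentage. The function name and the example counts are hypothetical.

```python
def average_accuracy(runs):
    """Average classification accuracy in percent.

    runs: list of (n_correct, n_total) pairs, one pair per model test.
    """
    if not runs:
        raise ValueError("at least one test run is required")
    return 100.0 * sum(correct / total for correct, total in runs) / len(runs)


# Two hypothetical tests: 9 of 10 and 8 of 10 samples classified correctly.
acc = average_accuracy([(9, 10), (8, 10)])  # about 85.0
```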

Results and analysis
The three network frameworks ResNet50, InceptionV3, and EfficientNet were each used to build pre-training models. The pre-training models obtained on the PlantVillage dataset are denoted P_ResNet50, P_InceptionV3, and P_EfficientNet, and those obtained on the IDADP dataset are denoted I_ResNet50, I_InceptionV3, and I_EfficientNet. These models were then used to classify and recognize the disease images of the target datasets from PlantVillage and IDADP. The result on each target dataset is listed in Table 2.
The experimental results were analyzed in detail as follows. First, the pre-training models constructed by different network frameworks on the same dataset were compared. Figure 4 gives the results on the PlantVillage dataset and the IDADP dataset, where the target datasets from PlantVillage and IDADP are marked with (P) and (I) after the name of the crop disease dataset. As shown in Figure 4a, the three pre-training models built on PlantVillage have good classification accuracies (from 95% to 100%) on the target datasets from PlantVillage, with small differences among them. On the target datasets from IDADP, their performance differs more, especially on the tomato early blight dataset, where the pre-training model based on ResNet50 achieves the best result. Similar results can be seen in Figure 4b: for the three pre-training models built on IDADP, the classification accuracies on the target datasets from IDADP are better and more stable than those on the target datasets from PlantVillage. The performance differences on the target datasets from PlantVillage are more significant, for example on the datasets of maize common rust, maize northern leaf blight, tomato early blight, and grape healthy leaf, where the pre-training model based on EfficientNet achieves the best results.
In general, when the samples used to construct the pre-training model and the test samples come from the same dataset, the pre-training models of all three network frameworks have high classification accuracies on multiple test sets. Conversely, when they come from different datasets, the classification accuracies of these pre-training models on the two target datasets differ considerably. This shows that different datasets have a great impact on crop disease image recognition based on transfer learning.
Second, a more detailed comparison of the transfer learning effect of different datasets was therefore conducted for pre-training models based on the same network framework. Figure 5 shows the experimental results of the pre-training models based on ResNet50, InceptionV3, and EfficientNet. For the pre-training model based on ResNet50 (Figure 5a), transfer learning from PlantVillage to the target datasets of IDADP works relatively well, except on the maize common rust dataset; in contrast, transfer learning from IDADP to the target test sets of PlantVillage works well only on the maize healthy leaf dataset. Similar results can be seen for the pre-training models based on InceptionV3 (Figure 5b) and EfficientNet (Figure 5c). From these figures, the following conclusions can be drawn about transfer learning between different datasets. First, if transfer learning from PlantVillage to IDADP works well, so does transfer learning in the opposite direction, e.g., on the maize healthy leaf dataset. Second, if transfer learning within the same dataset does not work well, transfer learning between different datasets works even worse, e.g., on the maize common rust dataset.
Furthermore, to consider only the impact of the dataset and the pre-training model on the effect of transfer learning, the experimental results on different test sets were averaged. Figure 6 gives the average recognition accuracies on the two test sets of PlantVillage and IDADP under each pre-training model. The abscissa represents the different pre-training models, and the bars P_average and I_average represent the averages of the experimental results on the PlantVillage and IDADP test sets, respectively. Two observations can be made from Figure 6. First, when the modeling data and the test data come from the same dataset, the recognition accuracy is better, and the differences among pre-training models on the same test set are not very obvious. In general, the pre-training models based on the PlantVillage dataset perform better on the PlantVillage test set, because the simple image backgrounds cause less feature interference. Second, when the modeling data and test data come from different datasets, the experimental results differ. The recognition accuracy of the pre-training models based on the PlantVillage dataset on the IDADP test set is better than that of the pre-training models based on the IDADP dataset on the PlantVillage test set. The former is best under the ResNet50 network framework, and the latter under the EfficientNet network framework.
Figure 6 Average accuracies of pre-training models

Discussion
To discuss the impact of different datasets on the study of crop disease image recognition, especially the universal applicability of different datasets in different scenarios, the methods and experiments in this study did not use extensive manual preprocessing, e.g., image segmentation or target labeling. The results above show that the impact of the dataset on crop disease image recognition is significant. Compared with the impact of the dataset used in modeling on the final results, the differences among deep learning frameworks are very small, or even negligible.
On the one hand, for crop disease recognition based on transfer learning, the image backgrounds in the PlantVillage dataset are simple and introduce little background feature interference in the modeling process, so the effect of transfer learning is better. The comparative experiments in this paper show that although disease images with simple backgrounds do not reflect the actual cultivation environment, the features learned from them transfer well to crop disease image recognition tasks with the same kind of background. When crop disease image datasets captured in the actual cultivation environment are insufficient, using image datasets with simple backgrounds for transfer learning remains a good solution under existing conditions and has practical value.
On the other hand, accurately recognizing disease images in the actual cultivation environment remains a challenge. Although the pre-trained model built with the IDADP dataset is better at recognizing crop disease images in the actual cultivation environment, such datasets are still very scarce. Continuing to build such datasets is therefore very important work, and image recognition technology also requires further exploration.

Conclusions
Intelligent recognition of crop disease and pest images is very important for improving the efficiency of their prevention and control. In the near future, with the development and support of 5G technology, using video image analysis or mobile phones to capture and identify diseases or pests in the cultivation environment will become the norm. Quickly constructing recognition models from existing datasets with transfer learning has become a common approach. To better study the impact of crop disease image samples on recognition accuracy, this paper focuses on two large datasets, PlantVillage and IDADP, and performs pre-training and parameter transfer under three network frameworks, ResNet50, InceptionV3, and EfficientNet, to obtain the recognition accuracies of different classification models.
The analysis and discussion show that the simple-background test set is recognized better by models built on the PlantVillage dataset, with accuracy above 95%. The complex-background test set (images captured in the actual cultivation environment) is recognized better by models built on the IDADP dataset, with accuracy reaching 89%. The recognition accuracies of the simple-background test set with models built on IDADP and of the complex-background test set with models built on PlantVillage are greatly reduced. This indicates that the recognition accuracy of crop disease images is greatly affected by the scene in which the training samples were captured. For the task of identifying crop disease images collected in the actual cultivation environment, the model constructed using the IDADP dataset has a better recognition effect. However, further research is still needed on the recognition of crop disease images with complex backgrounds. On the one hand, as many images of crop diseases as possible should be collected in the actual cultivation environment to expand the existing datasets. On the other hand, better image recognition modeling methods should be studied, such as introducing the attention mechanism to reduce the interference of complex backgrounds or adopting semi-supervised methods to reduce manual intervention, so that crop disease image classification models can generalize better.