Hyperspectral detection of walnut protein contents based on improved whale optimized algorithm

Nondestructive and accurate estimation of walnut kernel protein content is important for food quality grading and for improving the profitability of walnut packinghouses. Hyperspectral imaging offers a potential solution for detecting walnut nutrients because it captures both spectral and textural information. However, the redundancy and heavy computational load of spectral data hinder the widespread application of hyperspectral technology to high-throughput evaluation. For the inversion of walnut kernel protein content from hyperspectral images, this study proposes a novel feature selection method, named the improved whale optimized algorithm (IWOA). In the IWOA, a comprehensive feature selection criterion is applied during the iterative process. The criterion considers the relevance of the spectral information to the target variable, the ability of the selected wavebands to represent the entire spectrum, and the redundancy among the selected wavebands. For the relevance to the target variable, both the amplitude and the shape characteristics of the spectra are taken into account. Eight wavelengths around 996, 1225, 1232, 1377, 1552, 1600, 1691 and 1700 nm were selected as the wavelengths sensitive to walnut protein. These wavelengths are mechanistically related to chemical compounds associated with protein content. Three protein prediction models were then established. The model based on the selected wavelengths gave better results than the one based on the full spectrum. The model that combined spectral and textural information outperformed the models based solely on spectral information and gave the best prediction results. In the calibration group, R² was 0.9047 and the root mean square error (RMSE) was 11.1382 g/kg. In the validation group, R² was 0.8537 and the RMSE was 18.9288 g/kg.
The results demonstrate that combining the wavelengths selected by the IWOA with textural characteristics can effectively estimate walnut protein content. The proposed method can be extended to the detection and inversion of other nutritional variables of nuts.


Introduction 
Walnut is an important woody oil crop and is highly valued for its nutritious kernels. Walnut kernels contain 15%-22% protein, of which more than 96% is absorbable by humans, the highest proportion compared with soybeans, peanuts, almonds, hazelnuts and eggs [1]. The protein content of walnut kernels is one of the most important factors determining the perceived quality and ultimate price of walnut products. Obtaining the walnut kernel protein content nondestructively and accurately is crucial for walnut kernel quality grading and would help assure the competitiveness and profitability of the walnut industry.
The traditional methods of kernel protein detection are chemical measurements, which are destructive and can cause environmental contamination.
Considering practical demands, it is necessary to develop a fast and efficient method for detecting walnut kernel protein content. NIR spectroscopy, which relies on the electromagnetic characteristics of matter, makes it possible to detect the protein content of food quickly and noninvasively, which helps ensure the safety of food production throughout the inspection process [2,3]. However, spectroscopic measurements are generally obtained from a limited area (e.g., point measurement using an ASD FieldSpec FR spectroradiometer (Analytical Spectral Devices, Inc., Boulder, CO, USA)) [4,5] or from prepared samples of a specific size (e.g., limited-area measurement using a MATRIX-I FT-NIR analyzer with a rotating sample pool (Bruker Optical Company, Germany)) [6]. They do not provide image information about the objects, such as texture or location, which is important in many food inspection applications [7,8].
The advent of hyperspectral technology has enabled the integration of spectroscopy and imaging analysis. It provides more information about the target objects and makes it possible to further improve the accuracy of composition detection. Hyperspectral data containing full-band radiation information can describe various characteristics associated with the biochemical and physiological traits of targets [9-13]. However, the data redundancy and band autocorrelation in hyperspectral information can increase computational complexity and lead to incorrect results.
In recent years, deep learning, as a state-of-the-art technique, has been extensively applied to hyperspectral image processing, where features are learned automatically according to the target task [14-18]. A large number of labeled samples is needed to ensure the stability of deep learning models. In actual production, however, collecting such large amounts of labeled data is expensive or generally impossible [19]. The limited size of labeled sample sets therefore constrains the wide application and the performance of deep learning algorithms in feature extraction and composition determination [20,21]. It is thus important to fully exploit the effective features contained in hyperspectral images and to reduce the dependence of feature selection on the volume of the dataset.
In this setting, the advantages of feature selection methods that embed prior knowledge become obvious, because they do not need large amounts of additional labeled samples for training. Among the feature selection approaches, swarm-based optimization algorithms that mimic biological or physical phenomena can be used to solve complex feature selection problems [22-25]. Typical swarm-based optimization algorithms, such as the genetic algorithm and the particle swarm optimization algorithm, can improve their stochastic capability through parameter settings. However, there is no guarantee that these algorithms can search globally and escape local optima in hyperspectral feature selection. Moreover, some parameters, such as the mutation rate and the particle velocity, can seriously affect the quality of the solution and have to be set on the basis of considerable prior experience [26]. A novel nature-inspired meta-heuristic optimization algorithm, the Whale Optimization Algorithm (WOA), was introduced in 2016 [27]. It mimics the social behavior of humpback whales and is inspired by their bubble-net hunting strategy. WOA includes three operators that simulate the search for prey, the encircling of prey, and the bubble-net foraging behavior of humpback whales, and it shows advantages over other state-of-the-art meta-heuristic methods in exploration, exploitation, local optima avoidance, and convergence behavior [25,28,29].
However, in feature selection with WOA, most studies adopt only a random principle to set up the whale foraging behaviors. The selected features then depend mostly on the performance of WOA itself, and the contribution of hyperspectral characteristics, including amplitude and shape information, is not fully considered [30]. In addition, three factors are important in variable selection.
The first is the discrepancy among the selected wavebands, the second is the ability of the selected wavebands to represent the whole spectral information, and the third is the level of correlation with the target variable. To address these issues, this study presents an improved WOA algorithm that combines hyperspectral characteristics with feature selection criteria.
This study takes walnut as the research object and aims to propose a novel inversion method for estimating walnut protein content from hyperspectral information. The paper is organized as follows. Section 2 introduces the walnut samples and the hyperspectral measurements. Section 3 describes the proposed improved WOA feature selection method and the random forest regression model. Section 4 presents the selected protein-sensitive wavebands and the validation of the protein inversions based on different training datasets. Section 5 discusses the advantages and limitations of the proposed method. Section 6 concludes the paper with a summary of the results.

Walnut samples and pre-treatment
In this study, Xinjiang 'Wen185' walnuts were selected as the target objects. The experimental samples, with a water content of 7%, were stored at 4°C for about 5 months before the experiment. After the shells were broken manually, 30 walnut kernel samples were collected. The front and back sides of the half-kernel samples are shown in Figure 1. After the hyperspectral images were collected with the hyperspectral imaging equipment, the protein content was measured with a Kjeltec automatic nitrogen analyzer (Foss, Denmark). Each sample was crushed, and 0.3 g was taken for the analyzer. After digestion, the samples were distilled and titrated to obtain the total nitrogen content. The total nitrogen content was then multiplied by a conversion factor (6.25) to calculate the protein content of the walnuts, in accordance with the National Food Safety Standard GB5009.5-2010. After the hyperspectral imager was preheated, the lens was focused and the moving speed of the platform was adjusted to avoid image distortion. The image acquisition software Spectrum View was used to collect the imaging information of the walnut kernels. The hyperspectral measurement ranges were 863-1704 nm and 382-1027 nm, with spectral resolutions of 3.2 nm and 0.84 nm, respectively. To eliminate the noise caused by uneven illumination, the measurement environment and the dark current of the instrument, white background information (I_w) and black background information (I_b) were collected with a standard whiteboard and the lens cover, respectively, before the hyperspectral images of the samples were collected. The corrected hyperspectral image (I) was then obtained from the original hyperspectral image (I_o) according to Equation (1).
$$I = \frac{I_o - I_b}{I_w - I_b} \quad (1)$$
where I is the corrected image, I_w is the white background image, I_b is the black background image, and I_o is the original image.
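The correction can be sketched in a few lines. The example below assumes that Equation (1) is the usual black/white reflectance correction, I = (I_o - I_b) / (I_w - I_b), applied element by element. The function name `calibrate`, the `eps` guard and the digital numbers are illustrative assumptions, not part of the original workflow.

```python
def calibrate(raw, white, dark, eps=1e-12):
    """Black/white correction of one spectrum (or one flattened image band).

    raw, white, dark: equal-length sequences holding I_o, I_w and I_b.
    Returns I = (I_o - I_b) / (I_w - I_b) for every element.
    """
    if not (len(raw) == len(white) == len(dark)):
        raise ValueError("raw, white and dark must have the same length")
    corrected = []
    for i_o, i_w, i_b in zip(raw, white, dark):
        span = i_w - i_b
        # Assumption: elements where white and dark coincide are set to 0.0
        corrected.append((i_o - i_b) / span if abs(span) > eps else 0.0)
    return corrected


# Hypothetical digital numbers for three bands of one pixel
raw = [1200.0, 2500.0, 4000.0]
white = [4100.0, 4100.0, 4100.0]
dark = [100.0, 100.0, 100.0]
reflectance = calibrate(raw, white, dark)
print(reflectance)
```

Dividing by the white-minus-dark span puts every band on a relative reflectance scale, which is what removes the illumination and dark-current effects mentioned above.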

Improved whale optimization algorithm (WOA)
Hyperspectral feature selection using the original WOA [27] relies mostly on the performance of WOA itself and does not make full use of hyperspectral characteristics to mine valid information in depth. Therefore, a comprehensive evaluation criterion for hyperspectral variable selection is introduced into the WOA optimization process. It takes three aspects into account: the degree of correlation between the wavelength information and the target (protein content), the redundancy among the selected wavelengths, and the ability of the selected wavelengths to represent the full spectrum. The wavelength information includes the amplitude and shape information that is unique to hyperspectral data. Based on these principles, the Pearson correlation coefficient is used to calculate the three indicators. The general expression, which was used as the fitness function of the algorithm (Equation (2)), uses the following quantities: ρ_shp is the degree of correlation between the spectral shape information and the target; x is the spectral information; x_a is the spectral amplitude; x_s is the spectral shape value; y is the protein content; N is the number of all wavelengths; and SN is the number of selected wavelengths.
In the above expression, the spectral shape information is the angle between the direction of the spectral curve between adjacent bands and the horizontal direction.
First, the hyperspectral reflectance was normalized to the range from 1 to the number of wavelengths. The angle was then computed as shown in Equation (11). Subsequently, the angle information of the entire curve was obtained and used to describe the shape characteristics of the hyperspectral curve. The feature selection evaluation criteria are fused with the original WOA through the whale hunting behavior and the fitness function of the optimization algorithm. This helps the whales escape local optima during the search and allows the selected features to be evaluated comprehensively in terms of representativeness, relevance and redundancy. In addition, the amplitude and shape information that is unique to hyperspectral data is used to mine the hyperspectral information more fully.
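To make the shape information and the three-part criterion concrete, the sketch below computes segment angles and one possible Pearson-based score. The exact fitness expression is not reproduced in this text, so the equal-weight combination (relevance plus representativeness minus redundancy), the reuse of the last angle for the final band, and the names `shape_angles` and `selection_criterion` are all assumptions made for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def shape_angles(spectrum):
    """Angle (radians) between each adjacent-band segment and the horizontal.

    Reflectance is rescaled to [1, N] so that it spans the same range as the
    band index; the last band reuses the preceding angle (assumption).
    """
    n = len(spectrum)
    lo, hi = min(spectrum), max(spectrum)
    scale = (n - 1) / (hi - lo) if hi > lo else 0.0
    scaled = [1 + (r - lo) * scale for r in spectrum]
    angles = [math.atan2(scaled[k + 1] - scaled[k], 1.0) for k in range(n - 1)]
    return angles + angles[-1:]


def selection_criterion(selected, amplitude, shape, protein):
    """Score a list of band indices; amplitude[s][b] and shape[s][b] are per sample and band."""
    def column(matrix, band):
        return [row[band] for row in matrix]

    n_bands = len(amplitude[0])
    # Relevance: mean |r| of amplitude and shape with the protein content
    relevance = sum(
        abs(pearson(column(amplitude, b), protein)) + abs(pearson(column(shape, b), protein))
        for b in selected
    ) / (2 * len(selected))
    # Redundancy: mean |r| between all pairs of selected bands
    pairs = [(a, b) for k, a in enumerate(selected) for b in selected[k + 1:]]
    redundancy = (
        sum(abs(pearson(column(amplitude, a), column(amplitude, b))) for a, b in pairs) / len(pairs)
        if pairs else 0.0
    )
    # Representativeness: how well each band is matched by its best selected band
    representativeness = sum(
        max(abs(pearson(column(amplitude, b), column(amplitude, s))) for s in selected)
        for b in range(n_bands)
    ) / n_bands
    return relevance + representativeness - redundancy


print(shape_angles([0.10, 0.25, 0.20, 0.40]))
```

Each of the three terms lies between 0 and 1, so the combined score rewards band sets that track protein content and cover the spectrum while penalizing sets of mutually correlated bands.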
The process of the improved WOA algorithm is as follows:
Step 1 Initialize the whale population locations, and set the whale population size and the maximum number of iterations.
Step 2 Calculate the fitness (Equation (2)) of all whales in the group, and record the one with the highest fitness value as the location of the prey X_p.
Step 3 Update the parameters a, A, C, l and p, which are set as in [27].
Step 4 Update the location of each whale:
    if p < 0.5
        if |A| < 1, apply the encirclement contraction method and update the position of each whale according to Equation (5)
        else if |A| ≥ 1, apply random exploration and update the position of each whale according to Equation (6)
        end
    else if p ≥ 0.5, apply the bubble-net feeding method and update the position of each whale according to Equation (7)
    end
Step 5 Calculate the fitness of all whales in the population with Equation (2), and record the one with the largest fitness value as the optimal solution. If the iteration stopping condition is met or the maximum number of iterations is reached, the algorithm stops; otherwise, it returns to Step 3.
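The five steps can be sketched as a compact loop. This is a minimal illustration that assumes Equations (5)-(7) follow the standard WOA updates of [27]. The scalar A and C per whale, the spiral constant b = 1, the top-k mapping from a continuous position to band indices, and the toy per-band scores are simplifying assumptions, and `woa_maximize` and `top_k_bands` are hypothetical names.

```python
import math
import random


def woa_maximize(fitness, dim, n_whales=20, n_iter=100, lower=0.0, upper=1.0, b=1.0, seed=0):
    """Maximize `fitness` over [lower, upper]^dim with a basic whale optimization loop."""
    rng = random.Random(seed)

    def clip(v):
        return min(upper, max(lower, v))

    # Step 1: random initial population
    whales = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_whales)]
    # Step 2: the fittest whale marks the prey position
    scores = [fitness(w) for w in whales]
    best_i = max(range(n_whales), key=scores.__getitem__)
    best, best_score = whales[best_i][:], scores[best_i]
    history = [best_score]
    for t in range(n_iter):
        a = 2.0 - 2.0 * t / n_iter                 # Step 3: a shrinks linearly from 2
        for i, x in enumerate(whales):
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            l = rng.uniform(-1.0, 1.0)
            p = rng.random()
            if p < 0.5:                            # Step 4: encircle or explore
                target = best if abs(A) < 1.0 else whales[rng.randrange(n_whales)]
                new = [clip(tg - A * abs(C * tg - xi)) for tg, xi in zip(target, x)]
            else:                                  # Step 4: bubble-net spiral
                spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
                new = [clip(abs(bs - xi) * spiral + bs) for bs, xi in zip(best, x)]
            whales[i] = new
        for w in whales:                           # Step 5: keep the best solution so far
            s = fitness(w)
            if s > best_score:
                best, best_score = w[:], s
        history.append(best_score)
    return best, best_score, history


def top_k_bands(position, k):
    """Map a continuous position to the indices of its k largest entries."""
    order = sorted(range(len(position)), key=lambda j: position[j], reverse=True)
    return sorted(order[:k])


# Toy fitness: hypothetical per-band scores; prefer the three highest-scoring bands
toy_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]


def toy_fitness(position):
    return sum(toy_scores[j] for j in top_k_bands(position, 3))


best, best_score, history = woa_maximize(toy_fitness, dim=len(toy_scores), n_iter=50)
print(top_k_bands(best, 3), round(best_score, 3))
```

In the improved algorithm, `toy_fitness` would be replaced by the comprehensive selection criterion evaluated on the bands returned by the position-to-band mapping.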

Gray level co-occurrence Matrix (GLCM)
The recurrence of pixel gray levels at spatial locations forms the texture of an image, and the GLCM describes the joint distribution of the gray levels of two pixels with a given spatial relationship. Haralick et al. [31] proposed the GLCM to characterize texture features. In this research, the GLCMs of the walnut images were calculated first, and then four texture statistics were calculated. Contrast (Con) describes the sharpness of the texture and is calculated by Equation (8). Dissimilarity (Dis) and homogeneity (Homo) reflect the dissimilarity and the homogeneity of the texture, that is, its local textural variation, and are calculated by Equations (9) and (10), respectively. Energy (E), an indicator of the uniformity of the gray-level distribution and the coarseness of the texture, is calculated by Equation (11).
$$\mathrm{Con} = \sum_{i,j=1}^{N_g} \mathrm{GLCM}(i,j)\,(i-j)^2 \quad (8)$$
$$\mathrm{Dis} = \sum_{i,j=1}^{N_g} \mathrm{GLCM}(i,j)\,|i-j| \quad (9)$$
$$\mathrm{Homo} = \sum_{i,j=1}^{N_g} \frac{\mathrm{GLCM}(i,j)}{1+(i-j)^2} \quad (10)$$
$$E = \sum_{i,j=1}^{N_g} \mathrm{GLCM}(i,j)^2 \quad (11)$$
where GLCM(i,j) is the normalized co-occurrence frequency of gray levels i and j, and N_g is the number of gray levels.
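The four indicators can be illustrated with a small pure-Python sketch. It assumes the standard definitions of contrast, dissimilarity, homogeneity and energy (the sum of squared GLCM entries), a symmetric and normalized GLCM, and a single horizontal pixel offset. The quantization level, the offset and the function names are illustrative choices rather than the settings used in this study.

```python
def glcm(image, levels, dr=0, dc=1):
    """Symmetric, normalized co-occurrence matrix for a non-negative offset (dr, dc).

    image: 2-D list of integer gray levels in the range 0..levels-1.
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows - dr):
        for c in range(cols - dc):
            i, j = image[r][c], image[r + dr][c + dc]
            counts[i][j] += 1.0
            counts[j][i] += 1.0          # count every pixel pair in both directions
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("image is too small for the requested offset")
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Contrast, dissimilarity, homogeneity and energy of a normalized GLCM."""
    feats = {"contrast": 0.0, "dissimilarity": 0.0, "homogeneity": 0.0, "energy": 0.0}
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            d = i - j
            feats["contrast"] += v * d * d
            feats["dissimilarity"] += v * abs(d)
            feats["homogeneity"] += v / (1 + d * d)
            feats["energy"] += v * v
    return feats


# Hypothetical 4 x 4 patch quantized to 4 gray levels
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
print(texture_features(glcm(patch, levels=4)))
```

A perfectly uniform patch gives zero contrast and dissimilarity with homogeneity and energy equal to 1, while a fine checkerboard pushes contrast up and energy down, which matches the interpretation of the indicators given above.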

Random forest
The random forest algorithm consists of multiple decision trees. It is an efficient and reliable ensemble learning method with good tolerance to outliers and noise in the sample data. The randomness of a random forest lies in both sample randomness and feature randomness. The algorithm adopts the bagging strategy to train the decision trees: m subsamples are randomly drawn from the original training set to build m decision trees, and a random subset of the independent variables is then used as candidates for the nodes of each tree. This reduces the correlation between the decision trees and diversifies the decision process.
Because the samples that are not drawn for a given tree under the bagging strategy form its out-of-bag dataset, no extra data need to be set aside for cross-validation. The random sample selection also reduces the computational effort, while the final decision aggregates the information of all decision trees to ensure the prediction accuracy of the model.
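The bagging and out-of-bag idea can be sketched without a full forest. In the example below, a mean predictor is a deliberately trivial stand-in for a decision tree, the random feature subsampling is omitted, and the protein values are hypothetical. It is meant to show only how in-bag and out-of-bag samples are formed and aggregated.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def oob_estimate(y, n_trees, fit, predict, seed=0):
    """Average, for each sample, the predictions of the models that never saw it."""
    rng = random.Random(seed)
    n = len(y)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        in_bag, out_of_bag = bootstrap_split(n, rng)
        model = fit(in_bag)
        for i in out_of_bag:
            sums[i] += predict(model, i)
            counts[i] += 1
    # None marks a sample that happened to be in-bag for every model
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical protein contents (g/kg) of a small calibration set
protein = [152.0, 160.5, 171.0, 149.0, 183.5, 166.0, 158.0, 175.5]


def fit_mean(indices):
    # Stand-in learner: the "model" is just the mean of the in-bag targets
    return sum(protein[i] for i in indices) / len(indices)


def predict_mean(model, index):
    return model


print(oob_estimate(protein, n_trees=50, fit=fit_mean, predict=predict_mean))
```

Comparing these out-of-bag predictions with the measured values gives an internal error estimate, which is why no separate cross-validation split is required.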

Walnut NIR spectra and selected wavelengths
Walnut kernels are mainly composed of fat, protein, water, sugar and trace elements. The near-infrared domain contains most of the characteristic bands related to these components. Therefore, because of the high sensitivity of the spectral signal in the NIR region, only the hyperspectral images in the range of 863-1704 nm were selected for subsequent analysis in this study. Figure 2 shows the NIR spectral curves extracted from the hyperspectral images of the 30 walnut kernel samples. The spectral curves of the different samples follow similar trends despite their different protein contents. The spectral information at both ends of the curves (before 870 nm and after 1680 nm) contains a considerable amount of noise because of the vibration of the measurement system. Two obvious peaks around 1210 nm and 1470 nm can be observed, which are caused by the water content. Apart from these water-related peaks, the reflectance peaks of other components are not obvious, so further processing of the spectra was required.

Figure 2 Spectral curves of walnut samples
As described in Section 3.1, the proposed improved WOA for selecting the feature wavelengths of walnut kernels was computed and visualized with self-developed software. The positions of the wavebands selected by the improved WOA within the whole spectral range are shown in Figure 3. Eight wavelengths around 996, 1225, 1232, 1377, 1552, 1600, 1691 and 1700 nm were selected. Previous publications revealed that 996 nm corresponded to protein content in the identification of rough rice species and years by visible/near-infrared spectroscopy [32]. Tallada et al. [33] found significant protein absorption peaks between 1175 and 1225 nm in the mean spectral profile of 87 maize seed samples. Nagao et al. [34] determined the fat content in meats using a combination of absorbances at 1208 nm and 1230 nm. The optical transmittance spectrum recorded for a grown glycine phosphite single crystal shows an absorption edge at 1377 nm in the upper wavelength region, so the grown crystal has a transmittance window at 1377 nm with nearly 100% transparency [35]. Yadav et al. [36] used a tapered fiber optic biosensor operating at 1550 nm to detect protein concentration. Capus and Cockcroft [37] measured the refractive indices of protein solutions of different concentrations using an Abbe-type refractometer at a wavelength of 1700 nm. The wavelengths selected by the improved WOA are scattered across the whole spectral range, which indicates that the method reduced the autocorrelation and redundancy of the selected wavelengths. The selected wavelengths can also represent the whole spectral information to a certain extent.
In addition, according to these previous studies, most of the characteristic bands selected by the improved WOA in this research are mechanistically related to chemical compounds associated with the protein content of walnut kernels.
Figure 3 Spectral curve and selected wavelengths of walnut protein

Protein estimation using selected wavelengths
To verify the protein predictive ability of the selected sensitive wavelengths, this study first extracted the spectral curves of the regions of interest on the front and back sides of the walnut kernels and averaged them to obtain the full-spectrum information. The eight wavelengths, namely 996, 1225, 1232, 1377, 1552, 1600, 1691 and 1700 nm, were then extracted. Two datasets, one containing the full-spectrum information in the range of 862-1710 nm and one containing the eight selected wavelengths, were created for the subsequent modelling and comparison. The protein contents of the 30 walnut kernel samples in the dataset were in the range of 12.5%-20%.
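Building the two datasets amounts to averaging the front and back spectra and then picking the bands closest to the selected wavelengths. The sketch below illustrates this under stated assumptions: the band centers are a hypothetical evenly spaced grid starting at 863 nm with a 3.2 nm step, the reflectance values are placeholders, and `mean_spectrum` and `nearest_band_indices` are illustrative helper names.

```python
def mean_spectrum(front, back):
    """Average the front-side and back-side spectra of one kernel band by band."""
    if len(front) != len(back):
        raise ValueError("front and back spectra must have the same length")
    return [(f + b) / 2.0 for f, b in zip(front, back)]


def nearest_band_indices(band_centers, targets):
    """Index of the band whose center wavelength is closest to each target."""
    return [
        min(range(len(band_centers)), key=lambda k: abs(band_centers[k] - t))
        for t in targets
    ]


# Hypothetical band grid: 863 nm start, 3.2 nm step (actual band centers may differ)
band_centers = [863.0 + 3.2 * k for k in range(264)]
selected_nm = [996, 1225, 1232, 1377, 1552, 1600, 1691, 1700]

# Placeholder reflectance values for one kernel
front = [0.40 + 0.0005 * k for k in range(264)]
back = [0.42 + 0.0004 * k for k in range(264)]

full = mean_spectrum(front, back)                 # one row of the full-spectrum dataset
idx = nearest_band_indices(band_centers, selected_nm)
reduced = [full[k] for k in idx]                  # one row of the eight-wavelength dataset
print(idx)
```

Repeating this for every kernel gives one matrix with all bands and one with eight columns, which are the two inputs compared in the following models.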
RF models for walnut kernel protein content estimation were established separately on the full-spectrum information and on the eight selected wavelengths. The samples were divided into two groups: 20 samples formed the calibration group and the remaining 10 samples formed the validation group.
For the established models, 1:1 plots of predicted against observed values were drawn to demonstrate the reliability and consistency of the models. The calibration and validation results of the RF model based on the full spectrum are shown in Figure 4, and the walnut kernel protein prediction results based on the eight wavelengths are shown in Figure 5. The calibration R² of the models based on the full spectrum and on the eight selected wavelengths reached 0.9149 and 0.8860, and the root mean square errors of calibration (RMSE_cal) were 11.1242 g/kg and 12.0227 g/kg, respectively. The validation R² reached 0.4128 and 0.7087, and the root mean square errors of prediction (RMSE_val) were 19.2065 g/kg and 19.7313 g/kg, respectively. The walnut kernel protein prediction based on the eight selected wavelengths therefore obtained a clearly higher validation R² than the prediction based on the full spectrum, with a comparable RMSE_val. This comparison is verified with other regression models in Section 5.
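For reference, the two accuracy measures can be computed as follows. The sketch assumes that R² is the coefficient of determination (one minus the ratio of residual to total sum of squares); a squared correlation coefficient would give different values. The protein values shown are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted protein contents (g/kg)
observed = [150.0, 162.0, 171.0, 185.0, 158.0]
predicted = [148.0, 165.0, 169.0, 178.0, 160.0]
print(round(r_squared(observed, predicted), 4), round(rmse(observed, predicted), 4))
```

Reporting both measures for the calibration and validation groups separates goodness of fit from generalization, which is the distinction that matters when the calibration R² is high and the validation R² is much lower.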

Protein estimation using the combination of spectral and texture information
To further increase the accuracy of walnut kernel protein prediction, the texture characteristics of the front and back sides of the kernels were included in the modelling. As described in Section 3.2, four texture indicators (contrast, dissimilarity, homogeneity and energy) were combined with the eight selected wavelengths to form a mixed dataset for retrieving the protein content of walnut kernels. Twenty samples were used to calibrate the model, and the remaining 10 samples formed the validation group. The calibration and validation results of the RF regression model based on the mixed dataset of eight wavelengths and four texture indicators are shown in Figure 6. The validation R² of the model based on the mixed dataset increased to 0.8537, and the RMSE_val was 18.9288 g/kg. However, Figures 4 to 6 show that the predicted values are lower than the measured values in the validation process. The mixed dataset reduced this underestimation but did not eliminate it.

Discussion
Prior works have made various attempts to retrieve nutritional factors of nuts using hyperspectral technology [14,16,38,39]. However, a universal method for selecting the characteristic wavelengths for protein content inversion is still needed. In this study, an improved WOA method that combines swarm intelligence with feature selection criteria was proposed to identify the wavelengths sensitive to walnut kernel protein from the perspectives of mechanism and predictive ability. The advantages and limitations of the proposed method are discussed as follows.
(1) The proposed waveband selection method is based on WOA, which has advantages in exploration, exploitation, local optima avoidance, and convergence behavior [28,29]. To further improve the performance of WOA in feature selection, a comprehensive criterion was merged with WOA to make full use of both the spectral information and the strengths of WOA. The criterion considers three aspects throughout the feature selection process: the degree of correlation between the wavelength information and the target (protein content), the redundancy among the selected wavelengths, and the ability of the selected wavelengths to represent the full spectrum. In addition, the shape information, which is unique to hyperspectral data, is considered in this study to mine the NIR spectral characteristics more fully. Most of the wavelengths selected by the proposed method are mechanistically related to protein content in walnut kernels, as supported by previous studies [32,34-37]. The proposed method could also be applied to other areas, and further validation on other nutritional factors of nuts needs to be conducted in the future. The results of this study could be transferred to nut packinghouses and processing industries.
(2) To further verify the validity of the selected features and of the RF model for walnut kernel protein inversion, two other commonly used machine learning (ML) regression models, support vector machine (SVM) and back propagation neural network (BPNN), were selected for comparison. The three regression models were each established on the full spectrum, on the eight selected wavelengths, and on a mixed dataset containing the selected spectral information and the texture features. The inversion accuracies of the three models are listed in Table 1. The results demonstrate that, for all three regression models, the models based on the eight selected wavelengths perform better than or comparably to the ones based on the full spectrum, and that including the texture metrics further increases the protein inversion accuracy. The literature shows that the external epidermis of walnut kernels develops from the seed coat, which affects asparagine synthesis [40]. The textural characteristics of walnut kernels can therefore reflect their internal quality to some extent. Among the three models, the RF model has the best inversion performance with the same input. In particular, the RF model based on the mixed spectral and texture features gave the best results, with an R² of 0.8537 and an RMSE of 18.9288 g/kg.
(3) This research proposed a spectral feature extraction method and provided evidence that walnut kernel protein content can be estimated from the extracted features. However, some limitations are worth noting. The small sample size limited the application of existing algorithms, including deep learning algorithms, which constrained further comparative analyses. As for the protein inversion results, Figures 4-6 show that the established models underestimate the protein content in the validation process, which may be caused by the limited training set. In future work, more samples covering different sizes, species and colors should be included to further validate the generalization of the selected features and the model. After the dataset is expanded, more comparative analyses with existing methods should be conducted to verify the superiority of the proposed method. Future work should also evaluate whether the proposed hybrid method is suitable for protein inversion in other nut species.

Conclusions
To make effective use of the hyperspectral information of walnut kernel samples, a novel method that merges WOA with feature selection criteria was proposed to screen the wavebands sensitive to walnut protein. The sensitive wavebands obtained were then combined with texture indicators to predict walnut protein content. The main conclusions are as follows: (1) After wavelength selection with the proposed improved WOA method, eight wavelengths (996, 1225, 1232, 1377, 1552, 1600, 1691 and 1700 nm) were determined as the wavebands sensitive to protein content. According to the previous literature, all eight wavelengths are mechanistically related to chemical compounds associated with the protein content of walnut kernels, which supports the effectiveness of the improved WOA in wavelength selection.
(2) The RF regression model based on the selected wavebands achieved better accuracy than the full-spectrum regression model. In addition, the model based on the combination of the selected wavelengths and the texture indicators reached the highest accuracy in predicting walnut protein content. These results indicate the effectiveness of the sensitive wavebands selected with the improved WOA method, and show that making full use of the hyperspectral information gives high predictive ability for protein content.

Figure 1 Front and back side of the half kernel samples

Hyper-spectrum determination
Hyperspectral images of the walnut kernels were measured in a laboratory with a hyperspectral imager (Gaia sorter, Zhuoli Hanguang Company, Beijing, China), which is mainly composed of an imaging spectrometer (V10E), a lens (OL23), a CCD (LT365), a uniform light source (2 sets of tungsten bromide lamps), an electrically controlled mobile platform, a computer and a software control system. The instrument was warmed up after startup to eliminate the impact of baseline drift.

Figure 4 Calibration and validation of walnut kernel protein prediction of the RF regression models based on full spectral information

Figure 6 Calibration and validation of walnut kernel protein prediction of the RF regression models based on the mixed dataset
Compared with the model established by solely spectral information, the addition of texture information did not