The Random Forest algorithm provided the highest predictive accuracy, with an AUC of 0.8, and is therefore presented in detail. The Random Forest ROC curve estimated from the LOO cross-validation analysis is shown in green in the left panel of Figure 3, together with the ROC curves of all the other classifiers for comparison.
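As a concrete illustration, a leave-one-out estimate of the Random Forest ROC curve can be obtained along the following lines. This is a minimal sketch, not the authors' code: the data frame dat, its outcome column status and all simulated values are hypothetical placeholders, and only the predictor names echo those in Table 3.

library(randomForest)
library(pROC)

# Placeholder data standing in for the study dataset; the predictor names
# follow Table 3 but the values and outcome labels are simulated.
set.seed(1)
dat <- data.frame(
  adjacent_area_sum = rnorm(40, mean = rep(c(0, 0.8), each = 20)),
  circ_max          = rnorm(40, mean = rep(c(0, 0.8), each = 20)),
  status            = factor(rep(c("normal", "diabetic"), each = 20))
)

n <- nrow(dat)
loo_prob <- numeric(n)

for (i in seq_len(n)) {
  # Train on all subjects except the i-th, then score the held-out subject
  fit <- randomForest(status ~ ., data = dat[-i, ], ntree = 500)
  loo_prob[i] <- predict(fit, newdata = dat[i, , drop = FALSE],
                         type = "prob")[, "diabetic"]
}

# ROC curve and AUC from the leave-one-out predicted probabilities
roc_rf <- roc(response = dat$status, predictor = loo_prob)
auc(roc_rf)
plot(roc_rf, col = "green")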
The importance of each variable when the Random Forest algorithm was applied to the full dataset, as measured by the mean decrease in accuracy and the mean decrease in Gini coefficient, is shown in Table 3. The mean decrease in accuracy indicates the reduction in classifier accuracy when the values of a predictor variable are randomly permuted, explicitly breaking any potential link between that predictor and the outcome variable. The Gini coefficient is a standard measure of how unequal a probability distribution is; here it indicates how pure, with respect to case mix, the child nodes of a tree are after splitting on the predictor variable. Both measures are obtained by averaging over all the trees in the Random Forest ensemble and were calculated using the randomForest package in R. To average over the nondeterministic element of the Random Forest algorithm, we ran the classifier 2000 times, each with a different random initialization; Table 3 reports the mean decrease in accuracy and the mean decrease in Gini coefficient averaged across these 2000 runs.
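A sketch of how the two importance measures can be extracted and averaged over repeated runs is given below. It is illustrative only, reusing the hypothetical data frame dat from the previous sketch; the number of runs matches the text but can be reduced for speed.

library(randomForest)

n_runs <- 2000                      # repeated fits, as in the text
predictors <- setdiff(names(dat), "status")

acc_runs  <- matrix(NA_real_, nrow = n_runs, ncol = length(predictors),
                    dimnames = list(NULL, predictors))
gini_runs <- acc_runs

for (r in seq_len(n_runs)) {
  set.seed(r)                       # a different random initialization per run
  fit <- randomForest(status ~ ., data = dat, ntree = 500, importance = TRUE)
  imp <- importance(fit)            # rows: predictors; columns include both measures
  acc_runs[r, ]  <- imp[predictors, "MeanDecreaseAccuracy"]
  gini_runs[r, ] <- imp[predictors, "MeanDecreaseGini"]
}

# Run-averaged importance values, one per predictor
colMeans(acc_runs)
colMeans(gini_runs)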
The numbers in brackets in Table 3 denote the range corresponding to ±1.96 standard errors around the average, with the standard error estimated from the sample standard deviation across the 2000 runs. From these figures we would conclude that “adjacent area sum” and “circ max” are the most important predictors for distinguishing normal subjects from diabetics.
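The bracketed ranges can be reproduced along the following lines, continuing from the acc_runs matrix in the previous sketch. Treating the standard error as the sample standard deviation divided by the square root of the number of runs is our reading of the convention described above, not something stated explicitly in the text.

avg <- colMeans(acc_runs)
se  <- apply(acc_runs, 2, sd) / sqrt(nrow(acc_runs))   # standard error of the run average

# Average with the +/- 1.96 standard-error range, one row per predictor
data.frame(
  predictor = colnames(acc_runs),
  average   = avg,
  lower     = avg - 1.96 * se,
  upper     = avg + 1.96 * se
)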