This method [19] identifies an optimal parameter value for a machine learning algorithm based on the asymptotic properties of the nearest neighbor classification rule [20] by comparing a set of labeled data from a class of interest against a set of unlabeled data. In this quasi-supervised learning framework, the objective is to estimate which samples in the unlabeled dataset overlap with the labeled set of the target class. In this study, we adapt this method to a supervised framework, in which the optimum $k$ is selected on the training set by comparing a set of RD samples against a set of control samples using the previously described optimization method and classifier [19]. The $k$ value selected on the training set is then used for classifying the test samples with the $k$-nearest neighbors classifier. To find the optimal $k$ value, the optimization method calculates the cost function $L(k)$ for each possible value of $k$, from 1 to the number of RD samples in the training set, and selects the value that minimizes $L(k)$. This cost function is defined as
\begin{eqnarray}
L(k) &=& 4\,\frac{|D_1|}{|D_0|} \sum_{j \in D_0} P(C_0 \mid x_j) \cdot P(C_1 \mid x_j) \nonumber \\
&& +\, 4 \sum_{j \in D_1} P(C_0 \mid x_j) \cdot P(C_1 \mid x_j) + 2k
\end{eqnarray}
where $D_0 = \{(x_j, y_j) \mid y_j = C_0\}$ and $D_1 = \{(x_j, y_j) \mid y_j = C_1\}$ denote the subsets of the training set that contain only the first- and second-class samples, respectively, and $|D_0|$ and $|D_1|$ denote the number of samples in these subsets. $P(C_0 \mid x_j)$ and $P(C_1 \mid x_j)$ are the probabilities of classifying sample $j$ with the first and second class, respectively, using the remaining training samples, assuming that its class label $y_j$ is unknown. The product of these probabilities reaches its minimum as the certainty of the classification increases: in that case, one probability approaches 1 and the other approaches 0, so their product is close to 0. Thus, the first and second terms in Equation 1 penalize classification uncertainty for the training samples in $D_0$ and $D_1$, respectively. The third term, $2k$, penalizes selecting large $k$ values and thus serves as a regularization term.
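For concreteness, the cost in Equation 1 can be evaluated directly once the per-sample probabilities are available. The Python sketch below is only an illustration; the function name `cost_L` and its arguments are ours, and the probabilities are assumed to have been estimated beforehand, as described next.

```python
import numpy as np

def cost_L(k, p1_d0, p1_d1):
    """Cost L(k) of Equation 1 for a candidate neighborhood size k.

    p1_d0 : estimated P(C1 | x_j) for every training sample j in D0
    p1_d1 : estimated P(C1 | x_j) for every training sample j in D1
    In the two-class setting, P(C0 | x_j) = 1 - P(C1 | x_j).
    """
    p1_d0 = np.asarray(p1_d0, dtype=float)
    p1_d1 = np.asarray(p1_d1, dtype=float)
    n0, n1 = p1_d0.size, p1_d1.size          # |D0| and |D1|

    # First term: classification uncertainty over D0, weighted by |D1|/|D0|.
    term_d0 = 4.0 * (n1 / n0) * np.sum((1.0 - p1_d0) * p1_d0)
    # Second term: classification uncertainty over D1.
    term_d1 = 4.0 * np.sum((1.0 - p1_d1) * p1_d1)
    # Third term: regularization penalizing large k.
    return term_d0 + term_d1 + 2.0 * k
```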
The probabilities $P(C_0 \mid x_j)$ and $P(C_1 \mid x_j)$ are estimated with the average rates $f_0(j)$ and $f_1(j)$, respectively [19], using the quasi-supervised learning algorithm [20]. This estimation relies on repeating the following experiment $M$ times for each training sample $j$ separately and obtaining the average rate at which this sample is classified with the classes $C_0$ and $C_1$ over the $M$ trials. In each trial, a reference set $R_m$ is created by randomly selecting $k$ samples from $D_0$ and $k$ samples from $D_1$, and the sample $j$ is classified with the label of its nearest neighbor in $R_m$. Although the number of trials $M$ is in theory very large, the quasi-supervised learning algorithm provides a computationally efficient numerical approximation of these rates [20].
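The efficient approximation of references 19 and 20 is not reproduced here; the following sketch instead simulates the experiment directly (brute force), which suffices to illustrate what $f_0(j)$ and $f_1(j)$ measure. Function and variable names are illustrative, and class labels 0 and 1 are assumed to encode $C_0$ (controls) and $C_1$ (RD samples).

```python
import numpy as np

def estimate_rates(X, y, j, k, M=1000, seed=None):
    """Brute-force simulation of the experiment: returns (f0, f1), the average
    rates of assigning training sample j to classes C0 and C1 over M trials.

    X : (n, d) array of training features; y : (n,) array of labels in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y == 0)           # indices of D0
    idx1 = np.flatnonzero(y == 1)           # indices of D1
    idx0 = idx0[idx0 != j]                  # hold sample j out: its label
    idx1 = idx1[idx1 != j]                  # is treated as unknown

    hits_c1 = 0
    for _ in range(M):
        # Reference set R_m: k random samples from D0 and k from D1.
        ref = np.concatenate([rng.choice(idx0, size=k, replace=False),
                              rng.choice(idx1, size=k, replace=False)])
        # Classify j with the label of its nearest neighbor in R_m.
        nearest = ref[np.argmin(np.linalg.norm(X[ref] - X[j], axis=1))]
        hits_c1 += int(y[nearest] == 1)

    f1 = hits_c1 / M
    return 1.0 - f1, f1
```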
Once it is determined on the training samples, the same $k$ value is used for classifying the test samples with the $k$-nearest neighbors classifier.
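Putting the pieces together, a hypothetical end-to-end sketch could look as follows, reusing the `cost_L` and `estimate_rates` helpers sketched above and scikit-learn's `KNeighborsClassifier` for the final classification. This exhaustive search is far slower than the efficient approximation used by the actual algorithm [19,20] and is meant only to make the workflow explicit.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def select_k_and_classify(X_train, y_train, X_test, M=1000):
    """Pick k by minimizing L(k) on the training set, then classify the test
    samples with a standard k-nearest neighbors classifier."""
    n_rd = int(np.sum(y_train == 1))        # number of RD samples (class C1)
    n_ctrl = int(np.sum(y_train == 0))      # number of control samples (C0)
    # k runs from 1 toward the number of RD samples; it is capped here so that
    # k candidates per class remain after holding each sample out.
    k_max = min(n_rd, n_ctrl) - 1

    best_k, best_cost = 1, np.inf
    for k in range(1, k_max + 1):
        p1_d0 = [estimate_rates(X_train, y_train, j, k, M)[1]
                 for j in np.flatnonzero(y_train == 0)]
        p1_d1 = [estimate_rates(X_train, y_train, j, k, M)[1]
                 for j in np.flatnonzero(y_train == 1)]
        cost = cost_L(k, p1_d0, p1_d1)
        if cost < best_cost:
            best_k, best_cost = k, cost

    clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return best_k, clf.predict(X_test)
```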