Artificial Intelligence  |   September 2024
Eye-Rubbing Detection Using a Smartwatch: A Feasibility Study Demonstrated High Accuracy With Machine Learning
Author Affiliations & Notes
  • Sina Elahi
    Fondation Ophtalmologique Adolphe de Rothschild, Rue Manin, Paris, France
    Visual Intelligence for Transportation, Ecole Polytechnique Fédérale de Lausanne, Route Cantonale, Lausanne, Switzerland
  • Tom Mery
    Visual Intelligence for Transportation, Ecole Polytechnique Fédérale de Lausanne, Route Cantonale, Lausanne, Switzerland
  • Christophe Panthier
    Fondation Ophtalmologique Adolphe de Rothschild, Rue Manin, Paris, France
  • Alain Saad
    Fondation Ophtalmologique Adolphe de Rothschild, Rue Manin, Paris, France
  • Damien Gatinel
    Fondation Ophtalmologique Adolphe de Rothschild, Rue Manin, Paris, France
    Université Internationale Abulcassis des Sciences de Santé, Rabat, Morocco
  • Alexandre Alahi
    Visual Intelligence for Transportation, Ecole Polytechnique Fédérale de Lausanne, Route Cantonale, Lausanne, Switzerland
  • Correspondence: Sina Elahi, Fondation Ophtalmologique Adolphe de Rothschild, Service du Professeur Gatinel, 29 Rue Manin, Paris 75019, France. e-mail: selahi3000@hotmail.com 
  • Footnotes
     SE and TM contributed equally to this work.
Translational Vision Science & Technology September 2024, Vol.13, 1. doi:https://doi.org/10.1167/tvst.13.9.1
Abstract

Purpose: In this work, we present a new machine learning method based on the transformer neural network to detect eye rubbing using a smartwatch in a real-life setting. In ophthalmology, accurate detection and prevention of eye rubbing could reduce the incidence and progression of ectatic disorders, such as keratoconus, and help prevent blindness.

Methods: Our approach leverages the state-of-the-art capabilities of the transformer network, widely recognized for its success in the field of natural language processing (NLP). We evaluate our method against several baselines using a newly collected dataset, which consists of data from smartwatch sensors associated with various hand-face interactions.

Results: The current algorithm achieves an eye-rubbing detection accuracy greater than 80% with minimal (20 minutes) and up to 97% with moderate (3 hours) user-specific fine-tuning.

Conclusions: This research contributes to advancing eye-rubbing detection and establishes the groundwork for further studies in hand-face interaction monitoring using smartwatches.

Translational Relevance: This experiment is a proof of concept that eye rubbing is effectively detectable and distinguishable from other similar hand gestures solely through a wrist-worn device, and it could lead to further studies and patient education in keratoconus management.

Introduction
Keratoconus is a progressive eye disease that affects the cornea and can cause visual impairment and blindness if left untreated. One of the risk factors for the development and progression of keratoconus is eye rubbing, which can progressively lead to corneal thinning, ectasia, and visual loss.1 A recent in vitro study on porcine eyes has suggested that eye rubbing causes keratoconus and keratoconus progression in susceptible corneas only,2 but the limitations of its methodology may lead it to underestimate the causative link between eye rubbing and the disease.3 Indeed, human corneal biomechanical properties have been shown to be altered within 1 week after a simple 1-minute light eye-rubbing motion.4 Establishing the extent of the importance of eye rubbing in the development of keratoconus or its progression is outside the scope of this paper and is a subject currently widely discussed among corneal experts.5 Although eye rubbing is a common behavior, its frequency and duration are difficult to measure objectively, which hinders efforts to assess its impact on keratoconus and to develop effective interventions. To address this gap, the need for an objective method of eye-rubbing detection is evident. 
Accurate detection of unconscious daily gestures and movement patterns has many potential applications in habit tracking, hygiene, sports, self-improvement, and specifically in healthcare and disease prevention. Key habits to limit transmission of infectious diseases, as during the coronavirus disease 2019 (COVID-19) pandemic, are avoidance of face touching, especially of mucosal membranes (eyes, nose, and mouth), and regular hand washing. It has been estimated that face touching occurs, on average, 23 times per hour.6 However, although efforts have been made to detect various hand-body interactions,7–9 face-touching detection remains a challenge because hand-to-face proximity (in gestures such as glasses removal, eating/drinking, smoking, hair brushing, or toothbrushing) must be distinguished from actual contact. Moreover, differentiating high-risk mucosal membrane contact from contact with skin, glasses, or clothes is of primordial importance, as some contacts may be relevant for disease transmission and others for hygiene or other applications. Researchers have had encouraging results but are limited by either technical restraints, such as multiple or impractical sensors and body instrumentation with multiple devices, or artificial constraints, such as fixed predetermined gestures.8,10,11 Some have achieved promising results, up to accurately predicting the specific area of the face that was touched, however without detecting actual contact, resulting in a high proportion of false positive results.12 Although acceptable for preliminary stages of development, in a real-life situation an increased rate of false positive results will undoubtedly lead to user fatigue and prevent long-term use of such devices. It is therefore key to further improve face-touching detection as well as the specificity of the notifications, while preserving user-friendliness and ease of use and avoiding hardware encumbrance. 
Most studies base their algorithms on readily available accelerometers, which have shown promise in detecting some gestures, such as body tapping,7 but are limited on their own when it comes to detecting face contact. Other sensors have been investigated to improve results, such as proximity sensors, inertial measurement units (IMUs), gyroscopes, or thermosensors, but they never met expectations for real-life scenarios.12–15 Impressive results were also obtained from sound, magnetic field, conduction, or pressure sensors.16–19 However, these rely on cumbersome devices, on an additional emitting device (worn on the finger, for example), or even on smart textile or skin-based sensors.20–22 Some of the most advanced results were obtained with a wrist-worn device combined with strap-based infrared sensors,23,24 impedance tomography,25 force-sensitive sensors,26 or photodiodes and LEDs measuring wrist contour.27,28 
All of these achieved efficient finger recognition from wrist-based sensors wearable on an everyday watch. For face-touching recognition, it is key to correctly identify the fingers' subtle gestures and positions. None of these approaches, however, is usable in a daily-life situation. 
In an attempt to compromise among state-of-the-art technology, daily-life situation, and user-friendliness, we aimed to explore the boundaries of detection using minimally invasive hardware, such as a wristband or smartwatch, to assess the extent of its capabilities. Although most current smartwatches offer accelerometer, magnetometer, and gyroscope data, the Apple Watch (Apple Inc.) stood out as the only readily available smartwatch that also provided orientation data (roll, pitch, and yaw), making it the chosen device for our study. 
In the field of human activity recognition from wearable sensor data, previous research has primarily focused on the classification of general human activities like walking, running, and swimming. End-to-end deep learning-based techniques are now widely used, as they can simultaneously learn feature representation and classification using supervised training, eliminating the need for manual feature crafting. Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks have been extensively used for these tasks. CNNs can extract spatial information, whereas LSTM networks are well suited to modeling temporal dependencies. Specifically, the combination of CNNs with recurrent networks (DeepConvLSTM) has shown notable performance.29 
In recent years, the application of transformer models in multivariate time series classification tasks, including human activity recognition, has been explored. A transformer network is a type of neural network architecture that has gained significant popularity in the field of natural language processing (NLP). It was introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017 and has since become a cornerstone of many state-of-the-art NLP models.30 In the context of human activity recognition, the self-attention mechanism used by transformers allows them to attend to different elements of the input sequence, enabling them to effectively identify and classify human actions. A self-attention based neural network model that foregoes recurrent architectures showed clear improvements with respect to previous benchmark models (DeepConvLSTM) on four different public datasets.31,32 
Furthermore, unsupervised pretraining techniques have been successfully applied to multivariate time series classification.33 By leveraging large amounts of unlabeled data, these techniques enable the model to learn meaningful representations and features, which can be fine-tuned for specific classification tasks. 
Although previous works have made significant contributions to the field of human activity recognition from wearable sensor data, the specific task of classifying hand-face interactions remains relatively unexplored. Our approach does not introduce new methodologies; rather, it offers a novel application of existing techniques to a specific domain. We demonstrate the applicability and effectiveness of transformer models with self-attention and unsupervised pretraining in the challenging task of classifying specific hand-face interactions with smartwatch sensors. We also propose different ways of collecting data while ensuring the capture of genuine hand-face interactions in real-world scenarios. 
Methods
Figure 1 illustrates the pipeline used to achieve the desired outcome. 
Figure 1. High level pipeline of the eye-rubbing detection system.
Problem Statement
Input
The Apple Watch provides sensor measurements sampled at 50 hertz (Hz). The signals are composed of the 19 features described in Table 1. The data are extracted from both raw and processed device motion.34,35 
Table 1. Features Description
Output
The classes of the classification task are illustrated in Figure 2.
Figure 2. Classes of the classification task.
Real-Time Classification
To enable real-time operation on the Apple Watch, we utilize a sliding window approach, illustrated in Figure 3. This approach divides the continuous stream of sensor data into fixed-size windows. Each window is then processed by the machine learning model, which extracts relevant features and performs activity classification. The window size is set to 3 seconds, with a step size of 0.5 seconds. This configuration allows a classification every 0.5 seconds, using the previous 3 seconds of sensor signals for activity recognition. 
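For illustration, the segmentation step can be sketched as follows; this is a minimal NumPy example assuming the 50 Hz stream is stored as an array of shape (num_samples, 19), and the function and variable names are ours rather than those of the deployed watch application.

```python
import numpy as np

def sliding_windows(stream: np.ndarray,
                    fs: int = 50,
                    window_s: float = 3.0,
                    step_s: float = 0.5) -> np.ndarray:
    """Segment a (num_samples, num_features) sensor stream into
    overlapping windows of window_s seconds, one every step_s seconds."""
    win = int(window_s * fs)    # 150 samples per window
    step = int(step_s * fs)     # a new window every 25 samples
    starts = range(0, len(stream) - win + 1, step)
    return np.stack([stream[s:s + win] for s in starts])

# Example: 60 s of synthetic 19-channel data at 50 Hz -> (115, 150, 19) windows
stream = np.random.randn(60 * 50, 19)
print(sliding_windows(stream).shape)
```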
Figure 3. Sliding window approach for real-time classification.
Attention Based Model
The model presented previously by Mahmud et al. has been adapted and implemented for the purpose of classifying hand-face interactions.31 The resulting architecture, depicted in the left part of Figure 4, incorporates some modifications compared to the original paper. Specifically, the sensor modality attention component has been removed, as our scenario solely relies on data from the Apple Watch sensors. Additionally, we have replaced the simple positional encoding with a learnable positional encoding, which has yielded improved results in our context. 
Figure 4. Left: Attention based model architecture.31 Right: Training setup of the unsupervised pretraining task.33
Input Encoding
The model receives a time window of sensor values as input. A linear layer is applied to transform the sensor features. As described previously by Zerveas et al., each sample X ∈ R^{w×m} (a multivariate time series of length w with m different variables) constitutes a sequence of w feature vectors x_t ∈ R^m: X = [x_1, x_2, ..., x_w]. The original feature vectors x_t are linearly projected onto a d-dimensional vector space, where d is the dimension of the transformer model sequence element representations (typically called the embedding size):  
\begin{equation}{{{\bf u}}_{{\bf t}}} = {{{\bf W}}_{{\bf p}}}{{{\bf x}}_{{\bf t}}} + {{{\bf b}}_{{\bf p}}}\end{equation}
(1)
where W_p ∈ R^{d×m} and b_p ∈ R^d are learnable parameters, and u_t ∈ R^d, t = 0, ..., w, are the input vectors of the transformer encoder. To incorporate positional information, the model utilizes a fully learnable positional encoder. 
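A minimal PyTorch sketch of this input encoding is shown below, assuming w = 150 time steps (3 seconds at 50 Hz), m = 19 sensor features, and an illustrative embedding size d = 64; the module and parameter names are ours.

```python
import torch
import torch.nn as nn

class InputEncoding(nn.Module):
    """Linear projection of the sensor features (Equation 1) plus a fully
    learnable positional encoding."""
    def __init__(self, num_features: int = 19, d_model: int = 64,
                 window_len: int = 150):
        super().__init__()
        self.project = nn.Linear(num_features, d_model)       # W_p, b_p
        self.pos_embedding = nn.Parameter(                    # learnable positions
            torch.zeros(1, window_len, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window_len, num_features) -> (batch, window_len, d_model)
        return self.project(x) + self.pos_embedding

u = InputEncoding()(torch.randn(8, 150, 19))
print(u.shape)  # torch.Size([8, 150, 64])
```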
Transformer Encoder
The resulting representation of the input encoding is then fed into self-attention blocks. Each block has two layers. The first is a multi-head self-attention mechanism, and the second is a simple fully connected feed-forward network, as proposed previously by Vaswani et al.30 A residual connection around each of the two sublayers is applied, followed by batch normalization. Note that batch normalization is used here instead of the layer normalization proposed by Vaswani et al., as batch normalization can mitigate the effect of outlier values in time series, an issue that does not arise in NLP word embeddings.33 The output of the transformer encoder is the final vector representation z_t ∈ R^d for each time step. 
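The sketch below shows one such encoder block, with batch normalization applied over the feature dimension after each residual connection; the layer sizes are placeholders rather than the configurations reported in Table 5.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Transformer encoder block that uses BatchNorm1d in place of LayerNorm."""
    def __init__(self, d_model: int = 64, n_heads: int = 4,
                 d_ff: int = 128, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.BatchNorm1d(d_model)
        self.norm2 = nn.BatchNorm1d(d_model)
        self.drop = nn.Dropout(dropout)

    @staticmethod
    def _bn(norm: nn.BatchNorm1d, x: torch.Tensor) -> torch.Tensor:
        # BatchNorm1d expects (batch, channels, length)
        return norm(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x)                   # multi-head self-attention
        x = self._bn(self.norm1, x + self.drop(a))  # residual + batch norm
        x = self._bn(self.norm2, x + self.drop(self.ff(x)))
        return x

z = EncoderBlock()(torch.randn(8, 150, 64))
print(z.shape)  # torch.Size([8, 150, 64])
```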
Global Temporal Attention
Following the methods presented by Mahmud et al.,31 the representations z_t ∈ R^d generated by the transformer encoder are utilized by a global temporal attention layer. This layer learns parameters to rank each time step according to its importance for predicting the class label of the window. The attention score (ranking) is obtained through Equation 3. The terms W_g, b_g, and g_z are learnable parameters.  
\begin{equation}{{{\bf g}}_{{\bf t}}} = {\rm{tanh}}\left( {{{{\bf W}}_{{\bf g}}}{{{\bf z}}_{{\bf t}}} + {{{\bf b}}_{{\bf g}}}} \right)\end{equation}
(2)
 
\begin{equation}\alpha_t = \frac{\exp\left(\mathbf{g}_t^{\top}\mathbf{g}_z\right)}{\sum_{t'=1}^{w}\exp\left(\mathbf{g}_{t'}^{\top}\mathbf{g}_z\right)}\end{equation}
(3)
 
Then, the weighted average, f ∈ R^d, of the representations of all the time steps is computed in Equation 4.  
\begin{equation}f^{(i)} = \sum\limits_{t = 1}^{w} \alpha_t\, z_t^{(i)} \quad \text{for } i \in \left\{ 1, \ldots, d \right\}\end{equation}
(4)
 
Finally, the resulting representation f ∈ R^d is passed through fully connected and softmax layers to obtain a distribution over classes, and its cross-entropy with the categorical ground-truth labels is the sample loss to minimize. 
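A sketch of this global temporal attention pooling and classification head, following Equations 2 to 4, is given below (dimensions as above; five output classes as in Figure 2).

```python
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Attention pooling over time steps followed by a softmax classifier."""
    def __init__(self, d_model: int = 64, num_classes: int = 5):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)             # W_g, b_g (Equation 2)
        self.context = nn.Parameter(torch.randn(d_model))   # g_z (Equation 3)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, window_len, d_model)
        g = torch.tanh(self.proj(z))                    # Equation 2
        alpha = torch.softmax(g @ self.context, dim=1)  # Equation 3
        f = (alpha.unsqueeze(-1) * z).sum(dim=1)        # Equation 4
        return self.classifier(f)   # logits; train with nn.CrossEntropyLoss

logits = GlobalTemporalAttention()(torch.randn(8, 150, 64))
print(logits.shape)  # torch.Size([8, 5])
```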
Unsupervised Pretraining
Transformer-based models are highly expressive and have a large number of parameters, allowing them to capture intricate patterns in the data. However, this high model complexity and capacity can be problematic when data are limited. With fewer examples to learn from, the model may quickly overfit by memorizing the training samples instead of generalizing well to unseen data. 
In our case, the limited number of labeled sequences used during supervised learning led to overfitting. To address this, Zerveas et al. proposed for the first time a transformer-based framework for unsupervised representation learning of multivariate time series.33 This approach involves pretraining the transformer encoder on unlabeled data to learn meaningful representations, which can then be used for the classification task. By leveraging unsupervised learning, the model can benefit from a larger amount of data and improve generalization performance. 
The right part of Figure 4 shows the training setup of the unsupervised pretraining task. As proposed by Zerveas et al., a proportion r of each variable sequence in the input is masked independently, such that across each variable, time segments of mean length l_m are masked, each followed by an unmasked segment of mean length \({l_u} = \frac{{1 - r}}{r}{l_m}\). Here, l_m = 3 and r = 0.15, as in the article by Zerveas et al. 
A linear layer on top of the final vector representations z_t is used to make an estimate x̂_t of the uncorrupted input vectors x_t:  
\begin{equation}{{{\bf \hat{x}}}_{{\bf t}}} = {{{\bf W}}_{{\bf o}}}{{{\bf z}}_{{\bf t}}} + {{{\bf b}}_{{\bf o}}}\end{equation}
(5)
 
Then, only the predictions on the masked values (with indices in the set M ≡ {(t, i) : m_{t,i} = 0}, where m_{t,i} are the elements of the mask M) are considered in the mean squared error loss for each data sample:  
\begin{equation}L_{\mathrm{MSE}} = \frac{1}{\left| M \right|}\sum\limits_{\left( t,i \right) \in M} \left( \hat{x}_{t,i} - x_{t,i} \right)^2 \end{equation}
(6)
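The masking scheme and the masked reconstruction loss can be sketched as follows; the geometric sampling of segment lengths is our approximation of the procedure of Zerveas et al., and the linear layer at the end stands in for the reconstruction head W_o, b_o.

```python
import numpy as np
import torch
import torch.nn as nn

def geometric_mask(w: int, m: int, r: float = 0.15, lm: float = 3.0) -> torch.Tensor:
    """Per-variable mask (0 = masked): masked segments of mean length lm
    alternate with unmasked segments of mean length lm * (1 - r) / r."""
    lu = lm * (1 - r) / r
    mask = torch.ones(w, m)
    for j in range(m):
        t, masked = 0, np.random.rand() < r          # random starting state
        while t < w:
            length = np.random.geometric(1.0 / (lm if masked else lu))
            if masked:
                mask[t:t + length, j] = 0.0
            t += length
            masked = not masked
    return mask

def masked_mse(x_hat: torch.Tensor, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Equation 6: mean squared error over the masked positions only."""
    masked_positions = mask == 0
    return ((x_hat - x)[masked_positions] ** 2).mean()

# Usage: corrupt one 3-second window, reconstruct it, score only masked values.
x = torch.randn(150, 19)
mask = geometric_mask(150, 19)
x_hat = nn.Linear(19, 19)(x * mask)   # stand-in for W_o z_t + b_o
print(masked_mse(x_hat, x, mask))
```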
 
Data Collection
Automatic Labeling
The data collection process with the automatic labeling setup is illustrated in Figure 5.
Figure 5. Data collection with automatic labeling setup. (NB: These demonstrate how OpenPifPaf works and the illustrations are from preliminary works assessing the accuracy of the automatic labeling setup. In these, the user is not wearing an Apple Watch, but during later stages all users were indeed wearing an Apple Watch and collecting data.)
Data collection was first performed with automatic fine-grained labeling using computer vision software, with OpenPifPaf as the main tool for detecting the user's motion.36 The software is capable of detecting the user's actions and labeling the data from the wearable device accordingly. Participants were asked to complete a 20-minute session composed of 5 sets of 4 minutes each. They were prompted to perform various tasks, the specifics of which were left to the preference of each participant. 
While the computer vision software was labeling, the Apple Watch sensor data were recorded by another application, SensorLog.37 Based on the output of the computer vision software (user id, class, start time [timestamp], and end time [timestamp]) and the output of the SensorLog application (timestamp plus the 19 features from the Apple Watch sensors), each sensor measurement is labeled with the associated user id and class. This data collection resulted in signals (time series of sensor data with variable lengths), each with a label corresponding to one of the classes shown in Figure 6.
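This alignment can be sketched as a simple interval join on timestamps; the column names and export formats below are illustrative assumptions rather than the exact outputs of the two tools.

```python
import pandas as pd

# Hypothetical exports: labeled intervals from the vision software and
# one row per 20 ms sample from SensorLog.
intervals = pd.DataFrame({
    "user_id": [12, 12],
    "label": ["eye_rubbing", "hair_combing"],
    "start": [100.0, 130.0],   # seconds
    "end": [104.5, 137.0],
})
samples = pd.DataFrame({
    "timestamp": [99.98, 101.02, 135.50, 200.00],
    # ... plus the 19 sensor feature columns ...
})

def label_samples(samples: pd.DataFrame, intervals: pd.DataFrame) -> pd.DataFrame:
    """Attach user id and class to every sensor row whose timestamp falls
    inside a labeled interval; rows outside all intervals are dropped."""
    labeled = []
    for _, row in intervals.iterrows():
        hit = samples[(samples["timestamp"] >= row["start"])
                      & (samples["timestamp"] <= row["end"])].copy()
        hit["user_id"] = row["user_id"]
        hit["label"] = row["label"]
        labeled.append(hit)
    return pd.concat(labeled, ignore_index=True)

print(label_samples(samples, intervals))
```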
Figure 6. Classes used with the automatic labeling setup.
The dataset collected with this setup constituted a foundational basis for several analyses. Based on the signal statistics, we set a window size of 3 seconds and a step size of 0.5 seconds to segment the stream of sensor data for real-time prediction. Furthermore, we found that the 10 classes shown in Figure 6 are scarcely distinguishable, even for humans. Training any model on this classification task resulted in poor performance. Therefore, we chose to group the 10 classes into 4 more meaningful categories (i.e., eye rubbing, face touching, hair combing/skin scratching, and teeth brushing), as depicted in Figure 2.
However, because these data were recorded in a static position, in an in vitro setting, the resulting algorithm initially suffered from a high rate of false positive results. To overcome this issue, another set of manually labeled data, gathered directly from a watchOS application in a real-life setting, was added. This presented the added benefit of requiring fewer preprocessing and data cleaning steps than the automatically labeled data. 
Manual Labeling
The data collection process with the manual labeling setup is illustrated in Figure 7.
Figure 7. Data collection with manual labeling setup.
Initially, 10 participants were requested to perform the various hand-face interactions while wearing the Apple Watch, the specifics of which were left to the preference of each participant. The data collection process was as follows: the participant engaged in the data collection process by explicitly selecting an action from the provided list. Once an action was chosen, the participant had a 2-second window to reach the designated start position. Following the 2-second interval, a haptic feedback and a single ring sound signaled the initiation of the action. The participant started the action at this prompt, and the associated sensor data were recorded over the subsequent 3 seconds. Upon completion of the 3-second recording period, two haptic feedbacks with two ring sounds were triggered. The actual duration of the participant's action varied; however, what mattered was that the onset of the action fell within the 3-second window. Throughout the data collection, the participant was encouraged to exhibit a diverse range of start positions and to perform natural movements. 
Evaluation Metric
Macro average F1 score is used as the evaluation metric to compare the performance of the proposed approach with other methods. The F1 score for each class i is computed as follows:  
\begin{equation} \hbox{F1-Score}_{i} = \frac{{2 \times {\rm{Precisio}}{{\rm{n}}_i} \times {\rm{Recal}}{{\rm{l}}_i}}}{{{\rm{Precisio}}{{\rm{n}}_i} + {\rm{Recal}}{{\rm{l}}_i}}}\end{equation}
(7)
 
Then the macro average F1 score is calculated by averaging the statistics for each label:  
\begin{equation} \hbox{Macro F1-Score} = \frac{1}{{\left| C \right|}} \times \sum\limits_{i = 1}^{\left| C \right|} \hbox{F1-Score}_{i} \end{equation}
(8)
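For reference, this metric matches scikit-learn's macro-averaged F1 score; the snippet below is a usage sketch, not the authors' evaluation code.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 3, 4, 4]   # ground-truth class indices
y_pred = [0, 1, 1, 2, 2, 3, 4, 0]   # model predictions
# "macro" computes the F1 score per class and then takes the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))
```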
 
Train-Validation Split
To ensure an unbiased estimation of the model's performance, the sequences should be split in such a way that each sequence in the training set comes from users who do not have sequences in the validation set. This can be achieved by splitting the sequences based on user IDs. In addition, while having overlapping windows in the training samples is not an issue, the validation set should only be populated by non-overlapping sequences. 
As previously discussed, the dataset collected using automatic labeling resulted in a high number of false positive results after deployment on the watch and required heavy preprocessing to be used for supervised training. Therefore, this dataset was used exclusively for unsupervised pretraining. The collected streams of sensor data were segmented using a sliding window approach, with a window size of 3 seconds and a step size of 3 seconds for both the training and validation sets (no overlap). Conversely, the dataset collected using the manual labeling setup was used solely for supervised training. The datasets are split as follows: 
  • Unsupervised pretraining:
    - Train users: 10 to 38
    - Validation users: 40 to 49
  • Supervised training:
    - Train users: 50, 51, 53, 55, 56, 59, and 60
    - Validation users: 52, 54, 57, and 58
The split for the unsupervised pretraining results in 17,238 sequences in the training set and 2774 sequences in the validation set. The split for the supervised training results in 1600 sequences in the training set, with 320 sequences in each of the 5 classes presented in Figure 2, and 400 sequences in the validation set, with 80 sequences per class. 
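A sketch of this user-level split for the supervised set is shown below; the user IDs are taken from the list above, while the DataFrame layout is an illustrative assumption.

```python
import pandas as pd

TRAIN_USERS = {50, 51, 53, 55, 56, 59, 60}
VAL_USERS = {52, 54, 57, 58}

def split_by_user(sequences: pd.DataFrame):
    """Split labeled 3-second sequences so that no user contributes to both sets."""
    assert TRAIN_USERS.isdisjoint(VAL_USERS)
    train = sequences[sequences["user_id"].isin(TRAIN_USERS)]
    val = sequences[sequences["user_id"].isin(VAL_USERS)]
    return train, val

# `sequences` would hold one row per labeled window (user id, label, data reference).
sequences = pd.DataFrame({"user_id": [50, 52, 55, 57],
                          "label": ["eye_rubbing"] * 4})
train, val = split_by_user(sequences)
print(len(train), len(val))  # 2 2
```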
Experiments
Effectiveness of Unsupervised Pretraining
The effectiveness of the unsupervised pretraining method proposed by Zerveas et al.33 is evaluated. We trained four versions of the attention-based model both from scratch and using the unsupervised pretraining framework. We used the Adam optimizer with a cosine warmup scheduler and GELU activation functions. 
For the pretraining of the transformer encoder, each version is pretrained over 500 epochs using a learning rate of 10^−3, with 6000 warmup iterations and a batch size of 128. The full models are then trained for classification over 20 epochs using a learning rate of 5 × 10^−4, with 200 warmup iterations and a batch size of 16. For regularization, a dropout of 10% and a weight decay of 10^−6 have been used. 
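The optimization setup can be sketched as follows; the linear-warmup-then-cosine-decay schedule is our interpretation of the cosine warmup scheduler, the total iteration count is derived from the pretraining set size, and the linear model is only a stand-in.

```python
import math
import torch

def cosine_warmup_lambda(warmup_iters: int, total_iters: int):
    """Linear warmup followed by cosine decay of the learning-rate factor."""
    def factor(iteration: int) -> float:
        if iteration < warmup_iters:
            return (iteration + 1) / warmup_iters
        progress = (iteration - warmup_iters) / max(1, total_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return factor

model = torch.nn.Linear(19, 5)   # stand-in for the full attention-based model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    cosine_warmup_lambda(warmup_iters=6000,
                         total_iters=500 * (17238 // 128)))  # epochs x batches/epoch

# In the training loop, call optimizer.step() and then scheduler.step()
# once per iteration so the learning rate follows the schedule.
```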
Model Comparisons
We initially established a baseline using traditional machine learning techniques, such as K-Nearest Neighbors, Support Vector Machines, and Random Forest. For this purpose, we utilized minimal handcrafted signal features computed in the time domain to adhere to the real-time classification constraint and the computational resources available on the Apple Watch. These features include the minimum, maximum, mean, standard deviation, skewness, and kurtosis computed for each of the 19 sensor channels. The number of neighbors in K-Nearest Neighbors is set to 5. In the case of Random Forest, the number of trees is set to 141, and the maximum depth is set to 16. We also compared the performance of conventional CNN and DeepConvLSTM models with our best attention-based model (transformer) version. We implemented the original architecture of DeepConvLSTM presented previously.29 To recall, the DeepConvLSTM architecture consists of four consecutive convolutional layers and two layers of LSTMs. Each convolutional layer is composed of 64 filters, each with a size of 5 × 1. The convolutions are performed across the time steps. The output from the last convolutional layer is then passed through a two-layer LSTM, where each LSTM layer has 128 hidden units. The final output vector is connected to a fully connected layer, and the softmax operation is applied to the resulting output. In the fully connected layer, a dropout rate of 50% is applied. The CNN model is obtained by simply removing the LSTM layers from DeepConvLSTM. The CNN and DeepConvLSTM models are trained using the Adam optimizer with a One-CycleLR scheduler and ReLU activation functions. Both models are trained over 100 epochs using a learning rate of 5 × 10^−4, a batch size of 16, and a weight decay of 10^−6.
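A sketch of this DeepConvLSTM baseline, as described above, is given below; padding and other unspecified details are our choices.

```python
import torch
import torch.nn as nn

class DeepConvLSTM(nn.Module):
    """Four temporal convolutions (64 filters, kernel 5), a 2-layer LSTM with
    128 hidden units, 50% dropout, and a final classification layer."""
    def __init__(self, num_features: int = 19, num_classes: int = 5):
        super().__init__()
        layers, in_ch = [], num_features
        for _ in range(4):
            layers += [nn.Conv1d(in_ch, 64, kernel_size=5), nn.ReLU()]
            in_ch = 64
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTM(64, 128, num_layers=2, batch_first=True)
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window_len, num_features); convolve across time steps
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(h)
        return self.fc(self.drop(out[:, -1]))   # last time step -> class logits

print(DeepConvLSTM()(torch.randn(8, 150, 19)).shape)  # torch.Size([8, 5])
```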
Fine-Tuning
The gestures and movements of each individual are subject to individual variations and are not easily generalizable to an algorithm. This explains the poor results achieved on the validation set (an F1 score of only 0.63). Therefore, we propose a fine-tuning step of the model. 
We collected 200 sequences from a new participant (user 62) using the manual data labeling setup. Out of these, 100 sequences were used to fine-tune the attention-based model that showed the best results on the validation set. The performance of the fine-tuned model was then evaluated on the remaining 100 sequences. When fine-tuning the models, we allow training of all the weights. 
Additionally, it is notable that half of the sequences in the supervised training set originated from user 50. To gain insights into the model's performance specifically on user 50, we collected an additional 500 sequences exclusively from that user. The evaluation was then conducted solely on these newly collected sequences to assess the model's performance in this specific user context. 
Leave-One-Out Cross-Validation
To assess the stability of the attention-based model and its ability to generalize to unseen users, we used a tailored evaluation method that combines leave-one-out cross-validation (LOOCV) with randomized training-validation splits. Specifically, for each iteration, one user's data is held out as the test set, while the data from the remaining users are randomly divided into training (6 users) and validation (4 users) sets. Importantly, data from user 50 are deliberately kept in the training set across all iterations to ensure a balanced data distribution across users in the training phase. We mitigate overfitting through early stopping, guided by the performance on the validation set. The F1 score and the area under the receiver operating characteristic (ROC) curve (AUC) for the eye-rubbing class are computed on the data of the test user (Tables 2, 3).
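This evaluation loop can be sketched as follows; user 50 is always forced into the training split, train_and_evaluate is a placeholder for model training with early stopping, and the candidate held-out user IDs are an assumption based on the manual dataset.

```python
import random

CANDIDATE_USERS = [51, 52, 53, 54, 55, 56, 57, 58, 59, 60]  # assumed held-out candidates
ANCHOR_USER = 50                                            # always kept in training

def loocv(train_and_evaluate, seed: int = 0) -> dict:
    """Leave-one-user-out with a randomized split of the remaining users."""
    rng = random.Random(seed)
    scores = {}
    for test_user in CANDIDATE_USERS:
        remaining = [u for u in CANDIDATE_USERS if u != test_user]
        rng.shuffle(remaining)
        train_users = [ANCHOR_USER] + remaining[:5]   # 6 training users incl. user 50
        val_users = remaining[5:]                     # 4 users for early stopping
        scores[test_user] = train_and_evaluate(train_users, val_users, test_user)
    return scores

# Example with a dummy evaluation function returning a constant F1 score.
print(loocv(lambda train, val, test: 0.0))
```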
Table 2. Users Split of LOOCV
Table 3. Comparative F1/AUC Scores of Models Across LOOCV Splits
Results
Datasets
Table 4 summarizes the statistics for each collected dataset. The automatic labeling setup resulted in signals of variable length. For those signals, we provide statistics of the raw collected signals per user, presented as interactive plots online.38 
Table 4. Statistics of the Collected Datasets
The resulting dataset collected with the manual labeling setup comprises sequences from 10 users (users 50 to 60). Each user contributes a total of 100 sequences, except for user 50, who contributes 1000 sequences. Each sequence is a signal of 3 seconds of sensor data. In each user's dataset, an equal number of sequences is allocated for the five classes depicted in Figure 2. For classes that include subclasses (e.g., face touching and hair combing/skin scratching), an equal number of sequences is allocated for each subclass. We publicly share our datasets online.39 
Results shown in Table 5 confirm that unsupervised pretraining offers a substantial performance benefit over fully supervised learning, both in terms of classification performance (F1 score) and prediction confidence (cross-entropy loss). 
Table 5. Performances of Attention Based Model Trained From Scratch Versus Using Unsupervised Pretraining for Different Transformer Encoder Configurations
Based on the results presented in Table 6, we confirm that the attention-based model (transformer) outperforms both traditional machine learning and deep learning methods by a significant margin. 
Table 6. Comparison of Model's Performances
Overall Results
Performances of the attention-based model assessed on the supervised validation set are shown in Figure 8. The attention-based model reached an F1 score of 0.63 (Table 6). 
Figure 8. Performances of the attention based model assessed on the validation set (5 individuals, 500 sequences, and 100 sequences per class). Attention based model reached an F1 score of 0.63.
Performances of the fine-tuned attention-based model assessed on 100 sequences from the new participant are shown in Figure 9. The attention-based model reached an F1 score of 0.81. 
Figure 9. Performances of the fine-tuned attention based model assessed on 100 sequences from the new participant, with 20 sequences per class. Attention based model reached an F1 score of 0.81.
Performances of the attention-based model assessed on 500 sequences from user 50 are shown in Figure 10. The attention-based model reached an F1 score of 0.95. 
Figure 10. Performances of the attention based model assessed on 500 sequences from user 50, with 100 sequences per class. Attention based model reached an F1 score of 0.95.
Conclusions
In this work, we first proposed a watchOS app and a data collection procedure that ensured the capture of genuine hand-face interactions in real-world scenarios. Several other studies have managed to report good results using more cumbersome devices or tailored and limited gestures.16–19 The current method improves on these and adds the challenge of achieving similar results solely with a wrist-worn device. Nokas and Kotsilieris proposed an initial approach for eye-rubbing detection using machine learning methods that was encouraging as a proof-of-concept study but had several limitations.40 First, their data were collected from only two participants, with a random split that could mix training and testing data from the same individual, weakening their study's validity. Conversely, our study includes data from 50 participants and ensures a clean separation between training and testing sets, enhancing the reliability and generalizability of our results. Additionally, their use of a custom device, although convenient for in vitro and feasibility studies, restricts scalability and public deployment. In contrast, we utilize off-the-shelf smartwatches, offering a more practical solution. Finally, their reliance on a binary classifier to distinguish eye rubbing from random activities fails to accurately separate somewhat similar actions, such as face skin scratching from eye rubbing, a distinction that is key to limiting false positive results and making their model applicable in a real-world setting. 
Existing models for human activity recognition from sensor readings, whether they are recurrent, convolutional, or hybrid, face challenges in capturing the spatiotemporal context information from the sequences. Although CNNs excel at capturing spatial information, LSTM networks were typically required to capture temporal information. However, the transformer architecture presents numerous advantages over LSTM networks, such as parallel computation, efficient capture of long-range dependencies with attention mechanisms, and mitigation of sequential bias. Transformers are also memory efficient, scalable to larger sequences, and offer interpretability. In this context, Mahmud et al. came up with a self-attention based neural network model that foregoes recurrent architectures and utilizes different types of attention mechanisms to generate higher dimensional feature representations used for classification.31 They performed extensive experiments on four popular publicly available datasets (PAMAP2, Opportunity, Skoda, and USC-HAD) and achieved significant performance improvements over recent state-of-the-art models. Moreover, Zerveas et al. demonstrated the successful application of unsupervised pretraining techniques in the realm of multivariate time series classification.33 By utilizing extensive unlabeled data, these techniques empower the model to acquire meaningful representations and features that can be further refined for specific classification tasks. 
In our study, we demonstrated the substantial superiority of the attention-based model (transformer) over CNN and DeepConvLSTM in accurately classifying specific hand-face interactions using smartwatch sensors. These findings provide strong evidence for the applicability and effectiveness of transformer models with self-attention in tackling this challenging task. Additionally, our study confirmed that unsupervised pretraining yielded substantial performance improvements compared to fully supervised learning for this particular task. 
The strength of this study is that it achieved promising results in predicting and differentiating among hand-face interactions, and in detecting eye rubbing in a real-life scenario, within the self-imposed challenges and technological restraints of a solely wrist-worn device. Furthermore, the current algorithm can be further improved by a proposed fine-tuning step based on either 100 sequences (approximately 20 minutes of data collection) or 1000 sequences (approximately 3 hours) provided by the user. Currently, this step is not automated but could be further improved and facilitated in the near future by constant feedback and fine-tuning from the device while being worn day after day. 
In the current state, the trained model enables the detection of eye rubbing at 64%, which increases to 80% and 97% with 100 and 1000 sequences, respectively. These are commendable results, demonstrating the feasibility of the project, especially with the self-imposed technological restraints. However, further improvement is still required in terms of data collection, algorithm optimization, and, most importantly, model performance assessment, as the fine-tuning has been tested on only one user so far. The current need for a 3-hour-long fine-tuning step per user to achieve good results is the main limitation for achieving high accuracy; however, it may become negligible once the algorithm is constantly active and worn and can benefit from constant feedback and real-time data collection. Additionally, the specifics of each hand gesture were not clearly defined. The participants training the first algorithm using the computer vision software (OpenPifPaf) were only required to perform the prompted gesture, without clear definitions, and the specifics were left to each participant. The same can be said about the unsupervised algorithm. Although different categories for “eye touching” and “eye rubbing” were defined, it is likely that the algorithm performs differently with various eye-rubbing methods (phalanx only, index only, full hand, etc.).5 Paradoxically, it is possible that the loose definition of “eye rubbing,” mimicking the different ways eye rubbing exists and may be performed by various subpopulations and in different settings, helped in obtaining a better training model for the algorithm's goal of real-life eye-rubbing detection. Further studies regarding these variations are nevertheless warranted. 
Acknowledgments
Disclosure: S. Elahi, None; T. Mery, None; C. Panthier, None; A. Saad, None; D. Gatinel, None; A. Alahi, None 
References
Mazharian A, Flamant R, Elahi S, Panthier C, Rampat R, Gatinel D. Medium to long term follow up study of the efficacy of cessation of eye-rubbing to halt progression of keratoconus. Front Med (Lausanne). 2023; 10: 1152266. [CrossRef] [PubMed]
Torres-Netto EA, Abdshahzadeh H, Abrishamchi R, et al. The impact of repetitive and prolonged eye rubbing on corneal biomechanics. J Refract Surg. 2022; 38(9): 610–616. [CrossRef] [PubMed]
de Azevedo Magalhaes O, Dodds P. Ex vivo eye rubbing evidence. J Refract Surg. 2022; 38(11): 752. [CrossRef] [PubMed]
Li X, Wei A, Yang Y, Hong J, Xu J. Effect of eye rubbing on corneal biomechanical properties in myopia and emmetropia. Front Bioeng Biotechnol. 2023; 11: 1168503. [CrossRef] [PubMed]
Jaskiewicz K, Maleszka-Kurpiel M, Michalski A, Ploski R, Rydzanicz M, Gajecka M. Non-allergic eye rubbing is a major behavioral risk factor for keratoconus. PLoS One. 2023; 18(4): e0284454. [CrossRef] [PubMed]
Kwok YLA, Gralton J, McLaws ML. Face touching: a frequent habit that has implications for hand hygiene. Am J Infect Control. 2015; 43(2): 112–114. [CrossRef] [PubMed]
Chen X, Li Y. Bootstrapping user-defined body tapping recognition with offline-learned probabilistic representation. In: Proceedings of the 29th Annual Symposium on User Interface Software and Technology. UIST ’16. New York, NY: Association for Computing Machinery; 2016: 359–364.
Chen X, Marquardt N, Tang A, Boring S, Greenberg S. Extending a mobile device's interaction space through body-centric interaction. In: Proceedings of the 14th International Conference on Human-Computer Interaction with Mobile Devices and Services. Mobile HCI ’12. New York, NY: Association for Computing Machinery; 2012: 151–160.
Vechev V, Dancu A, Perrault S, Roy Q, Fjeld M, Zhao S. Movespace: on-body athletic interaction for running and cycling. In: Proceedings of the 14th International Conference on Human-Computer Interaction with Mobile Devices and Services. Mobile HCI ’12. New York, NY: Association for Computing Machinery; 2018: 1–9.
Sales Dias M, Gibet S, Wanderley M, Bastos R. Gesture-Based Human-Computer Interaction and Simulation, Proceedings of Gesture Workshop 2007. Vol. 5085; Lecture Notes in Computer Science; 2009.
Chen X “Anthony”, Schwarz J, Harrison C, Mankoff J, Hudson S. Around-body interaction: sensing & interaction techniques for proprioception-enhanced input with mobile devices. In: Proceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices & Services. Mobile HCI ’14. New York, NY: Association for Computing Machinery; 2014: 287–290.
Chen X. FaceOff: detecting face touching with a wrist-worn accelerometer. arXiv preprint arXiv:2008.01769 [cs]. Published online August 4, 2020. Available at: http://arxiv.org/abs/2008.01769.
Yang Z, Yu C, Zheng F, Shi Y. ProxiTalk: activate speech input by bringing smartphone to the mouth. Proceedings of the ACM on Interactive, Mobile, Wearable Ubiquitous Technologies. 2019; 3(3): 1–25.
Dong Y, Scisco J, Wilson M, Muth E, Hoover A. Detecting periods of eating during free-living by tracking wrist motion. IEEE J Biomed Health Inform. 2014; 18(4): 1253–1260. [CrossRef] [PubMed]
Son JJ, Clucas JC, White C, et al. Thermal sensors improve wrist-worn position tracking. NPJ Digit Med. 2019; 2(1): 15. [CrossRef] [PubMed]
Harrison C, Tan D, Morris D. Skinput: appropriating the body as an input surface. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY: Association for Computing Machinery; 2010: 453–462.
Zhang Y, Zhou J, Laput G, Harrison C. SkinTrack: using the body as an electrical waveguide for continuous finger tracking on the skin. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. New York, NY: Association for Computing Machinery; 2016: 1491–1503.
Zhang X, Kadimisetty K, Yin K, Ruiz C, Mauk MG, Liu C. Smart ring: a wearable device for hand hygiene compliance monitoring at the point-of-need. Microsyst Technol. 2019; 25(8): 3105–3110. [CrossRef]
Amento B, Hill W, Terveen L. The sound of one hand: a wrist-mounted bio-acoustic fingertip gesture interface. In: CHI ’02 Extended Abstracts on Human Factors in Computing Systems. CHI EA ’02. New York, NY: Association for Computing Machinery; 2002: 724–725.
Weigel M, Lu T, Bailly G, Oulasvirta A, Majidi C, Steimle J. iSkin: flexible, stretchable and visually customizable on-body touch sensors for mobile computing. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. CHI ’15. New York, NY: Association for Computing Machinery; 2015: 2991–3000.
Kao HL (Cindy), Holz C, Roseway A, Calvo A, Schmandt C. DuoSkin: rapidly prototyping on-skin user interfaces using skin-friendly materials. In: Proceedings of the 2016 ACM International Symposium on Wearable Computers. ISWC ’16. New York, NY: Association for Computing Machinery; 2016: 16–23.
Poupyrev I, Gong NW, Fukuhara S, Karagozler ME, Schwesig C, Robinson KE. Project Jacquard: interactive digital textiles at scale. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. New York, NY: Association for Computing Machinery; 2016: 4216–4227.
Ortega-Avila S, Rakova B, Sadi S, Mistry P. Non-invasive optical detection of hand gestures. In: Proceedings of the 6th Augmented Human International Conference. AH ’15. New York, NY: Association for Computing Machinery; 2015: 179–180.
Gong J, Xu Z, Guo Q, et al. WrisText: one-handed text entry on smartwatch using wrist gestures. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. CHI ’18. New York, NY: Association for Computing Machinery; 2018: 1–14.
Zhang Y, Harrison C. Tomo: wearable, low-cost electrical impedance tomography for hand gesture recognition. In: Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology. UIST ’15. New York, NY: Association for Computing Machinery; 2015: 167–173.
Dementyev A, Paradiso JA. WristFlex: low-power gesture input with wrist-worn pressure sensors. In: Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. UIST ’14. New York, NY: Association for Computing Machinery; 2014: 161–166.
Fukui R, Watanabe M, Shimosaka M, Sato T. Hand shape classification with a wrist contour sensor: analyses of feature types, resemblance between subjects, and data variation with pronation angle. Int J Robotics Res. 2014; 33(4): 658–671. [CrossRef]
Fukui R, Watanabe M, Gyota T, Shimosaka M, Sato T. Hand shape classification with a wrist contour sensor: development of a prototype device. In: Proceedings of the 13th International Conference on Ubiquitous Computing. UbiComp ’11. New York, NY: Association for Computing Machinery; 2011: 311–314.
Ordóñez FJ, Roggen D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors. 2016; 16(1): 115. [CrossRef] [PubMed]
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. arXiv Preprint. Published online August 1, 2023, doi:10.48550/arXiv.1706.03762.
Mahmud S, Tonmoy MTH, Bhaumik KK, et al. Human activity recognition from wearable sensor data using self-attention. Published online March 17, 2020. arXiv Preprint, doi:10.48550/arXiv.2003.09018.
Dirgová Luptáková I, Kubovčík M, Pospíchal J. Wearable sensor-based human activity recognition with transformer model. Sensors. 2022; 22(5): 1911. [CrossRef] [PubMed]
Zerveas G, Jayaraman S, Patel D, Bhamidipaty A, Eickhoff C. A transformer-based framework for multivariate time series representation learning. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. KDD ’21. New York, NY: Association for Computing Machinery; 2021: 2114–2124.
Apple Developer. Getting raw accelerometer events. Apple Developer Documentation, https://developer.apple.com/documentation/coremotion/getting_raw_accelerometer_events.
Apple Developer. Getting processed device-motion data. Apple Developer Documentation, https://developer.apple.com/documentation/coremotion/getting_processed_device-motion_data.
Github Inc. openpifpaf/openpifpaf, https://github.com/openpifpaf/openpifpaf.
Mery T . TemryL/HFI_DataVisualization, https://github.com/TemryL/HFI_DataVisualization.
Nokas G, Kotsilieris T. Preventing keratoconus through eye rubbing activity detection: a machine learning approach. Electronics. 2023; 12(4): 1028. [CrossRef]