The proliferation of wearable computing presents a unique challenge in human-computer interaction, as we sacrifice display size for omnipresence.
The ambient nature of wearables affords interfaces that don't depend on a touch display for I/O - the Apple Watch's Haptic Engine, for example, can relay turn-by-turn directions through haptic feedback alone, minimizing the demand placed on the user's attention.
But while much work is being done to mitigate the input constraints of the smaller canvas[1], comparatively little is being done to leverage the input modalities afforded by the sensors beneath the canvas.
The Apple Watch (and its kin) offers two sensors in particular - the accelerometer and the gyroscope - which capture motion. And we believe these sensors provide the foundation for an extremely expressive input modality: the hand.
Recent efforts have found success using SVM classifiers with fixed-window sizes on both high-frequency IMUs (4000Hz)[2] and electrical impedance tomography (EIT)[3]. Both approaches yield impressive resolution for the resulting classifier, but neither reflects in-production technology. The modified kernel required to boost IMU sample rates to 4kHz would likely consume too much power for an always-on context, and the 16- or 32-electrode array needed for accurate EIT would require entirely new hardware.
These high-resolution techniques are necessary, in part, due to the weaknesses of the SVM as a classifier, and we believe the advent of the LSTM[4] may allow for accurate low-resolution (100Hz) inertial gesture recognition. Unlike an SVM, an LSTM can learn gesture representations through time; it may therefore be less sensitive to noise in low-resolution sensor readings and learn more robust temporal patterns.
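The contrast can be made concrete. A fixed-window SVM typically consumes summary statistics that collapse the time axis into a single feature vector, discarding temporal ordering entirely - the very structure an LSTM can exploit. A minimal sketch (the helper name and the mean/std feature choice are ours, purely for illustration):

```python
import numpy as np

def fixed_window_features(window):
    """Collapse a (T, 6) IMU window into per-channel summary statistics,
    the kind of fixed-size feature vector a windowed SVM consumes.
    The time axis - and any temporal ordering - is discarded."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0)])

window = np.random.randn(200, 6)   # 2s of 100Hz, 6-channel IMU data
features = fixed_window_features(window)
print(features.shape)  # (12,)

# Playing the gesture backwards yields identical features:
print(np.allclose(fixed_window_features(window[::-1]), features))  # True
```

The reversed-window check makes the weakness explicit: to such a classifier, a gesture and its time-reversal are indistinguishable.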
We focused on learning three action-oriented gestures - snaps, claps, and waves.
These gestures are both fast and familiar, making them a natural interface for wearables. And, in particular, they're interfaces for performing discrete actions. Although gesture-based interfaces for manipulating data make for slick sci-fi[5], we don't find them compelling as an interface; they would likely be less precise and require far more physical effort than touch-based interfaces.
We collected our own labeled gesture data via PowerSense[5]. Each of our three participants performed 100 iterations of each gesture, alongside 100 iterations of an ambient gesture representing the neutral state. Collectively, this left us with 1200 samples in total - 300 examples per gesture and 300 neutral examples.
The accelerometer and gyroscope samples were collected at 100Hz, and truncated to two seconds in length, resulting in 200 6-dimensional vectors per labeled example. A representative input for each gesture is visualized below.
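The windowing above can be sketched in a few lines of numpy; the `to_fixed_window` helper is hypothetical, but the shapes match the pipeline described: 100Hz over two seconds across six channels gives a (200, 6) array per example.

```python
import numpy as np

SAMPLE_RATE_HZ = 100   # accelerometer + gyroscope sampling rate
WINDOW_SECONDS = 2     # each labeled example truncated to two seconds
N_CHANNELS = 6         # 3-axis accelerometer + 3-axis gyroscope

def to_fixed_window(recording):
    """Truncate (or zero-pad) a raw (T, 6) IMU recording to a fixed
    (200, 6) window: 200 timesteps at 100Hz over 6 channels."""
    n_steps = SAMPLE_RATE_HZ * WINDOW_SECONDS   # 200
    window = np.zeros((n_steps, N_CHANNELS))
    length = min(len(recording), n_steps)
    window[:length] = recording[:length]
    return window

# A fake 2.5-second recording (250 timesteps) is truncated to 200.
raw = np.random.randn(250, N_CHANNELS)
x = to_fixed_window(raw)
print(x.shape)  # (200, 6)
```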
We trained both a baseline feed-forward network and an LSTM network, each with softmax outputs, to test for recurrent structure in the IMU sequences.
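To illustrate what "learning through time" means here, a single step of a generic LSTM cell can be written in plain numpy. This is a textbook cell with random placeholder weights, not our trained model - it just shows how the gates carry state across the 200-step window instead of flattening it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W: (4H, D) input weights, U: (4H, H)
    recurrent weights, b: (4H,) bias, with the four gates stacked."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])       # input gate
    f = sigmoid(z[H:2*H])     # forget gate
    o = sigmoid(z[2*H:3*H])   # output gate
    g = np.tanh(z[3*H:4*H])   # candidate cell state
    c_t = f * c_prev + i * g  # cell state accumulates evidence over time
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Run a random (200, 6) window through a 16-unit cell.
rng = np.random.default_rng(0)
D, H, T = 6, 16, 200
W = rng.normal(size=(4*H, D)) * 0.1
U = rng.normal(size=(4*H, H)) * 0.1
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)  # (16,) - the final hidden state fed to the softmax
```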
The corresponding test accuracies (varied across network depth) are visualized below. Each experiment was trained, validated, and tested on a 45-10-45 split of the original dataset.
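A minimal sketch of the 45-10-45 split over the 1200 samples (the function name and seeding are ours):

```python
import numpy as np

def split_45_10_45(n_samples, seed=0):
    """Shuffle indices and split 45% train / 10% validation / 45% test."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(0.45 * n_samples)
    n_val = int(0.10 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_45_10_45(1200)
print(len(train), len(val), len(test))  # 540 120 540
```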
Interestingly, the baseline feedforward network outperformed the LSTM at 8, 16, and 32 hidden units; but its performance then saturated (~75%), while LSTM performance continued to increase.
With 128 hidden units, the LSTM drastically outperformed our baseline feedforward network, achieving a 90.37% accuracy on the test set. And it did so with half the parameters of the baseline model (~70k vs ~150k). Altogether, these results suggest a deep recurrent structure embedded in the IMU gesture data.
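As a sanity check on the quoted parameter count: assuming a single-layer LSTM over the 6-channel input with four output classes (one per gesture, plus neutral), the standard parameter formula lands right at ~70k.

```python
def lstm_param_count(input_dim, hidden, n_classes):
    """Parameters of a single-layer LSTM: four gates, each with input
    weights, recurrent weights, and a bias, plus a softmax output layer."""
    gates = 4 * (hidden * input_dim + hidden * hidden + hidden)
    softmax = hidden * n_classes + n_classes
    return gates + softmax

print(lstm_param_count(6, 128, 4))  # 69636, i.e. ~70k
```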
Although these initial results are encouraging, it's likely that the learned gesture representations are not truly robust; with only three individuals in the dataset, the LSTM may have overfit to the temporal idiosyncrasies of each individual, rather than learning the underlying, person-invariant signals.
Scaling up these results successfully would require more data and more variance within that data, which in turn would require a more significant investment in our data collection infrastructure. This became the focus for Flex[6].