A fundamental goal in wearable computing is contextual awareness: devices which could, through an array of ambient sensors, understand our behavior, and interfaces which convey or adapt to this understanding.
But our individual behavior is extremely complex; a more useful approximation, we believe, exists in recognizing our activity, a function of our motion. A particular behavior can then be implicitly defined as a pattern in our activity - or, an emergent property of a particular sequence of activities.
And while in-production wearables are already quite good at monitoring our activity, these techniques are still manually conditioned on the activity type, and we believe that recent advancements in speech recognition have provided a foundation for the next step forward: end-to-end activity transcription.
Our previous approach was successful in classifying short bursts of activity (i.e. gestures), but the underlying model doesn't scale to more complex, holistic activities. At 100Hz, a two-minute-long activity would produce over 10,000 measurements, and LSTMs - though conceived to solve this very issue - are prone to vanishing gradients for sequences of this length.
Still, we'd like to retain some recurrent architecture, as the activity sequences are variable-length and highly contextual (i.e. the classification of an individual activity is conditionally dependent on the classification of any preceding or subsequent activities). So, we'll need to downsample these sequences.
The naïve approaches to doing so (picking every k-th sample, averaging every k samples, etc.) would effectively throw away the high resolution offered by a 100Hz sampling rate, and would require another hyperparameter (k) to tune. Instead, we reduce sequence dimensionality through a 1D convolution over the sequence length. This results in a more expressive model - one which can learn a low-dimensional projection of the original sequence, and retain only the most useful information found in the high-resolution samples.
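As a concrete sketch (in NumPy rather than the actual TensorFlow graph), a strided 1D convolution collapses a two-minute, 100Hz inertial sequence into a much shorter sequence of learned features; the filter width, stride, and channel counts below are illustrative, not our tuned values:

```python
import numpy as np

def conv1d_downsample(x, filters, stride):
    """x: (seq_len, in_ch); filters: (width, in_ch, out_ch) -> (new_len, out_ch)."""
    width, in_ch, out_ch = filters.shape
    new_len = (x.shape[0] - width) // stride + 1
    out = np.empty((new_len, out_ch))
    for t in range(new_len):
        # Each output step is a learned projection of a high-resolution window.
        window = x[t * stride : t * stride + width]  # (width, in_ch)
        out[t] = np.tensordot(window, filters, axes=([0, 1], [0, 1]))
    return out

# Two minutes of (hypothetical) 6-axis inertial data at 100Hz -> 12,000 timesteps.
x = np.random.randn(12000, 6)
w = np.random.randn(50, 6, 64) * 0.01
features = conv1d_downsample(x, w, stride=25)
print(features.shape)  # (479, 64)
```

Unlike picking or averaging every k-th sample, the projection weights here are learned, so the network decides which high-resolution detail survives the downsampling.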
The convolved outputs are then fed into a bi-directional LSTM, resulting in a hybrid Long-Term Recurrent Convolutional Network (LRCN).
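A minimal Keras sketch of this hybrid, assuming a 6-axis inertial input: the stride and class count are placeholders, while the filter and unit counts mirror the tuned values we arrived at later.

```python
import tensorflow as tf

NUM_CLASSES = 5  # illustrative: 4 exercises + a CTC blank

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 6)),  # variable-length 6-axis inertial sequences
    # 1D convolution over the sequence length: downsamples and learns features.
    tf.keras.layers.Conv1D(64, 50, strides=25, activation="relu"),
    # Bi-directional LSTM over the (much shorter) convolved sequence.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    # Per-timestep logits, to be consumed by a sequence-level loss.
    tf.keras.layers.Dense(NUM_CLASSES),
])

print(model.output_shape)  # (None, None, 5)
```

The `return_sequences=True` flag is what keeps a prediction at every (downsampled) timestep, which the sequence-level training objective below depends on.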
We could then train this network on individual activities with a conventional Softmax-CE loss, as in Momentum, but this approach would completely ignore the contextuality of sequential activities; conversely, training with Softmax-CE on sequences of activity would require sequence alignment, which is burdensome (and typically error-prone) to perform manually. Instead, we infer sequence alignment through Connectionist Temporal Classification (CTC) loss, which defines a distribution over all possible output sequences given the individual (per-timestep) softmax distributions.
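The mechanics can be sketched with TensorFlow's built-in CTC loss; the shapes, label values, and blank index here are illustrative. CTC marginalizes over every frame-level alignment consistent with each label sequence, so no manual alignment is needed:

```python
import tensorflow as tf

batch, frames, num_classes = 2, 100, 5  # class 0 reserved as the CTC blank

# Per-timestep logits (time-major), e.g. the output of the LRCN.
logits = tf.random.normal([frames, batch, num_classes])
# Unaligned label sequences - which activities occurred, in order.
labels = tf.constant([[1, 1, 2], [3, 3, 3]], dtype=tf.int32)

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=tf.constant([3, 3]),
    logit_length=tf.constant([frames, frames]),
    logits_time_major=True,
    blank_index=0,
)
print(loss.shape)  # (2,)
```

The returned per-example losses are negative log-likelihoods summed over all valid alignments, which is exactly what lets the network learn where each activity starts and ends on its own.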
A CTC objective requires labeled sequences of activity, and we chose to focus on an extension of the exercise monitoring domain: strength training. The fundamental unit of strength training is a repetition (as opposed to a calorie or a kilometer), which lends itself uniquely to transcription - monitoring repetitions requires recognizing sequential activities.
We targeted four strength training exercises in particular - the squat, bench press, curl, and power clean - which capture a diverse range of cyclical motion: the squat and the bench press are axial in their translation, the curl is axial in both its translation and rotation, and the power clean is non-axial (i.e. a repetition doesn't monotonically oscillate between boundaries).
We then built out a robust data-collection pipeline, targeting the Apple Watch (the most widespread in-production wearable at the time). The pipeline consisted of three components: a watchOS app to collect and label exercise data, an iOS app to archive and transmit collected data, and a backend server to receive, combine, and relay participant data directly into Tensorflow.
Participants are automatically notified to record a workout when they walk into a supported gym.
Once a participant starts a workout, our watchOS app remains on-display and in-memory. Participants can then record sessions for each of the pre-defined exercises.
After selecting an exercise, participants move into a baseline position and wait for a haptic signal.
At the signal, our watchOS app then continuously records CoreMotion data as participants perform the exercise.
After tapping finish, participants can label performed repetitions by turning the digital crown.
The labeled exercise data is then temporarily saved on-disk and indexed for later synchronization.
When the workout is complete, participants can sync performed exercises to the iOS counterpart.
Our iOS app archives and uploads workouts to our database, which is then dynamically relayed into Tensorflow.
This infrastructure allowed us to reliably conduct a three-week exercise study with six participants in total, resulting in over two million inertial measurements in 2,300 labeled sets.
We split this dataset by participant, holding a single participant out for both validation and test. We trained exclusively on the remaining five, providing a more accurate measure of generalization accuracy (or, encouraging the LRCN to learn participant-invariant representations instead of overfitting to participant-specific noise).
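The split itself is simple but worth making explicit; the participant IDs and record format here are hypothetical:

```python
def split_by_participant(examples, held_out):
    """Partition labeled sets so no held-out motion appears in training."""
    train = [ex for ex in examples if ex["participant"] != held_out]
    held = [ex for ex in examples if ex["participant"] == held_out]
    return train, held

# Toy dataset: 6 participants, 3 labeled sets each.
examples = [{"participant": p, "set": i} for p in "ABCDEF" for i in range(3)]
train, held = split_by_participant(examples, held_out="F")
print(len(train), len(held))  # 15 3
```

A random shuffle-split would leak each participant's motion signature across the boundary; partitioning by participant is what makes the reported accuracy a genuine generalization estimate.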
The model was then tuned against the dev-set through stochastic iteration: hyperparameters were sampled uniformly, and the underlying distribution was repeatedly centered around values with the strongest set transcription accuracy (the fraction of dev-set examples with a correct prediction for both the exercise type and the number of repetitions). When the search space was sufficiently small, we performed an exhaustive sweep.
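The search loop can be sketched in a few lines; the objective function and parameter range below are stand-ins for the real dev-set evaluation, here searching over log10(learning rate) only:

```python
import random

random.seed(0)

def dev_accuracy(log_lr):
    # Hypothetical stand-in for the dev-set transcription accuracy,
    # peaking at lr = 1e-3 (log10 = -3).
    return 1.0 - abs(log_lr + 3) / 5

center, spread = -4.0, 2.0  # initial uniform sampling distribution
best_lr, best_acc = None, -1.0
for _ in range(5):
    # Sample uniformly, score each candidate, then re-center the
    # distribution on the strongest value and narrow it.
    samples = [random.uniform(center - spread, center + spread) for _ in range(5)]
    winner = max(samples, key=dev_accuracy)
    if dev_accuracy(winner) > best_acc:
        best_acc, best_lr = dev_accuracy(winner), winner
    center, spread = winner, spread * 0.5

print(f"best log10(lr): {best_lr:.2f}")
```

Once the spread shrinks enough, the remaining grid is small enough to sweep exhaustively, which is the final step described above.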
This resulted in 25 unique experiments, from which we derived a batch size of 16, dropout of 1e-1, a learning rate of 1e-3, 128 BLSTM units, and 64 CNN filters of size 50.
We also trained two baseline BLSTMs of differing complexity (50 vs. 150 hidden units) to validate the LRCN. The corresponding dev-set transcription accuracies (and their respective training times) are visualized below.
An "R" signifies the inclusion of an explicit activity conclusion class, which, in the strength training domain, is a re-racking of the weight.
These bursts of activity looked more like repetitions than epsilon-noise, and exercise-specific re-rack classes allowed the model to express this distinction.
The LRCN outperformed the BLSTM-150 and saw a 6-fold increase in convergence speed, achieving a near-perfect 98.3% test-set accuracy.