Decoding Vibrations From Nearby Keyboards Using Mobile Phone Accelerometers (5)

5. MODELING KEYPRESS EVENTS

In this section, we present our model for identifying keypress events. We then discuss our experimental setup and processing architecture.

5.1 Keypress Event Modeling

Because of the extremely low sampling rates available to accelerometers and our previously demonstrated difficulty in recognizing individual keys, we instead attempt to characterize pairs of keypresses and their relation to each other. Let P_i, P_j denote sequential keypress events. We characterize the relation between two successive events, rel(P_i, P_j), using the following two features:

Horizontal Orientation: The location loc(P_i) of each keypress event relative to a ‘central-line’ dividing the keyboard into left and right partitions.
Distance Between Consecutive Keypresses: For a threshold distance in keys α, we define the distance dist(P_i, P_j) of consecutive keypress events as being either near or far, where near < α and far ≥ α.
We define the relation between consecutive keypress events as rel(P_i, P_j) = loc(P_i) || loc(P_j) || dist(P_i, P_j), where || represents feature concatenation.

A word is thus composed of a series of concatenated event-pairs, referred to as its ‘word-profile’. As an example, assume α = 3 and a central-line marking all keys including and to the left of ‘t’, ‘g’, and ‘b’ on a standard QWERTY keyboard as left, and all keys including and to the right of ‘y’, ‘h’, and ‘n’ as right. The word “canoe” would then be represented as: rel(c, a) || rel(a, n) || rel(n, o) || rel(o, e).

Note that for the five letter word “canoe”, there are four pairs of consecutive keypresses: “ca”, “an”, “no”, and “oe”. Accordingly, all words of length n will be represented as abstract strings of length n − 1 in our system. These abstract words can then be processed for content using dictionaries as discussed below. We note that as there are only two one-letter words in English (i.e., ‘a’ and ‘I’), we can identify such words as a special case in our architecture.
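The word-profile construction above can be computed mechanically. The sketch below assumes approximate QWERTY key coordinates and a Euclidean key distance, neither of which is specified in the text, so the resulting labels are illustrative:

```python
# Sketch: computing a word-profile from the rel() definition in Section 5.1.
# The key coordinates and the Euclidean distance metric are illustrative
# assumptions; the section does not specify the exact distance measure.
import math

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
COORD = {ch: (col, row) for row, keys in enumerate(ROWS) for col, ch in enumerate(keys)}
LEFT = set("qwertasdfgzxcvb")  # keys on or to the left of 't', 'g', and 'b'

def loc(ch):
    """Horizontal orientation relative to the central-line."""
    return "L" if ch in LEFT else "R"

def dist(a, b, alpha=3):
    """Near/far label for a pair of keys under threshold alpha."""
    (x1, y1), (x2, y2) = COORD[a], COORD[b]
    return "N" if math.hypot(x1 - x2, y1 - y2) < alpha else "F"

def word_profile(word, alpha=3):
    """Concatenate rel(P_i, P_j) over consecutive keypress pairs."""
    return ".".join(loc(a) + loc(b) + dist(a, b, alpha)
                    for a, b in zip(word, word[1:]))

print(word_profile("canoe"))
```

With these assumptions, “canoe” yields four three-letter pair labels, matching the n − 1 pair count discussed above.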

5.2 Framework Overview

To convert raw accelerometer data into dictionary words, our framework consists of a two-phase process: a learning phase and an attack phase. Figure 4 provides an overview of this process.

5.2.1 Learning Phase

Before we begin training our learner, our framework defines a preprocessing step needed to build the training data. This step consists of determining the parameters controlling where the keyboard is split for left-right and near-far labeling (α and the central-line). Once these parameters are defined, we build training data and word-profiles from different predefined English dictionaries for use during the learning and attack phases. To avoid over-fitting in the learners, the preprocessing step also ensures that the training data has an equal distribution of features.

After preprocessing, our framework trains the model through data collection, feature extraction, word labeling, and supervised learning processes.

Data Collection: When a key is pressed, the phone’s accelerometer senses the surface vibrations and outputs values of the instantaneous acceleration on the x, y, and z axes. Our ‘Data-Collector’ application records these acceleration values at a variable sampling rate (100 Hz on average) along with the key that was pressed. At the start of the learning phase, we type every letter of the English alphabet (a through z) 150 times each (with no fixed ordering or timing) on the target keyboard. The ‘Data-Collector’ application continuously records the seismic emanations produced by these keypresses, resulting in a raw-acceleration dump for the learning phase, DL, of 3,900 distinct keypress events (26 letters × 150 presses each).
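The learning-phase dump DL can be viewed as a stream of timestamped, key-labeled acceleration samples. A minimal sketch of one such record follows; the field names and CSV layout are our own illustration, not the paper’s actual format:

```python
# Sketch of a labeled accelerometer sample as stored in the learning dump DL.
# Field names and the CSV layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Sample:
    t_ms: int    # capture timestamp in milliseconds
    x: float     # instantaneous acceleration, x axis
    y: float     # instantaneous acceleration, y axis
    z: float     # instantaneous acceleration, z axis
    key: str     # ground-truth key label (available in the learning phase only)

def to_csv_row(s: Sample) -> str:
    return f"{s.t_ms},{s.x:.4f},{s.y:.4f},{s.z:.4f},{s.key}"

row = to_csv_row(Sample(1200, 0.0123, -0.0045, 0.9810, "a"))
print(row)
```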

Feature Extraction: The raw acceleration values collected are then processed to extract the relevant features used to train the model. Due to the accelerometer’s low sampling rate, we were unable to obtain satisfactory classifier performance using only the features gathered in previous emanations work [3]. We therefore use a combination of time-domain, frequency-domain, and cepstral features to construct our ‘feature-vector’. Asonov et al. [3] determined that a keypress is approximately 100 ms long, guiding us to use the same time window for our feature extraction calculations. From the raw acceleration (x, y, z) values stored in DL, we calculate time-domain features including root mean square (rms), skewness, variance, and kurtosis, and combine these with spectral and cepstral features: fast Fourier transform (FFT) coefficients and mel-frequency cepstrum coefficients (MFCCs), respectively. The final feature-vector corresponding to the x, y, z accelerations of a keypress P_i is denoted as: FV(P_i) = <mean, kurtosis, variance, min, max, energy, rms, mfccs, ffts>. At the end of this step, we have a set SF containing 150 feature-vectors (FVs) for each English letter, i.e., SF = {FV(P_i) | ∀ P_i ∈ DL}.

Word Labeling: Once the feature vectors are extracted from the raw accelerometer data, the framework creates a training set by labeling individual key and key-pair samples based on the defined dictionary. To prevent over-training, a preprocessing step is performed which determines the number of samples to take per dictionary character by finding an even distribution of left (L) and right (R) and near (N) and far (F) labels.
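The time-domain portion of the feature-vector named under Feature Extraction can be sketched using the standard moment definitions over one ≈100 ms window (about 10 samples at 100 Hz); the full FV additionally includes the FFT and MFCC components, which are omitted here:

```python
# Sketch: time-domain features for one 100 ms keypress window, using the
# standard moment definitions. The paper's full feature-vector also contains
# FFT and MFCC components, which are not shown in this sketch.
import math

def time_domain_features(window):
    n = len(window)
    mean = sum(window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    std = math.sqrt(var)
    rms = math.sqrt(sum(v * v for v in window) / n)
    # Guard against zero variance to avoid division by zero.
    skew = sum((v - mean) ** 3 for v in window) / (n * std ** 3) if std else 0.0
    kurt = sum((v - mean) ** 4 for v in window) / (n * var ** 2) if var else 0.0
    return {"mean": mean, "min": min(window), "max": max(window),
            "variance": var, "rms": rms, "skewness": skew, "kurtosis": kurt,
            "energy": sum(v * v for v in window)}

# Ten samples ≈ one 100 ms window at the 100 Hz average sampling rate.
fv = time_domain_features([0.1, 0.4, -0.2, 0.9, 0.0, -0.1, 0.3, 0.2, -0.4, 0.5])
```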

Each word in the training dictionary is broken down into its constituent characters and character-pairs. As described earlier, a word of length n letters is broken into n characters and n − 1 character-pairs. For each character, we randomly select 100 feature vectors corresponding to that character from the Feature Extraction step and label each of them as left (L) or right (R) based on the character’s position relative to the central-line. In a similar manner, for each character-pair we construct 100 composite feature vectors, each formed by concatenating a random feature vector corresponding to the first character with a random feature vector corresponding to the second character. Each of these composite feature vectors is labeled near (N) or far (F) based on the key distance relative to the threshold α. Once this process is completed for every character and character-pair in the dictionary, the labeled feature vectors can be used to train the neural networks in the last step.
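The composite-vector construction for character-pairs can be sketched as below; the helper name and the per-character dictionary layout are our own illustration:

```python
# Sketch: building labeled near/far training instances for one character-pair
# by concatenating randomly chosen per-character feature vectors. The helper
# name and the fvs_by_char layout are illustrative assumptions.
import random

def composite_instances(fvs_by_char, pair, label, k=100, seed=0):
    rng = random.Random(seed)
    a, b = pair
    out = []
    for _ in range(k):
        # Concatenate one random FV of the first key with one of the second.
        fv = rng.choice(fvs_by_char[a]) + rng.choice(fvs_by_char[b])
        out.append((fv, label))  # label is "N" (near) or "F" (far)
    return out

# Toy two-dimensional feature vectors standing in for the real FVs.
fvs = {"c": [[0.1, 0.2]] * 5, "a": [[0.3, 0.4]] * 5}
inst = composite_instances(fvs, ("c", "a"), "N")
```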

Supervised Learning: The final step of the learning phase consists of training a model that will be used during the attack phase to classify each key and key-pair accordingly. We built two separate models by training a left-right neural network and a near-far neural network for classifying left-right and near-far feature-vectors respectively. Both neural networks were trained using 500 training cycles and a learning rate of 0.3.
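RapidMiner’s neural network operator trains a multilayer perceptron; as a minimal stand-in illustrating the stated training parameters (500 cycles, learning rate 0.3), the sketch below trains a single logistic unit on a toy left/right dataset:

```python
# Minimal stand-in for one of the two classifiers: a single logistic unit
# trained with the stated parameters (500 cycles, learning rate 0.3).
# RapidMiner actually trains a multilayer perceptron; this sketch only
# illustrates the training loop on a toy separable dataset.
import math

def train(data, cycles=500, lr=0.3):
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(cycles):
        for x, y in data:                   # y is 0 (L) or 1 (R)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation
            g = p - y                       # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Toy feature vectors: the first component separates the two classes.
data = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([-1.0, 0.3], 0), ([-0.8, 0.0], 0)]
model = train(data)
```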

5.2.2 Attack Phase

Our framework defines the attack phase much as it defined the learning phase, attempting to recover words through data collection, feature extraction, key-press classification, and word matching.

Data Collection: In the attack phase, the ‘Data-Collector’ application acts maliciously; the same properties and procedures from the learning phase apply, with the exception that keypresses can no longer be tagged with the keys actually pressed. Under these circumstances we are provided with a raw-acceleration dump for the attack phase, DA, containing all of the keypresses entered by the victim but none of the corresponding key labels.

Feature Extraction: After the raw-acceleration data is collected by the malicious application, the attack phase then calculates the same features that were derived during the learning phase. The feature-vectors are then used to create two sets of data, one for classifying left vs. right and one for classifying near vs. far.
Key-press Classification: Once the data has been processed, our framework attempts to classify each vector. The model produced by the left-right neural network is used to predict the L/R label for each individual key-press, while the model produced by the near-far neural network is used to predict the N/F label for each pair of key-presses. Each word is then assigned its word-profile using the predicted labels and sent to the word matcher.

Word Matching: The word matcher takes each predicted word-profile of length n − 1 and scores it against each dictionary word of length n. Algorithm 1 defines the scoring algorithm. The scored dictionary words are then sorted, and the top two scores are presented as candidate predictions for the given predicted profile.
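Algorithm 1 is not reproduced in this section, so the scoring rule below (the fraction of pair labels on which the predicted profile and a candidate word’s profile agree) is only an illustrative stand-in:

```python
# Sketch of the word matcher. The per-position agreement score is an
# illustrative stand-in for Algorithm 1, which is not shown in this section.
def score(predicted, candidate):
    p, c = predicted.split("."), candidate.split(".")
    if len(p) != len(c):  # only words of matching length compete
        return 0.0
    return sum(a == b for a, b in zip(p, c)) / len(p)

def top_candidates(predicted, profiles_by_word, k=2):
    """Return the k best-scoring dictionary words for a predicted profile."""
    ranked = sorted(profiles_by_word,
                    key=lambda w: score(predicted, profiles_by_word[w]),
                    reverse=True)
    return ranked[:k]

# Word-profiles here are illustrative values, not computed from a keyboard.
dictionary = {"canoe": "LLN.LRF.RRF.RLF", "table": "LLF.LRF.RLF.LLN"}
best = top_candidates("LLN.LRF.RRN.RLF", dictionary)
print(best)
```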

5.3 Experimental Setup

In all of our experiments, we set the target desktop computer on a wooden desk. We then placed the phone used to collect keyboard vibrations beside the target computer’s keyboard. To maintain consistency, we always placed the phone at a distance of two inches from the left-hand side of the keyboard. It is important to note that there is nothing inherently desirable about the orientation of the device used in our experiments; the attack will work in any orientation, provided the device is trained in that orientation. All raw accelerometer readings were collected by the phone and then uploaded to a remote server, which ran the data processing and classification algorithms used to analyze the collected accelerometer readings.
The hardware and software components used to perform the experiments are described below:

1. Keyboard: We used an Apple A1255 wireless Bluetooth keyboard. Bluetooth was used so that we could send the typed characters to the ‘Data-Collector’ application running on the phone while collecting accelerometer readings. We did this to automatically and accurately label the collected data during the Learning phase and to match the classifier’s predicted word-profile to the actual word-profile in the Attack phase.

2. Phone and Accelerometer: We used the Elephone P8 (running firmware no. 03.10.01). The phone contains an ST LIS331DLH accelerometer operating with a sensitivity of 1 mg/digit, a g-range of ±2 g, and an acceleration noise density of 218 µg/√Hz [41]. The maximum accelerometer sampling rate achievable in our experiments was 100 Hz.

3. Signal Processing: For calculating the FFT coefficients, we used MATLAB’s discrete Fourier transform function (fft). The mel-frequency cepstral coefficient values were calculated using the VOICEBOX toolbox [12]. For MFCC calculations, we set the number of channels in the Mel-scale filter bank to 32 and use 3 MFCCs computed for a frame duration of 0.025 s and a frame step of 0.01 s.

4. RapidMiner / Machine learning: We used the RapidMiner (version 5.0.010) data mining application and libraries for training and testing our neural network models and developing our framework. This choice was motivated by RapidMiner’s extensibility and integration in Java-based applications, the core of our framework.http://mobileoneno.bloggles.info/2014/08/25/decoding-vibrations-from-nearby-keyboards-using-mobile-phone-accelerometers-4/