Using a Microphone to Measure Lung Function on a Mobile Phone (3)

ALGORITHM AND THEORY OF OPERATION  

Our data collection resulted in a dataset of digitized audio samples from a smartphone microphone. These audio samples are uncalibrated, AC-coupled measures of pressure, p(t). However, we want to convert them into measures of airflow at the lips, ulips(t). Our main goals, then, are to (1) compensate for pressure losses as the sound travels from mouth to microphone, (2) convert the pressure values to an approximation of flow, and (3) remove the effects of AC-coupling. Pressure losses can be approximated using an inverse model of the sound reverberation around the head. Turbulent airflow, as it passes through a fixed opening (i.e., the mouth), has a characteristic pressure drop that we can use to convert pressure into flow. Lastly, we use signal power, frequency characteristics, and models of the vocal tract to remove the effects of AC-coupling and refine the measurement.

Finally, we use regression to combine these approximations and remove non-linearity. Our methodology is broken into two block diagrams: compensation and feature extraction (Figure 4), and machine learning regression (Figure 5).

Distance and Flow Compensation

The first stage in the processing pipeline (Figure 4) is inverse radiation modeling, which compensates for pressure losses sustained over the distance from mouth to microphone and for reverberation and reflections in and around the subject’s body. The transfer function from the mouth to the microphone is approximated by a spherical baffle in an infinite plane and is given by [10]:

where Darm  is the arm length, Chead  is the head circumference (both approximated from the patient’s height), and c is the speed of sound. The transfer function inverse is applied by converting it to the time domain, hinv(t), and using FIR filtering with the incoming audio. Once applied, the output is an approximation of the pressure at the lips, plips(t). This pressure value is then converted to a flow rate. For turbulent airflow, the non-linear equation converting pressure drop across the lips to flow rate through the lips is given by (ignoring viscous losses) [10]:

ulips(t) ∝ rlips² · √plips(t)

where  rlips  is the radius of the mouth opening (a constant resistance across frequency). Note that some  scaling  constants from each equation have been removed and the equations are only proportional. This is done because p(t) is not calibrated, so ulips(t)  is only proportional to the actual flow rate. Moreover, it is unclear how well these equations perform when using approximations of Darm, Chead, and rlips and how non-linearity in the electret microphone affects inverse modeling. Therefore, we use each measure p(t), plips(t), and ulips(t)  for feature extraction and let the regression decide which features are most stable.  
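As a sketch of this compensation stage, assuming an already-computed inverse impulse response hinv(t) and a rough mouth-radius constant (the function name, the identity-filter usage, and the 1 cm default radius are illustrative assumptions, not values from the paper):

```python
import numpy as np

def compensate(p, h_inv, r_lips=0.01):
    """Approximate flow at the lips from raw microphone pressure.

    p      : uncalibrated, AC-coupled microphone signal p(t)
    h_inv  : time-domain inverse radiation impulse response hinv(t)
    r_lips : assumed mouth-opening radius in metres (rough guess)

    Returns a signal proportional to ulips(t); the absolute scale is
    meaningless because p(t) is uncalibrated.
    """
    # FIR-filter with the inverse model to estimate pressure at the lips
    p_lips = np.convolve(p, h_inv)[: len(p)]
    # Turbulent orifice flow (viscous losses ignored): u ∝ r² · sqrt(p)
    return (r_lips ** 2) * np.sqrt(np.abs(p_lips))
```

Because every scaling constant has been dropped, only the shape of the returned curve matters; the regression stage is left to decide how useful it is.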

Feature Extraction

At this point, each measure, p(t), plips(t), and ulips(t), is a high-frequency, AC-coupled signal (Figure 4), from which we want to approximate the volumetric flow rate. We achieve this conversion using three transformations of the signals: (1) envelope detection, (2) spectrogram processing, and (3) linear predictive coding (LPC). The envelope of the signal can be assumed to be a reasonable approximation of the flow rate because it is a measure of the overall signal power (or amplitude) at low frequency. In the frequency domain, resonances can be assumed to be excited by reflections in the vocal tract and mouth opening, and therefore should be proportional to the flow rate that causes them. Finally, we can use linear prediction as a flow approximation. Linear prediction assumes that a signal can be divided into a source and a shaping filter, and it estimates the source power and shaping filter coefficients. The “filter” in our case is an approximation of the vocal tract [38].

The “source variance” is an estimate of the white-noise process exciting the vocal tract filter; in our case, this is an approximation of the power of the flow rate from the lungs. The implementation of each stage is explained below.

Envelope Detection:  The time-domain envelope is extracted using the Hilbert envelope. The Hilbert transform of the signal is combined with the original signal to form the analytic signal, and the magnitude of this analytic signal is low-pass filtered to extract the envelope (an example envelope is shown in Figure 4, callout). Each signal is then down-sampled to the same sampling rate as the spectrogram and linear prediction models.
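A minimal sketch of this envelope detector, using SciPy's analytic-signal implementation (the 50 Hz cutoff, filter order, and decimation factor are assumed values, not specified in the paper):

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def hilbert_envelope(x, fs, cutoff_hz=50.0, down=10):
    """Hilbert envelope of x, low-pass filtered and down-sampled.

    The magnitude of the analytic signal (signal + j * Hilbert transform)
    gives the instantaneous amplitude; low-pass filtering smooths it and
    down-sampling matches the frame rate of the other features.
    """
    env = np.abs(hilbert(x))                  # analytic-signal magnitude
    b, a = butter(4, cutoff_hz / (fs / 2.0))  # 4th-order low-pass
    env = filtfilt(b, a, env)                 # zero-phase smoothing
    return env[::down]                        # simple down-sampling
```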

Spectrogram Processing:  During the forced exhalation, the audio from the phone is buffered into 30ms frames (with 50% overlap between frames). Most tests last from 4 to 7 seconds, resulting in 250–500 frames per exhalation. Each frame is then windowed using a Hamming window and the |FFT|dB is taken to produce the magnitude spectrogram of the signal. We extract the resonances using local maxima in each FFT frame, calculated over a sliding frequency window (callout in Figure 4). Any maximum greater than 20% of the global maximum is saved. After all frames have been processed, in order to preserve only large and relatively long resonances, any resonance lasting less than 300ms is discarded as noise. Finally, the average resonance magnitude in each frame is calculated and saved (callout in Figure 4).
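The framing and peak-picking steps can be sketched as follows (the three-point local-maximum test and the threshold on linear magnitude are assumptions about details the text leaves open, and the cross-frame 300ms duration tracking is omitted for brevity):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=30, overlap=0.5):
    """Split x into frame_ms windows with the given fractional overlap."""
    n = int(fs * frame_ms / 1000)
    hop = int(n * (1 - overlap))
    return np.array([x[i:i + n] for i in range(0, len(x) - n + 1, hop)])

def resonance_peaks(frames, rel_threshold=0.2):
    """Per-frame |FFT| local maxima above rel_threshold of the global max.

    Returns a list of (frame_index, fft_bin) peak locations.
    """
    window = np.hamming(frames.shape[1])
    mags = np.abs(np.fft.rfft(frames * window, axis=1))
    floor = rel_threshold * mags.max()
    peaks = []
    for f, m in enumerate(mags):
        # three-point local-maximum test on interior bins
        idx = np.where((m[1:-1] > m[:-2]) & (m[1:-1] > m[2:]))[0] + 1
        peaks.extend((f, b) for b in idx if m[b] > floor)
    return peaks
```

The 20% threshold rejects window sidelobes (a Hamming window's sidelobes sit around -42 dB, well under 20% of the peak), so only genuine resonances survive.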

Linear Prediction Processing:  The audio signal is again windowed into 30ms overlapping frames. For each frame, several LPC models are fit, with filter orders of 2, 4, 8, 16, and 32 (increasing vocal tract complexity). The approximated “source power” that excites the filter is saved for each frame as an approximation of the flow rate. Examples of the LPC output computed from p(t), plips(t), and ulips(t) are shown at the bottom of Figure 4.
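A self-contained sketch of the per-frame source-power estimate, using the autocorrelation method with the Levinson-Durbin recursion (the paper does not say which LPC implementation it uses; this is one standard choice):

```python
import numpy as np

def lpc_source_power(frame, order):
    """Prediction-error variance ('source power') of an LPC model.

    Fits an all-pole model of the given order to the frame via the
    autocorrelation method and returns the residual (excitation) power,
    which serves as the per-frame flow-rate approximation.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # biased autocorrelation, lags 0..order
    r = np.correlate(x, x, mode="full")[n - 1 : n + order] / n
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):            # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err                       # reflection coefficient
        a[1:i] += k * a[i - 1 : 0 : -1]      # update predictor in place
        a[i] = k
        err *= 1.0 - k * k                   # shrink prediction error
    return err
```

For a strongly autocorrelated signal the residual power is much smaller than the raw signal power, which is exactly why it isolates the source from the vocal-tract shaping.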

Post Feature Processing:  Once the approximated flow rates are returned, they are denoised using a Savitzky-Golay polynomial filter of order 3 and size 11 [33]. This operation fits a 3rd-order polynomial inside a moving window and is robust to many types of noise while keeping the relative shape of the most prominent signal intact. Both the filtered and unfiltered signals are fed as features to the subsequent regression stage.
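With SciPy this step is nearly a one-liner, since scipy.signal.savgol_filter implements the Savitzky-Golay filter directly (order 3 and window 11 as stated above; the function name is ours):

```python
import numpy as np
from scipy.signal import savgol_filter

def denoise_flow(flow):
    """Savitzky-Golay smoothing: fit a 3rd-order polynomial inside a
    moving 11-sample window. Preserves the shape of prominent peaks
    far better than a plain moving average would."""
    return savgol_filter(flow, window_length=11, polyorder=3)
```

A useful property: any signal that is itself a polynomial of degree 3 or less passes through the filter unchanged, which is why the prominent shape of the flow curve survives.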
