Namuda Gesture Recognition on the Holosound ii2010 platform

by
dr.Godfried-Willem Raes

postdoctoral researcher
Ghent University, Ghent University College & Logos Foundation



2010

This paper describes the software layer for gesture recognition using Doppler-based hardware systems. For a good understanding, readers are advised to first familiarize themselves with the hardware described in "Holosound 2010, a doppler sonar based 3-D gesture measurement system". This paper is part of a triptych, of which the last part will delve further into the expressive meaning of gesture and the artistic aspects of the Namuda dance project developed at the Logos Foundation.

The Holosound hardware is connected to the PC for sampling and further analysis, treated in this preliminary draft of a paper. It is very important to stress that all the data acquisition and classification procedures covered here were designed to work concurrently and in real time, with a latency better than 10 ms. This restriction explains why we did not optimise the procedures for maximum precision nor for maximum feature extraction. It will be obvious that taking this restriction away, and performing the analysis off-line, can greatly improve the functionality of the gesture sensing technology for analytical and scientific purposes.

1. Data Acquisition:
Since the low pass filtering is performed in the hardware, we can set up a three-channel sampler with a sampling rate of 1024 S/s. Using a National Instruments USB data acquisition device (type NI USB-6210 or NI USB-6212) and calling in the drivers and libraries offered by NI-DAQmx, this is pretty easy. The resolution is 16 bits and the voltage sensitivity for the inputs is programmed for -10 V to +10 V. Headroom is plentiful, so we need not fear clipping, since the signals from our hardware under normal circumstances stay well within a range of 4 Vpp. The data is read in a callback function operating at a rate of 256 times a second, so on each call 3 x 4 = 12 new samples are read. Reading the data at the sampling rate itself is impossible due to limitations in the USB driver implementation provided by National Instruments, despite the fact that the devices themselves can handle sampling rates well above 250 kS/s.
These samples are written to fast circular data buffers, 256 entries in size, one for each of the three vectors. The time window available here is thus 250 ms and the sampling rate remains 1024 S/s.
A medium-size time window is set up in the same callback, for which we downsample to 256 S/s and write the result to the medium data buffers. These circular buffers thus cover a time window of 1 second, or 256 samples. The downsampling applies a simple low pass filter to the data.
A long time window, covering a 4 second timespan, is then implemented by downsampling again, to 64 S/s. These circular buffers again contain 256 samples.
Pointers to these nine circular buffers are made available to the user via a specific structure. All buffer data is in double precision format and normalized, bipolar -1 to +1.
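
In outline, the buffer cascade for one vector can be sketched as follows. The routine name and parameter layout are illustrative only; the actual callback in our GMT code differs in its details, but the buffer sizes and rates are as described above.

SUB PushSamples (BYREF fast() AS DOUBLE, BYREF medium() AS DOUBLE, BYREF slow() AS DOUBLE, BYREF newsamp() AS DOUBLE, BYVAL vektor AS LONG)
    ' newsamp(0 TO 3): the 4 fresh samples delivered for this vector on the current callback
    LOCAL i AS LONG
    LOCAL avg AS DOUBLE
    DIM phase(0 TO 2) AS STATIC LONG
    DIM slowacc(0 TO 2) AS STATIC DOUBLE

    ' fast buffer: 250 ms at 1024 S/s, the newest sample always ends up in fast(UBOUND)
    FOR i = 0 TO 3
        ARRAY DELETE fast(), newsamp(i)           ' drop the oldest entry, append the newest
    NEXT i

    ' medium buffer: 1 s at 256 S/s -> one averaged sample per callback (simple low pass)
    avg = (newsamp(0) + newsamp(1) + newsamp(2) + newsamp(3)) / 4
    ARRAY DELETE medium(), avg

    ' slow buffer: 4 s at 64 S/s -> one averaged sample every fourth callback
    slowacc(vektor) = slowacc(vektor) + avg
    INCR phase(vektor)
    IF phase(vektor) = 4 THEN
        ARRAY DELETE slow(), slowacc(vektor) / 4
        RESET phase(vektor), slowacc(vektor)
    END IF
END SUB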

Further preprocessing performed from within the same callback consists of:

A. Setting up a running integrator with rectification in order to calculate a signal strength proportional to the size of the moving and reflecting body. These values are returned in the normalized .xa(), .ya(), .za() fields of the structure. The integration depth can be controlled with the parameter in the .dta field.
The algorithm here also implements a high pass filter to eliminate slow DC offsets and artifacts originating from very slow and mostly involuntary movements of the body.
Summarizing, body surface derivation chains the following stages (a sketch follows the list):

          high pass filter
          rectifier
          low pass filter
          integrator
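
In outline, and with illustrative names and filter constants (only the order of the stages matters here), this chain looks as follows:

SUB BodySurface (BYREF ar() AS DOUBLE, BYVAL vektor AS LONG)
    ' ar(): fast buffer for one vector, newest sample in ar(UBOUND(ar))
    DIM dc(0 TO 2)  AS STATIC DOUBLE              ' slowly tracking DC estimate (high pass state)
    DIM env(0 TO 2) AS STATIC DOUBLE              ' integrated envelope per vector
    LOCAL x, depth AS DOUBLE

    depth = MAX(@pDoppler.dta, 1)                 ' integration depth (.dta)

    ' 1. high pass: remove slow DC offsets and near-static artifacts
    dc(vektor) = dc(vektor) + ((ar(UBOUND(ar)) - dc(vektor)) / 256)
    x = ar(UBOUND(ar)) - dc(vektor)

    ' 2. rectify
    x = ABS(x)

    ' 3. low pass / leaky integration: the envelope tracks the moving body surface
    env(vektor) = env(vektor) + ((x - env(vektor)) / depth)

    ' return the result in the corresponding structure field
    SELECT CASE vektor
        CASE 0 : @pDoppler.xa = env(0)
        CASE 1 : @pDoppler.ya = env(1)
        CASE 2 : @pDoppler.za = env(2)
    END SELECT
END SUB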

B. Calculation of the frequencies in the Doppler signals in order to grasp information on the speed of movement. Here we encounter quite a few fundamental problems: first of all, the signals are in no way periodic but are in fact bandwidth-limited noise bands. Second, the frequency limits of these Doppler signals depend on the angle of the movement: the cosine in the Doppler formula fd = 2 v fo cos(a) / c. Only when movements are performed towards or away from a particular transducer, in line with its vector, will the cosine be unity and the absolute value of the bandwidth trustworthy. If the performer knows this, he or she can obviously take it into account. For general gesture analysis, though, ideally the coordinates of the movement in space have to be known. As we showed in our papers and notes on hardware for gesture recognition, this can be solved either by combining sonar and radar systems, or by FM modulating the carrier wave. This adds a lot of complexity to the software and we will not delve deeper into the related problems at this point.
However, even without cosine correction and lacking positional information, we can greatly reduce the induced errors by exploiting the redundancy offered by the fact that we have three vectors of data available. Since the sensors are placed on the vertexes of an imaginary tetrahedron, the maximum value for the detected speed over the three vectors together can never be more than 50% off, the cosine of 60 degrees being equal to 0.5.
The algorithm performed in the callback function first converts the data in the fast buffer to a square wave, using a Schmitt trigger with hysteresis to avoid noise interference. Then the zero crossings of this signal are counted, disregarding any periodicity. The results, returned in the .xf(), .yf(), .zf() fields of our structure, thus reflect noise densities of the signal, proportional to movement speed. The numeric values returned are not normalized but reflect real world values expressed in Hz, to be interpreted carefully within the constraints laid out above. If normalization is required, the values can simply be divided by 200, a realistic maximum value as determined empirically.
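
In outline (the function name and the explicit hysteresis argument are illustrative; in the actual callback the threshold is tied to the .noise field):

FUNCTION ZeroCrossDensity (BYREF ar() AS DOUBLE, BYVAL hyst AS DOUBLE) AS SINGLE
    ' ar(): the 250 ms fast buffer (256 samples at 1024 S/s); hyst: hysteresis threshold
    LOCAL i, crossings AS LONG
    LOCAL state AS LONG                      ' output of the Schmitt trigger: +1 or -1
    state = 1
    FOR i = 0 TO UBOUND(ar)
        IF (state = 1) AND (ar(i) < -hyst) THEN
            state = -1
            INCR crossings                   ' downward crossing
        ELSEIF (state = -1) AND (ar(i) > hyst) THEN
            state = 1
            INCR crossings                   ' upward crossing
        END IF
    NEXT i
    ' two crossings make up one period, counted over a 0.25 s window, hence the result in Hz:
    FUNCTION = crossings * 2
END FUNCTION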


C. Calculation of vectorial acceleration. This is done by differentiating the vectorial frequency information. The delta-t value can be user controlled via the .dtacc parameter. Larger values give better resolution, at the detriment of responsiveness. The results are returned in .xac(), .yac(), .zac(), the scaling being independent of the setting for .dtacc.
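
In essence this is a difference quotient over the small frequency buffers (.pxfbuf and friends). A sketch, with an illustrative function name:

FUNCTION VectorAcceleration (BYREF fbuf() AS SINGLE, BYVAL dtacc AS LONG) AS SINGLE
    ' fbuf(): the 64 frequency values kept at 256 S/s, newest in fbuf(UBOUND(fbuf))
    ' dtacc : differentiation depth, 1..63 (the .dtacc field)
    ' dividing by dtacc keeps the scaling independent of the chosen depth
    FUNCTION = (fbuf(UBOUND(fbuf)) - fbuf(UBOUND(fbuf) - dtacc)) / dtacc
END FUNCTION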


This callback function updates the parameter fields in the Doppler structure 256 times a second. This is the first level of processing of the movement data. It yields information on the moving body surface as well as a first approximation of speed.
Performing FFT's on the data buffers from within the same callback is impossible, even on a quad core PC. Therefore the spectral analysis of the data has to be performed in a different thread. The shape of the spectrum (the distribution of the frequency bands) reflects the characteristics of the gestures very well. This can be demonstrated convincingly by mapping the output of the transform onto the keys of a piano, whereby the power of each frequency band is mapped onto the velocity of the attack for each key.
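
Such a mapping can be sketched as follows; NoteOn stands for whatever MIDI output routine is at hand and is not part of the code discussed here, and the velocity scaling is merely illustrative.

SUB SpectrumToPiano (BYREF sp() AS DOUBLE)
    ' sp(): 128 power spectrum bins of the fast transform
    LOCAL i, note, velo AS LONG
    FOR i = 1 TO 88                              ' skip bin 0, use the next 88 bins
        note = 20 + i                            ' MIDI notes 21..108 = the 88 piano keys
        velo = MIN(127, INT(SQR(sp(i)) * 127))   ' compress the dynamics somewhat
        IF velo THEN CALL NoteOn(0, note, velo)  ' NoteOn is a placeholder, not part of this code
    NEXT i
END SUB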

The complete structure as declared in the software is:

TYPE DopplerType DWORD      ' for ii_20xx with NiDAQmx
     xa     AS SINGLE       ' reflection amplitudes for the x-vector
     ya     AS SINGLE       ' idem for y-vector
     za     AS SINGLE       ' idem for z-vector 
     xf     AS SINGLE       ' doppler frequency shift density for the x-vector
     yf     AS SINGLE
     zf     AS SINGLE
     xac    AS SINGLE       ' acceleration for the x-vector
     yac    AS SINGLE
     zac    AS SINGLE
     pxfast AS DOUBLE PTR   ' pointer to the x-vector data(0). Most recent data are in data(255) 250 ms buffer  1024 S/s
     pyfast AS DOUBLE PTR
     pzfast AS DOUBLE PTR
     pxm    AS DOUBLE PTR   ' 1s buffer, sampling rate 256 S/s
     pym    AS DOUBLE PTR
     pzm    AS DOUBLE PTR
     pxslow AS DOUBLE PTR   ' 4s buffer, sampling rate 64 S/s
     pyslow AS DOUBLE PTR
     pzslow AS DOUBLE PTR
     pxfbuf AS SINGLE PTR   ' 64 values for the frequency measurement, used for calculation of acceleration
     pyfbuf AS SINGLE PTR   ' sampling rate: 256 S/s
     pzfbuf AS SINGLE PTR   ' most recent data in data(63)
     dtacc  AS WORD         ' dt for acceleration derivation. Valid values: 1-63. Do not exceed range!
     dta    AS WORD         ' amplitude integration depth  (0-255)
     noise  AS SINGLE       ' noise level threshold   (1E-3, default, for 60 dB signal noise ratio)
     hpf    AS WORD         ' high pass filter differentiation depth
END TYPE
          

2. Gesture recognition

Although the callback function described above already performs quite some data processing and allows us to obtain elementary information on the movement of the body, it is not capable of performing gesture recognition. Hence the next steps. In our approach, rather than trying to recognize predefined gesture models, we attempt to maximize the number of gestural characteristics retrievable from this particular hardware and its signals, as used throughout thousands of experiments with over 20 different human bodies. The subjects were either trained musicians or dancers operating with direct auditory feedback. We baptized these retrievable gestural characteristics Namuda gesture prototypes. Each prototype is defined within a certain timeframe (100 to 1000 ms) and has a calculated, normalized property strength as well as a persistence value expressed in time units (generally 1/256 second units, corresponding to the best possible resolution within the constraints of our sampling function).

Namuda gesture types: (3)

So far we are able to distinguish twelve types (not counting the no-movement property, 'Freeze'). We certainly do not exclude that it might be possible to derive a few more, but on the other hand we found that these twelve micro-gestures might very well already be at the upper limit of what we can clearly control, given the motorics of our bodies. Many gestural characteristics can be seen as dipoles: they exclude each other. The dipoles we distinguish are:

1.- Speeding up versus slowing down

speedup
definition: the movement speed goes up within a timeframe of maximum 500 ms

slowdown
definition: the movement speed goes down within a timeframe of maximum 500 ms


2.- Grow versus shrink (or: expansion versus implosion)

exploding
definition: the size of the moving body surface increases within a timeframe of maximum 1000 ms


imploding
definition: the size of the moving body surface decreases within a timeframe of maximum 1000 ms


Algorithms for feature extraction:

Speedup/slowdown and implode/explode use a gesture recognition algorithm based on a FIR filter. The coefficients used are the elementary gesture type that we are trying to evaluate. In the first approach these coefficients are linear functions. Further refinement consists of applying different functions to calculate the coefficients: quarter sine or cosine, exponentials or even beta functions. However, this should only be done on the condition that the pattern recognition triggers the property with linear coefficients, the reason being that we simply do not have enough computing power to evaluate each recognition algorithm with many different coefficient models in real time.
The procedure is implemented as a function to which the pointer to the frequency data array (covering 500 ms) is passed as well as the vector. Vectors are numbered 0,1,2 for the x, y, and z vertexes where the transducers are placed physically. Vector 3 always refers to a sum or a maximum value. (2)

SUB GestureProp_Speedup (BYREF ar() AS SINGLE, BYVAL vektor AS LONG)
    'time weighted moving average approach using a Nth order FIR filter
    'vector: 0,1,2  for x,y,z
    'speed data   - the coefficients are the shape of a speedup gesture (rising)

    LOCAL i AS DWORD
    LOCAL av AS SINGLE
    STATIC order AS LONG
    STATIC Fscale  AS SINGLE
    DIM oldF(0 TO 2) AS STATIC SINGLE

    ' speedup coefficients, these have only to be recalculated if the window size is changed with .Forder:
    IF order <> gesture.Forder THEN
        order = gesture.Forder
        REDIM bs(0 TO order) AS STATIC SINGLE
        RESET Fscale
        FOR i = 0 TO order
            bs(i) = i / order                 'rising series of coefficients: this is the gesture model
        NEXT i
        Fscale = (order + 1) / 2              'for normalization
    END IF

    'FIR with time dependent linear coefficients:
    FOR i = (UBOUND(ar) - order) TO UBOUND(ar)
            av = av + (ar(i) * bs(i-UBOUND(ar) + order))
    NEXT i

    'now we calculate the matching value (fuzzy):
    IF av > oldF(vektor) THEN
          gesture.speedup(vektor) = (av - oldF(vektor)) * 2 
          INCR gesture.speedup_dur(vektor)
    ELSE
          RESET gesture.speedup(vektor), gesture.speedup_dur(vektor)
    END IF
    gesture.speedup_val(vektor) = av/ Fscale 'this is the renormalised output value of the filter itself
    oldF(vektor) = av

    'reconsider the global property (vektor 3):
    Gesture.speedup(3) =  MAX(Gesture.speedup(0), Gesture.speedup(1),Gesture.speedup(2))
    Gesture.speedup_val(3) =  MAX(Gesture.speedup_val(0), Gesture.speedup_val(1),Gesture.speedup_val(2))

END SUB              

Here is a plot of the result of this algorithm:

The upper three graphs give the frequency signals for the x,y and z vectors as found in pDoppler.xf, yf, zf. Vertical scale is in Hz. The window shows a timeframe of 4 seconds. The graphs below show the output of the recognition. The vertical scale shows property strength, again for the three vectors. The last graph shows the property for all three vectors together.
The procedure for slowdown recognition is identical, but the coefficients are a linearly descending series.

Here is a plot of the result of the slowdown recognition algorithm:


For shrinking/expanding body surface we use the same approach, but found that the time window has to be taken larger for good reliability. We have set it at 1 second. The FIR filter order can be changed by the user via the parameter in the .Sorder field.

Here is the procedure for shrinking gesture recognition with linear coefficients:

SUB Gestureprop_Shrink (BYREF ar() AS SINGLE, BYVAL vektor AS LONG)
    ' pattern recognition code using the FIR approach, with the gesture model in the coefficients
    ' since this works with our omnidirectional ii2010 receivers, no angle correction is required:
    ' the amplitudes of the reflected signals are independent of the frequency of the doppler shift.

    LOCAL i,j AS DWORD
    LOCAL av, s AS SINGLE
    STATIC order AS LONG
    STATIC Sscale  AS SINGLE
    DIM oldS(0 TO 2) AS STATIC SINGLE
    DIM bs(0 TO gesture.Sorder) AS STATIC SINGLE

    'calculation of a descending series of coefficients:
    IF order <> gesture.Sorder THEN
        order = gesture.Sorder
        REDIM bs(0 TO order) AS STATIC SINGLE
        RESET Sscale
        FOR i = 0 TO order                          
            bs(i) = 1 - (i/(order+1))               'this could also be the result of a beta-function describing the gesture type
            Sscale = Sscale + bs(i)                 'sum of factors, required for normalisation.
        NEXT i
    END IF

    'FIR filter with time dependent linearly descending coefficients:
    RESET j
    FOR i = (UBOUND(ar) - order + 1) TO UBOUND(ar)
            av = av + (ar(i) * bs(j))
            INCR j
    NEXT i

    'now classify: shrinking surface
     IF (av < oldS(vektor)- (@pDoppler.noise/ SQR(order)))  AND (av/Sscale > @pDoppler.noise) THEN
          IF gesture.implo_dur(vektor) THEN                                      
                   gesture.implo(vektor) = MAX(MIN((oldS(vektor) - Av)* 2.5, 1),0) 
                                                          'scaling factor for commensurability with the explo property
                                                          'observed range, before rescaling with * 2.5: 
                                                          '19.04: x<= 0.156, y<=0.224, z<=0.08  [gwr with clothes]
                                                          '19.04: X<= 0.335, y<=0.373, z<=0.268 [gwr naked]
          ELSE
                  RESET gesture.implo(vektor)                               
                                                          'the property will only be set if it lasts at least 7.8ms
          END IF
          INCR gesture.implo_dur(vektor)                  'in 1/256s units (3.9ms).
     ELSE
                  RESET gesture.implo(vektor), gesture.implo_dur(vektor)
     END IF
     gesture.implo_val(vektor) = av/ Sscale   'we always return this normalized value for research purposes.
     oldS(vektor) = av

    gesture.implo(3) =  (gesture.implo(0) * gesture.implo(1) * gesture.implo(2)) ^0.33   'to be further investigated
    gesture.implo_val(3) =  (gesture.implo_val(0) * gesture.implo_val(1) * gesture.implo_val(2)) ^0.33

END SUB          
                                       

Here is a plot of the result of the implosion algorithm:

In this case we plot in the upper three graphs the normalised amplitudes of the doppler signals against time (4 seconds). Below, again property strength.

For explosion we get similarly:

3. Fluency and speed properties


Not all gesture properties are calculated with this or a similar FIR algorithm. For the properties fluency (constancy of moving body surface within the timeframe) and fixspeed (constancy of movement speed within the timeframe), we found traditional statistics useful. Here we look in the databuffer and calculate the running average as well as the significance. The smaller the deviation from the mean, the stronger the property will be evaluated.
Here is, as an example, our coding for fluency detection:

SUB Gestureprop_Fluent (BYREF ar() AS SINGLE, BYVAL vektor AS LONG)
    LOCAL avg, d, s AS SINGLE
    LOCAL i AS DWORD
    STATIC maxval, maxprop AS SINGLE

         ' calculation of the average in the body surface data array:
        FOR i = 0 TO UBOUND(ar)
            Avg = Avg + ar(i)
        NEXT i
        Avg = Avg / (UBOUND(ar) +1)        'normalised 0-1

        'calculate the standard deviation: 
        FOR i = 0 TO UBOUND(ar)
            d = d + ((ar(i) - Avg)^ 2)     ' least squares
        NEXT i                             ' d = sum of the squared deviations
        s = SQR(d/(UBOUND(ar)-1))          ' s = sample standard deviation

            ' statistics math memo:
            ' 68%   of all values in the data are between Avg - s  and Avg + s
            ' 95%                             are between Avg - 2s and Avg + 2s
            ' 99.7%                           are between Avg - 3s and Avg + 3s

            IF Avg > @pDoppler.noise * 5  THEN
                IF gesture.flue_dur(vektor) THEN                      
                    gesture.flue(vektor) = MAX(0, 1 - (5*s / Avg))   'gives quite a good range
                ELSE
                    RESET gesture.flue(vektor)
                END IF
                INCR gesture.flue_dur(vektor)
            ELSE
                RESET gesture.flue(vektor), gesture.flue_dur(vektor)
            END IF
            gesture.flue_val(vektor) = Avg                            ' always returned.

        Gesture.flue_val(3) = (Gesture.flue_val(0) * Gesture.flue_val(1) * Gesture.flue_val(2)) ^ 0.33  ' questionable
        gesture.flue(3) = (gesture.flue(0) * gesture.flue(1) * gesture.flue(2)) ^ 0.33

END SUB                       

 

Here is a plot of the result of the fluency property algorithm:


And here, again for the speed constancy property:

 

4. Collision - Theatrical Collision


For the dipole collision/theatrical collision, we use yet another approach, based on an analysis of the acceleration data acquired in the sampling callback. The time window used in this analysis is 100 to 390 ms. The smaller values make the algorithm more responsive.
We define the collision property as follows:
the acceleration rises up to the point of collision, where we have a sudden sign reversal.
And, the theatrical collision as:
the acceleration goes down to the point of collision - a standstill - where we have a small sign reversal followed by a rise in acceleration.

Algorithm:
A first leaky integrator is applied to the first three quarters of the acceleration data (upval) and a second integrator to the last quarter of the data (downval). The sensitivity is set with the variable sens.
If now upval > sens and downval < -sens, we set the collision property.

The practical coding is as follows:

SUB Gestureprop_Collision (BYREF ar() AS SINGLE,  BYVAL vektor AS LONG)
    ' the array passed should be the amplitude array, we need it to look up the impact value on the moment of the collision
    ' we could also pass the value ar(230) alone.
    ' the algo uses a leaky integrator, not FIR!, on the acceleration data
    ' the procedure returns the gesture properties collision and theatrical collision. These are mutually exclusive.
    STATIC tog AS LONG
    STATIC xslope, yslope, zslope, xdown, ydown, zdown, sens AS SINGLE
    IF ISFALSE tog THEN
            sens = 0.4                             ' this is a critical value to experiment with
            DIM xc(0 TO 25) AS STATIC SINGLE       ' 101 ms buffers for acceleration and collision
            DIM yc(0 TO 25) AS STATIC SINGLE       ' empirically determined
            DIM zc(0 TO 25) AS STATIC SINGLE
            tog = %True
    END IF

    SELECT CASE vektor
           CASE 0
                   ARRAY DELETE xc(), @pDoppler.xac    ' circular buffer with acceleration data. Most recent data now in xc(last)
                   'calculate the shape:
                   xslope = (xslope + xc(0)) / 2       ' low pass filter - should cut off around 25Hz
                   xdown = (xdown + xc(25)) /2
                   IF (xslope > sens) AND (xdown < -sens) THEN       ' this is the pattern recognition condition
                            Gesture.collision(0) = xslope - xdown    ' always positive, since xdown is negative !
                            INCR Gesture.collision_dur(0)
                            Gesture.impact(0) =  ar(230)             ' further research required for optimum value in the array
                            RESET Gesture.theacol(0), Gesture.theacol_dur(0)
                   ELSEIF (xslope < -sens) AND (xdown > sens) THEN    ' condition for theatrical collision
                            Gesture.theacol(0) = xdown - xslope       ' always positive, since xslope is negative !
                            INCR Gesture.theacol_dur(0)
                            RESET Gesture.collision(0),  Gesture.collision_dur(0)
                            Gesture.impact(0) =  ar(230)         
                   ELSE
                            RESET Gesture.collision(0), Gesture.theacol(0), Gesture.impact(0)
                            RESET Gesture.collision_dur(0), Gesture.theacol_dur(0)
                   END IF
           CASE 1
                   ARRAY DELETE yc(), @pDoppler.yac    ' the acceleration values are updated 256 times a second
                   yslope = (yslope + yc(0)) / 2
                   ydown = (ydown + yc(25)) / 2
                   IF (yslope > sens) AND (ydown < -sens) THEN
                         Gesture.collision(1) = yslope - ydown
                         INCR Gesture.collision_dur(1)
                         Gesture.impact(1) = ar(230)
                         RESET Gesture.theacol(1), Gesture.theacol_dur(1)
                   ELSEIF (yslope < -sens) AND (ydown > sens) THEN   
                         Gesture.theacol(1) = ydown - yslope    
                         Gesture.impact(1) =  ar(230)           
                         INCR Gesture.theacol_dur(1)
                         RESET Gesture.collision(1),  Gesture.collision_dur(1)
                   ELSE
                         RESET Gesture.collision(1), Gesture.theacol(1), Gesture.impact(1)
                         RESET Gesture.collision_dur(1), Gesture.theacol_dur(1)
                   END IF
           CASE 2
                   ARRAY DELETE zc(), @pDoppler.zac
                   zslope = (zslope + zc(0)) / 2
                   zdown = (zdown + zc(25)) / 2
                   IF (zslope > sens) AND (zdown < -sens) THEN
                        Gesture.collision(2) = zslope - zdown
                        Gesture.impact(2) = ar(230)
                        INCR Gesture.collision_dur(2)
                        RESET Gesture.theacol(2), Gesture.theacol_dur(2)
                   ELSEIF (zslope < -sens) AND (zdown > sens) THEN    
                        Gesture.theacol(2) = zdown - zslope    
                        Gesture.impact(2) =  ar(230)  
                         INCR Gesture.theacol_dur(2)
                        RESET Gesture.collision(2),  Gesture.collision_dur(2)
                   ELSE
                        RESET Gesture.collision(2), Gesture.theacol(2), Gesture.impact(2)
                        RESET Gesture.collision_dur(2), Gesture.theacol_dur(2)
                  END IF
    END SELECT

    Gesture.collision(3) = MAX(Gesture.collision(0), Gesture.collision(1), Gesture.collision(2))
    Gesture.theacol(3) = MAX(Gesture.theacol(0), Gesture.theacol(1), Gesture.theacol(2))
    IF Gesture.collision(3) THEN
        Gesture.impact(3) = MAX(Gesture.impact(0), Gesture.impact(1), Gesture.impact(2))
    ELSEIF Gesture.theacol(3) THEN
        Gesture.impact(3) = MAX(Gesture.impact(0), Gesture.impact(1), Gesture.impact(2))
    ELSE
        RESET Gesture.impact(3)
    END IF

END SUB


Here is a plot of the result of the collision detection algorithm:

 

In this case the upper graphs show the vectorial acceleration channels. Theatrical collision detection resulted in the graphs below:

5. Roundness - Edginess

The spectral analysis of the Doppler signals in the short data buffer allows us to discriminate gesture properties situated on a dipole that can be described as 'roundness' versus 'edginess'. Other ways to describe the gesture characteristic referred to could be smooth versus shaky or jagged.

definition: roundness applies if the gestures are smooth and continuous; edginess, if the gestures contain many abrupt components and discontinuities. The property reflects aspects of the shape of a gesture very well.
The analysis is based on the spectral power distribution in the fast data buffer (250 ms). The code as we first wrote it for the Fourier transforms is:

 

SUB DFT_Dbl (Samp#(), Sp#()) EXPORT
    'discrete fourier transform on double precision data
    LOCAL x, i, maat, halfsize AS LONG
    LOCAL kfaktor#, kt#, Rl#, Im#, g#, pw#
    DIM a(UBOUND(Samp#)) AS LOCAL DOUBLE
    maat = (UBOUND(Samp#)) + 1             ' must be a power of 2
    halfsize = maat / 2
    'note: size of Sp# must be Sp#(halfsize-1)
    kfaktor# = Pi2 / maat                  ' this is eqv. to ATN(1) * 8# / maat
    MAT a() = (1.0/maat) * Samp#()
    i = 0
    DO
        kt# = kfaktor# * i
        RESET Rl#, Im#
        FOR x = 0 TO maat - 1
            g# = (kt# * x)
            Rl# = Rl# + (a(x) * COS(g#))
            Im# = Im# - (a(x) * SIN(g#))
        NEXT x
        pw# = (Rl# * Rl#) + (Im# * Im#)    ' sum of squares
        IF pw# THEN Sp#(i) = 2 * SQR(pw#)
        INCR i
    LOOP UNTIL i = halfsize                ' the second half would be empty

END SUB

 

In a later, tenfold improvement for speed, we recoded this using lookup tables for the sine and cosine factors and applied a Hanning window inline. The resulting code obviously becomes a lot less readable. All DFT's in our gesture recognition code are performed with a 25% overlap.

To classify these properties, we split the power spectrum frequency bins in Sp#() into two parts at a variable point i. Note that our spectrum is 8 octaves wide (4 Hz to 512 Hz) and thus, taking into account the linear nature of the frequency distribution in the transform, we have to set i = 15 for a normal midpoint at 64 Hz. We then sum the power in the spectral bands up to i, as well as the power in the spectral bands from i+1 up to the highest frequency. The power found in the lowest bin, Sp#(0), is dropped. If lw is the first sum and hw the second, the smoothness property for each vector is calculated as 1 - (hw / (lw + hw)) and the edgy property as 1 - (lw / (lw + hw)). The structure also returns the durations of these mutually exclusive properties in 1/256 s units. The algorithm for this property runs at a much lower rate of 16 Hz, thus guaranteeing a data overlap of 25% on each refresh; performing the required transforms much faster causes too much jitter in the behaviour of our real time multitasker. A consequence of this is that the durations are incremented in steps of 16 units (62.5 ms).
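
In outline, the classification looks as follows. The routine name is illustrative and the noise handling is simplified; the field names are those of the gesture structure given further down.

SUB Gestureprop_SmoothEdgy (BYREF sp() AS DOUBLE, BYVAL vektor AS LONG)
    ' sp(): 128 power spectrum bins of the fast buffer (4 Hz per bin); bin 0 is dropped
    LOCAL i AS LONG
    LOCAL lw, hw AS SINGLE
    FOR i = 1 TO 15                              ' bands up to the 64 Hz midpoint
        lw = lw + sp(i)
    NEXT i
    FOR i = 16 TO UBOUND(sp)                     ' bands above the midpoint
        hw = hw + sp(i)
    NEXT i
    IF lw + hw > @pDoppler.noise THEN
        gesture.smooth(vektor) = 1 - (hw / (lw + hw))
        gesture.edgy(vektor)   = 1 - (lw / (lw + hw))
        IF gesture.smooth(vektor) > gesture.edgy(vektor) THEN
            gesture.smooth_dur(vektor) = gesture.smooth_dur(vektor) + 16    ' 16 Hz refresh: 16 units = 62.5 ms
            RESET gesture.edgy_dur(vektor)
        ELSE
            gesture.edgy_dur(vektor) = gesture.edgy_dur(vektor) + 16
            RESET gesture.smooth_dur(vektor)
        END IF
    ELSE
        RESET gesture.smooth(vektor), gesture.edgy(vektor)
        RESET gesture.smooth_dur(vektor), gesture.edgy_dur(vektor)
    END IF
END SUB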

As we found that the analysis of the power spectrum offers a good basis for the analysis of some gestural features, we created a thread in the software to perform the spectral transforms on all three data buffers continuously. The results are returned to the user via pointers in the gesture structure: gesture.pspf() for the fast transform (4 Hz to 512 Hz in 128 bands, with a refresh frequency of 25 Hz); gesture.pspm() for the medium buffer transform (1 Hz to 128 Hz in 128 bands, with a refresh frequency of 5 Hz); and gesture.psps() for the slow buffer transform (0.25 Hz to 32 Hz in 128 bands, with a refresh frequency of 1.25 Hz).

                           

6. Airborness

We came upon this gesture property when doing experiments with balls thrown through the sensor system. When looking into the spectral analysis data obtained, we noticed that for flying objects the spectrum is discontinuous and reveals a gap at its low end. By looking at the size of the gap, we can distinguish flying objects - or, similarly, acrobatic jumps - from bodies in contact with the ground. The reason for this is quite easy to understand if we realize that a moving body normally always has some parts at rest, even when extremities such as arms, legs and head are moving vehemently. The Doppler noise bands returned are therefore continuous and always start at zero. Airborness might not be the most useful feature to extract from a musician's gestures, but it forms an attractive bonus when it comes to dance movement analysis. It will be obvious that the timeframe for the detection of this property has to be taken very short and that the persistence values of the property cannot attain high values.

When coding for the recognition of this property, one has to be very careful to take into account the spread of the background noise in the spectrum. To clarify this, we give here a plot of the maxima encountered in the spectral transform over the 250 ms time window (x-vector) without any body in the setup:

The values on the horizontal axis are the frequency bins and have to be multiplied by a factor of four to obtain real world frequency values. Measurement time for this plot was 2 minutes. For a proper evaluation of spectrum based gesture properties, this basic noise spectrum has to be subtracted from the acquired data arrays. From a scientific point of view, the superior way of handling this would be to let the software measure the basic noise spectrum on initialisation. Taking into account the stringent conditions for a meaningful measurement (no moving bodies or objects in the neighbourhood, as well as a measurement time of a few minutes), we rather used a rough approximation of the curve in our coding. Here are two typical spectrum plots thus obtained during a jump:

The vertical scaling was adapted by a factor of 10 to cope with the higher dynamic range obtained here. Unfortunately, due to the refresh speed limitations of the transforms in our coding, this property can only be calculated at a rate of 8 times a second. Hence, the duration field returns only a rough estimate of the time the body is airborne and cannot be used for precise measurement of jumps. However, the shape of the gesture during the jump is reflected pretty well in the obtained spectrum. In order to optimize the recognition code for speed, we first convert the spectra to octave bands. Then we evaluate the proportion between the power summed over all the higher bands and the power in the lowest two bands. If the total power is higher than a measured noise level and this proportion exceeds an empirically determined value (a factor of 17 was found), the property is set. The reliability of jump detection using this algorithm is higher than 90%.
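
In outline, and with an illustrative routine name, the test looks as follows; the octave band boundaries and the noise handling are simplified.

SUB Gestureprop_Airborne (BYREF sp() AS DOUBLE, BYVAL vektor AS LONG)
    ' sp(): 128 power spectrum bins of the fast buffer, 4 Hz per bin; bin 0 is dropped
    ' octave bands in bins: 1 | 2-3 | 4-7 | 8-15 | 16-31 | 32-63 | 64-127
    LOCAL i, band AS LONG
    LOCAL lo, hi AS SINGLE
    DIM oct(0 TO 7) AS LOCAL SINGLE
    FOR i = 1 TO UBOUND(sp)
        band = INT(LOG(i) / LOG(2) + 0.000001)   ' octave number of this bin (epsilon guards exact powers of 2)
        oct(band) = oct(band) + sp(i)
    NEXT i
    lo = oct(0) + oct(1)                         ' the lowest two octave bands
    FOR band = 2 TO 7
        hi = hi + oct(band)
    NEXT band
    ' a flying body leaves a gap at the low end of the spectrum:
    IF (lo + hi > @pDoppler.noise) AND (hi > lo * 17) THEN
        gesture.airborne(vektor) = hi / (lo + hi)
        INCR gesture.airborne_dur(vektor)
    ELSE
        RESET gesture.airborne(vektor), gesture.airborne_dur(vektor)
    END IF
END SUB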

7. Tempo and periodicity


The most difficult gesture property to retrieve in real time is periodicity. It represents a well known problem that has been treated in depth by many different authors in very different contexts. The problem is the same whether applied to music or to gesture. We need information about periodicity in the gesture because it allows us to derive a tempo from gesture input.

There are a number of different algorithms possible, each with pros and cons. Since we are dealing with slow periodic phenomena, the fastest algorithm consists of measuring the time between periods (1/t then equals the frequency) and checking the deviation from the mean on each new periodic event. This fastest algorithm starts from the collision detector and yields a delay time in the order of 100 ms. Synchronization is quite possible, but latency and sudden jumps cannot be avoided. For clearly colliding periodic gestures this method is very reliable and precise, as long as the periodicity stays below 4 Hz (tempo MM 240). However, it imposes many restrictions on the type of gesture input. The jitter on the periodicity can be used as a basis to calculate a fuzzy value for the tempo property. For conducting-like gestures we found that this path can be followed, although the results are not always very convincing.
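
In outline, such an interval-based tempo tap can be sketched as follows. The routine name, the smoothing constant and the scaling of the jitter value are illustrative; the actual code ties into the collision detector described in section 4, and its jitter scaling differs from the FFT-based code below.

SUB TapTempo (BYVAL nowticks AS LONG)            ' nowticks: current time in 1/256 s units
    ' to be called each time the collision property fires for vector 3
    STATIC lastticks AS LONG
    STATIC period AS SINGLE                      ' running average of the period, in ticks
    LOCAL p AS SINGLE
    IF lastticks THEN
        p = nowticks - lastticks
        IF (p >= 64) AND (p <= 512) THEN         ' accept 0.25 s .. 2 s, i.e. MM 240 .. MM 30
            IF period THEN
                period = ((period * 3) + p) / 4  ' smooth out the jitter
            ELSE
                period = p
            END IF
            gesture.periodic(3) = 60 / (period / 256)        ' tempo in MM units
            gesture.jitter(3) = ABS(p - period) / period     ' relative deviation, a fuzzy (un)certainty measure
        END IF
    END IF
    lastticks = nowticks
END SUB
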
The FFT approach can be followed as well, but since it has to work on the 4 second buffers if it is to resolve tempi as low as MM 30, the latency will be very large. The advantage is that it also works on less clear-cut gesture input. A serious disadvantage is that synchronizing with the input is difficult. The code as we wrote it tries to cancel out the frequent occurrence of harmonics in the spectral transforms. It is still far from perfect, although we have been using it to synchronize music coded in MIDI files or calculated in real time with dancers on stage.

In the FFT approach, we coded a task that tries to derive a musical tempo, expressed in MM numbers, by interpolating between the frequency bins of the spectrum transforms. Here is a code snippet:

' spectra obtained after a DFT with a hanning window in the FFT thread
' vectors are 0,1,2 for the x,y,z buffers and 3 for the sum.
DIM sps(128) AS LOCAL DOUBLE AT @gesture.psps(vector)
i = ArrayMax_Dbl (sps())                    'find the strongest frequency bin
IF i THEN
    gesture.periodic(3) = ((i+1) * 15) - (7.5 * sps(i-1)/sps(i)) + (7.5 * sps(i+1)/sps(i))
    gesture.jitter(3) = (SQR(sps(i) + sps(i-1) + sps(i+1))) * 1E9
ELSE
    gesture.periodic(3) = 15 + (7.5 * sps(1)/sps(0))
    gesture.jitter(3) = (SQR(sps(0) + sps(1))) * 1E9
END IF

' now we try to evaluate the spectrum
RESET ct
FOR i = 0 TO 2
    IF ABS(gesture.periodic(i) - gesture.periodic(3)) < gesture.periodic(3) / 20 THEN        ' 5% deviation allowed
        INCR ct
    ELSEIF ABS((gesture.periodic(i)/2) - gesture.periodic(3)) < gesture.periodic(3) / 20 THEN
        INCR ct           'in this case the i vector must be an octave above the sum of vectors
    ELSEIF ABS((gesture.periodic(i)*2) - gesture.periodic(3)) < gesture.periodic(3) / 20 THEN
        INCR ct           'in this case the i vector must be an octave below the sum of vectors
    ELSEIF ABS((gesture.periodic(i)/3) - gesture.periodic(3)) < gesture.periodic(3) / 20 THEN
        INCR ct           'in this case the i vector must be a twelfth (an octave plus a fifth) above the sum of vectors
    END IF
NEXT i

SELECT CASE ct
    CASE 0
        gesture.jitter(3) = gesture.jitter(3) ^ 4        ' no periodicity
    CASE 3                                               ' here we are certain, although we may have the wrong octave.
        gesture.jitter(3) = gesture.jitter(3) ^ 0.25
    CASE 2                                               ' high probability
        gesture.jitter(3) = gesture.jitter(3) ^ 0.5      ' we should see how the one deviating period relates to the equal ones.
                                                         ' if it is a harmonic, we could increase confidence
    CASE 1
        gesture.jitter(3) = gesture.jitter(3) ^ 1.5      ' decrease certainty
END SELECT

Suggestions for improved code here are most welcome.

 

8. Freeze

A gesture property so trivial that we would almost forget to mention it: the absence of movement. We could also have named it rest, referring to the musical meaning. Absence of movement can be detected as soon as the received signals sink into the noise floor. This happens when no body is in the setup. A gesture freeze with a body in the field of the sensors will always have signal levels above the absolute noise, since a living body always moves. Therefore this property is not binary, but again a fuzzy value with a duration of its own.
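
In outline, and with an illustrative threshold of ten times the noise floor, the detection looks as follows:

SUB Gestureprop_Freeze (BYVAL vektor AS LONG)
    LOCAL a AS SINGLE
    SELECT CASE vektor
        CASE 0 : a = @pDoppler.xa
        CASE 1 : a = @pDoppler.ya
        CASE 2 : a = @pDoppler.za
    END SELECT
    IF a < @pDoppler.noise * 10 THEN                                  ' signal is sinking into the noise floor
        gesture.freeze(vektor) = 1 - (a / (@pDoppler.noise * 10))     ' fuzzy strength of the freeze
        INCR gesture.freeze_dur(vektor)                               ' in 1/256 s units
    ELSE
        RESET gesture.freeze(vektor), gesture.freeze_dur(vektor)
    END IF
    gesture.freeze_val(vektor) = 1 - gesture.freeze(vektor)           ' always returned
END SUB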

Here is a plot for the freeze property, wherein the property gets set only for the y-vector:

 

As of the time of writing, the structure used to return our gesture properties to the user looks like this:

TYPE GestureType DWORD      
     collision (3) AS SINGLE         '0=X, 1=Y, 2=Z, 3= total     acceleration based, the values are the correlation magnitude
     collision_dur (3) AS LONG       'in 1/256 s units
     theacol   (3) AS SINGLE         'theatrical collision        acceleration based
     theacol_dur (3) AS LONG         'in 1/256 s units
     implo     (3) AS SINGLE         'surface based  - normalized property strength
     implo_dur (3) AS LONG           'in 1/256 s units 0 at the start of detection, climbing up as long as the property is valid.
     implo_val (3) AS SINGLE
     explo     (3) AS SINGLE         'surface based  - normalized property strength
     explo_dur (3) AS LONG           'duration of the property in 1/256 s units
     explo_val (3) AS SINGLE         'prediction value of the property (FIR output)
     speedup   (3) AS SINGLE         'speed based
     speedup_dur (3) AS LONG
     speedup_val (3) AS SINGLE
     slowdown  (3) AS SINGLE         'speed based
     slowdown_dur (3) AS LONG
     slowdown_val (3) AS SINGLE
     periodic  (3) AS SINGLE      'should give the tempo in MM units
     jitter    (3) AS SINGLE      'should give the degree of certainty of the above tempo.
     flue      (3) AS SINGLE      'gives the standard deviation for the surface buffer  (in fact 1-s)   constant body surface property
     flue_val  (3) AS SINGLE      'gives the average value for the surface buffer   (normalized 0-1)    size of body surface
     flue_dur  (3) AS LONG        'duration in 1/256 s units
     fixspeed  (3) AS SINGLE      'speed based, gives standard deviation for the frequency buffer
     fixspeed_val (3) AS SINGLE   'gives the average value of the constant speed
     fixspeed_dur (3) AS LONG     'duration in 1/256 s units
     impact    (3) AS SINGLE      'surface of the body on the moment of detected collision (not a gesture property!)
     smooth    (3) AS SINGLE
     smooth_dur(3) AS LONG
     edgy      (3) AS SINGLE
     edgy_dur  (3) AS LONG 
     airborne (3) AS SINGLE
     airborne_dur (3) AS LONG
     freeze    (3) AS SINGLE      'property set when no movement is detected above the noise level
     freeze_dur (3) AS LONG       'in 1/256 s units
     freeze_val (3) AS SINGLE     'normally = 1- freeze, value always set
     distance  (3) AS SINGLE      'requires the combination of radar and sonar.
     pspf  (3) AS DOUBLE PTR      'pointer to the spectral transform data for the fast buffer
     pspm  (3) AS DOUBLE PTR      'pointer to the spectral transform data for the medium buffer
     psps  (3) AS DOUBLE PTR      'pointer to the spectral transform data for the slow buffer
     algo               AS DWORD       'parameter for algorithm to use (for research)
     Sorder             AS DWORD       'FIR filter order for the surface related properties
     Forder             AS DWORD       'FIR filter order for the speed related properties
END TYPE   

Since the entire structure is recalculated 256 times a second, responsiveness is pretty good, certainly in comparison with competing technologies not based on radar or sonar. We performed some measurements to estimate the processor load and found that the complete gesture analysing procedure takes between 12 MIPS and 25 MIPS on a quad core Pentium processor. That is about half of the available capacity. More than half of this processor load is on account of the spectral transforms.
Full source code as well as compiled libraries (DLL's) are available from the author for evaluation. The software, with slight modifications, can also be used without the National Instruments data acquisition device (4), if a decent quality four channel sound card is used. We have sensors available that can connect directly to the audio inputs of a PC; these do not require an analog computing and processing board. We welcome contributions from other researchers and invite them to use, test and improve our approach.

dr.Godfried-Willem Raes


Notes:

(1) Thanks to my dance and music collaborators who helped me perform the many necessary experiments and measurements using their bodies: Moniek Darge, Angela Rawlings, Helen White, Dominica Eyckmans, Lazara Rosell Albear, Zam Ebale, Marian De Schryver, Nicoletta Brachini, Emilie De Vlam, Tadashi Endo, Nan Ping, Jin Hyun Kim and many others. Thanks also to my collaborator Kristof Lauwers for helping out with code development, data analysis and debugging.

(2) All our code is written in PowerBasic and compiled with their Windows compiler, version 9.04. The code is part of a pretty large programming environment that we developed for real time composition: GMT. We put it in the public domain and all source code is available on the Logos website.

(3) Namuda is a word of our own coinage, constructed from 'naked music dance'. We felt the need for a new word to describe the very dance-related way of music making entailed by the development of our invisible instrument.

(4) A big advantage of using the National Instruments devices is that they are by definition very well supported in LabVIEW, undeniably the leader in professional instrumentation software and tools for analysis.


First sketch published on the web: 05.05.2010 by dr.Godfried-Willem Raes

Please, when quoting or referencing this paper, always include the link to the original source as well as the date, since we often update our writings.

Last update:2010-05-26