Fourier transform notes
------------------------

Say we have a sampled sound:


   s[0], s[1], s[2], ... , s[N-1]


 .-"-.     .-"-.     .-"-.     .-"-.     .-"-.  
     "-.-"     "-.-"     "-.-"     "-.-"     "-

 (http://textart4u.blogspot.com/2012/03/waves-text-art-ascii-art.html)

 where the points are the amplitude of a sound (for example air pressure
 or microphone voltage, on a linear scale rather than in decibels),
 taken at some sampling rate.

 N = number of points in sample
     (also called the "block size" or "buffer size"
      in some software settings)

 dt = time per sample = elapsed time between s[n] and s[n+1].

The elapsed time for the whole block of N points is then

 T = sample total time = N * dt

Example:

 dt = 1/44100 sec = 2.27e-5 sec = 0.0227 msec
      (44100 samples per second is a typical audio sampling rate)
 N = 2048
 T = N * dt = 0.0464 sec
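
The arithmetic in this example can be checked with a few lines of Python.
This is only a sketch; the variable names are chosen here to match the
symbols in these notes.

```python
# Sampling parameters from the example in these notes.
f_sampling = 44100          # samples per second
N = 2048                    # number of points in the block

dt = 1 / f_sampling         # seconds between consecutive samples
T = N * dt                  # total duration of the block, in seconds

print(f"dt = {dt:.3e} sec = {dt * 1000:.4f} msec")
print(f"T  = {T:.4f} sec")
```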

The sampling frequency is

 f_sampling = 1/dt = 44.1 kHz = 44100 samples per second

And if it takes B bits to represent one sample, then the bitrate would be

 bitrate = B * f_sampling
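
As a quick illustration of the bitrate formula, here is a sketch that
assumes B = 16 bits per sample and a single channel. Both of those values
are assumptions made for this example, not something fixed by these notes.

```python
f_sampling = 44100      # samples per second
B = 16                  # bits per sample (assumed for illustration)

bitrate = B * f_sampling            # bits per second, one channel
print(bitrate)                      # 705600
print(bitrate / 1000, "kbit/s")
```

A stereo recording would carry two such streams, so its bitrate would be
twice this number.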

The fastest oscillation possible within this signal looks like this:

  [1, -1, 1, -1, 1, ... ]

in which each sample is one half of the audio cycle, or in other words
it takes two sample points for one audio up-and-down sort-of-sine wave.

 f_highest_audio = 1/(2*dt) = (1/2) f_sampling = 22.05 kHz

(This upper limit is usually called the Nyquist frequency.)
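
To see the alternating pattern come out of the formula, the sketch below
samples a cosine whose frequency is exactly f_highest_audio. Rounding the
values is only there to hide tiny floating-point errors.

```python
import math

f_sampling = 44100
dt = 1 / f_sampling
f_highest_audio = 1 / (2 * dt)      # half the sampling frequency

# cos(2*pi*f*t) at t = n*dt becomes cos(pi*n), which alternates +1, -1.
samples = [round(math.cos(2 * math.pi * f_highest_audio * n * dt))
           for n in range(8)]
print(samples)      # [1, -1, 1, -1, 1, -1, 1, -1]
```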

A "Fourier transform" (here the discrete version, usually computed with
the "fast Fourier transform" or FFT algorithm) is a math procedure to
convert the N sound samples into N frequency values, where the frequencies
go from f=0 (the time average part of the signal, if it's nonzero) to
f_highest_audio.

For all the frequencies except f=0 and f=f_highest_audio, there are two
independent phases (i.e. sin, cos) and so two values.

Often one of these two is numbered with a negative index, which is
consistent with writing the oscillations as complex exponentials,
i.e. exp(i 2 pi f t) and exp(-i 2 pi f t) ... but that is too much math
to go into here.

So the fft values can be numbered like this

 fft[0], (fft[-1], fft[1]), ... (fft[-(N/2 -1)], fft[N/2-1]), fft[N/2]

where fft[0] is the amount of frequency = 0 signal, and fft[N/2] is the
amount of frequency = f_highest_audio = 1/(2*dt).

To make this clear, consider some small cases:

 N=4, sound=( s[0], s[1], s[2], s[3] ),
      fourier=( fft[0], (fft[-1], fft[1]), fft[2] )
 
 N=8,
  sound=( s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7] ),
  fourier=(fft[0], (fft[-1],fft[1]), (fft[-2],fft[2]), (fft[-3],fft[3]),fft[4])
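
One way to check this counting is to compute the cosine and sine sums
directly. The sketch below is a slow, direct sum rather than a real FFT,
and the function name and the four sample values are made up for
illustration. For each k it gives the two phases; at k=0 and k=N/2 the
sine sum is always zero, which leaves N independent values in total.

```python
import math

def cos_sin_components(s):
    """Return a list of (k, cos_sum, sin_sum) for k = 0 .. N/2.

    s is a list of samples with even length N.
    """
    N = len(s)
    rows = []
    for k in range(N // 2 + 1):
        c = sum(s[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        d = sum(s[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        rows.append((k, c, d))
    return rows

s = [1.0, 2.0, 0.0, -1.0]       # arbitrary N=4 sound, for illustration only
for k, c, d in cos_sin_components(s):
    print(k, round(c, 6), round(d, 6))
```

For this N=4 input the sums work out to one value at k=0, two values at
k=1, and one value at k=2, matching the layout shown for N=4.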

The audio frequency resolution is therefore

 df = f_highest_audio / (N/2) = 1/T = 1/(N*dt) = (1/N) * f_sampling

And the k'th component of the fft has frequency k*df.

When plotting a sound frequency spectrum, the typical axes are:

 x axis:  f = k*df = frequency, with k = (0 ... N/2)
 y axis:  sqrt(fft[k]**2 + fft[-k]**2) = combined magnitude of both phases
          (its square is proportional to the power at that frequency)
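
As a sketch of what the y axis holds, the code below computes that
magnitude for each k with a slow, direct sum (not a real FFT) and no
normalization. The function name, the small N = 64, and the test tone
placed exactly on index 5 are all assumptions made for this example.

```python
import math

def magnitude_spectrum(s):
    """Return sqrt(c**2 + d**2) for k = 0 .. N/2, where c and d are the
    cosine and sine sums of the even-length sample list s."""
    N = len(s)
    mags = []
    for k in range(N // 2 + 1):
        c = sum(s[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        d = sum(s[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        mags.append(math.hypot(c, d))
    return mags

N = 64
k0 = 5      # the test tone completes exactly 5 cycles in the block
s = [math.sin(2 * math.pi * k0 * n / N) for n in range(N)]

mags = magnitude_spectrum(s)
peak = max(range(len(mags)), key=lambda k: mags[k])
print(peak)     # 5
```

Because the tone fits a whole number of cycles into the block, all of its
magnitude lands on a single index; a tone that falls between two indices
would spread over its neighbors.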

In the example with N=2048 and f_sampling = 44.1kHz,

 df = 44.1kHz/2048 = 21.5 Hz.

That implies, for example, that an A at 440 Hz would show up at about
index k = 440Hz/df = 440 / 21.5 ~ 20, out of the N/2 = 1024 index steps
along the frequency axis ... so close to the left edge of the plot.
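
The same index arithmetic, as a short sketch in Python (the variable names
are chosen here for illustration):

```python
f_sampling = 44100
N = 2048
df = f_sampling / N                 # about 21.5 Hz between indices

f_note = 440.0                      # the A at 440 Hz
k_exact = f_note / df               # not a whole number
k = round(k_exact)                  # nearest index

print(f"df = {df:.2f} Hz, k_exact = {k_exact:.2f}, nearest k = {k}")
print(f"index {k} is centered at {k * df:.1f} Hz")
```

Since 440 Hz does not land exactly on a multiple of df, the note sits
between index 20 and index 21, a bit closer to 20.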

. . . . . . . . . . . . .

Are we having fun yet?

Jim Mahoney | cs.marlboro.edu | Feb 2017 | MIT License