Fourier transform notes
------------------------

Say we have a sampled sound :

  s[0], s[1], s[2], ... , s[N-1]

     .-"-.     .-"-.     .-"-.     .-"-.     .-"-.
          "-.-"     "-.-"     "-.-"     "-.-"     "-

  (http://textart4u.blogspot.com/2012/03/waves-text-art-ascii-art.html)

where the points are the amplitude of the sound (a linear signal
level such as pressure or voltage, not decibels), taken at some
sampling rate.

  N  = number of points in the sample
       (also called the "block size" or "buffer size"
       in some software settings)
  dt = time per sample
     = elapsed time between s[n] and s[n+1]

The elapsed time for the whole sample is then

  T = sample total time = N * dt

Example:

  dt = 1/(44100) sec     (44100 per sec is a typical audio sampling rate)
     = 2.27e-5 sec
     = 0.0227 msec
  N  = 2048
  T  = 0.0464 sec

The sampling frequency is

  f_sampling = 1/dt = 44.1 kHz = 44100 samples per second

And if it takes B bits to represent one sample,
then the bitrate would be

  bitrate = B * f_sampling

The fastest oscillation possible within this signal looks like this :

  [1, -1, 1, -1, 1, ... ]

in which each sample is one half of the audio cycle. In other words,
it takes two sample points to make one up-and-down, sort-of-sine wave.

  f_highest_audio = 1/(2*dt) = (1/2) f_sampling = 22.05 kHz

A "Fourier transform" is a math procedure that converts the N sound
samples into N frequency values, where the frequencies go from f=0
(the time average of the signal, if it's nonzero) to f_highest_audio.

For all the frequencies except f=0 and f=f_highest_audio,
there are two independent phases (i.e. sin, cos) and so two values.
Often one of these two is numbered with a negative index,
which is consistent with writing the oscillations as
complex exponentials, i.e. exp(i f) and exp(- i f) ...
but that is too much math to go into here.

So the fft values can be numbered like this

  fft[0], (fft[-1], fft[1]), ... (fft[-(N/2 -1)], fft[N/2-1]), fft[N/2]

where fft[0] is the amount of frequency=0 signal,
and fft[N/2] is the amount of frequency = f_highest_audio = 1/(2*dt).
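As a rough check on the numbering above, here is a small Python sketch
(not part of the original notes). It assumes the complex-exponential
convention that most FFT libraries use, in which the two phases for
index k appear as the pair spectrum[k] and spectrum[-k]. The function
name dft, the block size of 8, and the test signals are made up for
illustration, and the naive O(N^2) sum stands in for a real FFT.

```python
import cmath

def dft(samples):
    """Naive O(N^2) discrete Fourier transform, complex-exponential convention."""
    n_points = len(samples)
    return [
        sum(s * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n, s in enumerate(samples))
        for k in range(n_points)
    ]

# Numbers from the example above.
f_sampling = 44100                 # samples per second
N = 2048                           # block size
dt = 1 / f_sampling                # seconds per sample
T = N * dt                         # seconds per block
f_highest_audio = 1 / (2 * dt)     # Hz

# The fastest possible oscillation, on a small block (assumed size 8).
small_N = 8
fastest = [(-1) ** n for n in range(small_N)]
spectrum = dft(fastest)
magnitudes = [round(abs(x), 6) for x in spectrum]

# An arbitrary real-valued signal, to show how index -k pairs with index k.
# Python's negative list indexing happens to match the notes' numbering.
arbitrary = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.25, -0.5]
pairs = dft(arbitrary)
is_conjugate_pair = all(
    abs(pairs[-k] - pairs[k].conjugate()) < 1e-9
    for k in range(1, small_N // 2)
)

print(f"dt = {dt:.3e} sec, T = {T:.4f} sec, f_highest_audio = {f_highest_audio:.0f} Hz")
print("magnitudes for [1, -1, 1, -1, ...]:", magnitudes)
print("fft[-k] is the conjugate of fft[k]:", is_conjugate_pair)
```

For the alternating signal, only index N/2 = 4 should come out nonzero,
which is the f_highest_audio slot. For any real-valued input, the -k and
+k entries carry the same two numbers (one cos part, one sin part), which
is why the N complex outputs hold only N independent values.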

To make this clear, consider some small cases:

  N=4, sound=( s[0], s[1], s[2], s[3] ),
       fourier=( fft[0], (fft[-1], fft[1]), fft[2] )

  N=8, sound=( s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7] ),
       fourier=( fft[0], (fft[-1],fft[1]), (fft[-2],fft[2]),
                 (fft[-3],fft[3]), fft[4] )

The audio frequency resolution is therefore

  df = f_highest_audio / (N/2) = 1/T = 1/(N*dt) = (1/N) * f_sampling

And the k'th component of the fft has frequency k*df.

When plotting a sound frequency spectrum, the typical axes are :

  x axis:  f = k*df = frequency, with k = (0 ... N/2)
  y axis:  sqrt(fft[k]**2 + fft[-k]**2)
           = combined magnitude of both phases (its square is the power)

In the example with N=2048 and f_sampling = 44.1kHz,
df = 44.1kHz/2048 = 21.5 Hz.

That implies for example that an A=440Hz would show up at about
an index k = 440Hz/df = 440 / 21.5 ~ 20, out of the 1024 (i.e. N/2)
different frequencies ... so close to the left of the plot.

. . . . . . . . . . . . .

Are we having fun yet?

Jim Mahoney | cs.marlboro.edu | Feb 2017 | MIT License
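
Postscript: the 440 Hz arithmetic in these notes can be tried out with
a short Python sketch. This is an illustration under stated assumptions,
not the author's code: it builds one block of a pure sine wave, uses a
naive complex DFT rather than a real FFT library, and takes abs() of the
complex value as the combined magnitude of the two phases. The names
dft_bin, twiddle, and tone_hz are made up for the example.

```python
import cmath
import math

f_sampling = 44100            # samples per second
N = 2048                      # block size
df = f_sampling / N           # frequency resolution in Hz
tone_hz = 440.0               # the A used as the example in the notes

# One block of a pure 440 Hz sine wave.
samples = [math.sin(2 * math.pi * tone_hz * n / f_sampling) for n in range(N)]

# Precompute exp(-2*pi*i*m/N) once, then reuse it for every bin.
twiddle = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]

def dft_bin(k):
    """Complex DFT value for frequency index k (naive, O(N) per bin)."""
    return sum(s * twiddle[(k * n) % N] for n, s in enumerate(samples))

# Magnitude for k = 0 ... N/2, which is the x axis of the spectrum plot.
magnitudes = [abs(dft_bin(k)) for k in range(N // 2 + 1)]
peak_k = max(range(len(magnitudes)), key=magnitudes.__getitem__)

print(f"df = {df:.1f} Hz")
print(f"440 Hz / df = {tone_hz / df:.2f}")
print(f"peak at k = {peak_k}, i.e. about {peak_k * df:.0f} Hz")
```

The peak should land at k = 20, the nearest whole index to 440/df.
Because 440 Hz falls between two indices, some of the signal also leaks
into the neighboring bins, so the plotted peak is a small bump rather
than a single spike.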