Memory accesses are bottlenecks.
DSP processors can make multiple memory accesses in a single instruction cycle. But the inner loop of the FIR filter program requires four memory accesses: three reads, one for each of the operands, and one write of the result to memory. Even without counting the need to load the instruction, this exceeds the capacity of a DSP processor. For instance, the Lucent DSP32C can make four memory accesses per instruction cycle: two reads of operands, plus one write of the result, plus the read of one instruction. Even this is not enough for the simple line of C code that forms the inner loop of the FIR filter program.
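The access count can be made concrete with a small sketch. It assumes the unoptimised inner loop accumulates straight into *y_ptr; the function name fir_accumulate_in_memory and its access counter are hypothetical, introduced only for illustration. The input pointer is pre-decremented from one past the newest sample so that it never steps below the start of the array.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original text: accumulates one FIR
   output directly in memory and returns the number of data-memory
   accesses the loop performs (instruction fetches are not counted). */
size_t fir_accumulate_in_memory(float *y_ptr, const float *c,
                                const float *x, size_t taps)
{
    const float *c_ptr = c;          /* first coefficient */
    const float *x_ptr = x + taps;   /* one past the newest input sample */
    size_t accesses = 0;
    size_t k;

    for (k = 0; k < taps; k++) {
        /* read *y_ptr, read *c_ptr, read *x_ptr, then write *y_ptr */
        *y_ptr = *y_ptr + *c_ptr++ * *--x_ptr;
        accesses += 4;
    }
    return accesses;
}
```

With three taps the loop makes twelve data accesses, four per pass, which is more than the two operand reads and one write that the DSP32C figures above allow in a cycle.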
Fortunately, DSP processors have lots of registers which can be used to hold values inside the processor for later use, thus economising on memory accesses. We can see that the result of the inner loop is used again and again during the loop: as the code is written, it has to be read from memory and then written back to memory in each pass. Making this a register variable will allow it to be held within the processor, thus saving two memory accesses:
register float temp;

temp = 0.0;
for (k = 0; k < N; k++)
    temp = temp + *c_ptr++ * *x_ptr--;
The C declaration 'register float temp' asks the compiler to hold the variable temp in a processor register: in this case, a floating point register. The inner loop now only requires two memory accesses, to read the two operands *c_ptr and *x_ptr (three accesses if you count the instruction load). This is now within the capabilities of the DSP processor in a single instruction cycle.
A small point to note is that initialising the register variable with temp = 0.0 does no useful arithmetic. Since the variable has to be initialised anyway, it is simple to use the initialisation to make the first calculation, thus reducing the number of iterations of the inner loop by one:
register float temp;

temp = *c_ptr++ * *x_ptr--;
for (k = 1; k < N; k++)
    temp = temp + *c_ptr++ * *x_ptr--;
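As a rough check that the two forms of the loop agree, the sketch below wraps each in a function. The names fir_sum_zero_init and fir_sum_peeled are hypothetical, and so is the guard for an empty filter, which is added because the shortened loop reads one product before it tests the loop count and therefore assumes at least one tap.

```c
#include <stddef.h>

/* Hypothetical helper: accumulator starts at zero, loop runs taps times. */
float fir_sum_zero_init(const float *c, const float *x, size_t taps)
{
    const float *c_ptr = c;
    const float *x_ptr = x + taps;   /* one past the newest input sample */
    register float temp = 0.0f;
    size_t k;

    for (k = 0; k < taps; k++)
        temp = temp + *c_ptr++ * *--x_ptr;
    return temp;
}

/* Hypothetical helper: the first product initialises the accumulator,
   so the loop runs taps - 1 times. */
float fir_sum_peeled(const float *c, const float *x, size_t taps)
{
    const float *c_ptr = c;
    const float *x_ptr = x + taps;   /* one past the newest input sample */
    register float temp;
    size_t k;

    if (taps == 0)
        return 0.0f;                 /* assumed guard: nothing to multiply */

    temp = *c_ptr++ * *--x_ptr;      /* replaces temp = 0.0 and pass k = 0 */
    for (k = 1; k < taps; k++)
        temp = temp + *c_ptr++ * *--x_ptr;
    return temp;
}
```

Both functions visit the same products in the same order, so the only difference is the one addition to zero that the second form avoids.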
This leads to a more efficient C program for the FIR filter:
float y[N], c[N], x[N];
float *y_ptr, *c_ptr, *x_ptr;
register float temp;
int n, k;

y_ptr = &y[0];
for (n = 0; n < N-1; n++)
{
    c_ptr = &c[0];                  /* first coefficient */
    x_ptr = &x[N-1];                /* newest input sample */
    temp = *c_ptr++ * *x_ptr--;     /* first product initialises temp */
    for (k = 1; k < N; k++)
        temp = temp + *c_ptr++ * *x_ptr--;
    *y_ptr++ = temp;                /* one write to memory per output */
}
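One way the fragment might be packaged as a callable function is sketched below. The function name fir_block, its parameters, and the buffer layout are assumptions rather than part of the original program. In particular, the sketch assumes that the input buffer holds num_outputs + taps - 1 samples, oldest first, and that the input window moves on by one sample for each output, whereas the listing above restarts from x[N-1] on every pass.

```c
#include <stddef.h>

/* Hypothetical wrapper, not from the original text.
   c: taps coefficients (taps >= 1 is assumed).
   x: num_outputs + taps - 1 input samples, oldest first (assumed layout).
   y: receives num_outputs results. */
void fir_block(float *y, const float *c, const float *x,
               size_t taps, size_t num_outputs)
{
    size_t n, k;

    for (n = 0; n < num_outputs; n++) {
        const float *c_ptr = c;
        /* one past the newest sample that contributes to output n */
        const float *x_ptr = x + n + taps;
        /* first product initialises the register accumulator */
        register float temp = *c_ptr++ * *--x_ptr;

        for (k = 1; k < taps; k++)
            temp = temp + *c_ptr++ * *--x_ptr;
        y[n] = temp;                 /* single write to memory per output */
    }
}
```

For example, with coefficients {1, 2} and inputs {1, 2, 3, 4}, a call with taps = 2 and num_outputs = 3 fills y with 4, 7 and 10. The inner loop still touches memory only to read the two operands, as in the listing above.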