Developing efficient assembly language code shows how efficient a DSP processor can be: each assembler instruction performs several useful operations. But it also shows how difficult it can be to program such a specialised processor efficiently.
C                                           Assembler
temp = *c_ptr++ * *x_ptr--;                 a1 = *r3++ * *r4--
for (k = 1; k < N-1; k++)                   do 0,r1
    temp = temp + *c_ptr++ * *x_ptr--;      a1 = a1 + *r3++ * *r4--
*y_ptr++ = temp;                            *r2++ = a1
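As a rough illustration, the C half of the listing above can be fleshed out into a function that compiles on an ordinary compiler. This is only a sketch under stated assumptions: the names fir_sample, coeffs, newest and n_taps are ours rather than the original author's, and we assume the loop is meant to visit all n_taps coefficients. The input pointer is pre-decremented instead of post-decremented so that it never steps before the start of the sample buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Compute one FIR output sample:
 *     y = c[0]*x[n] + c[1]*x[n-1] + ... + c[n_taps-1]*x[n-(n_taps-1)]
 * coeffs points at c[0]; newest points at the most recent input sample x[n].
 * The caller must keep at least n_taps - 1 older samples stored immediately
 * before newest in memory. All names here are illustrative assumptions. */
float fir_sample(const float *coeffs, const float *newest, size_t n_taps)
{
    const float *c_ptr = coeffs;  /* walks forward through the coefficients */
    const float *x_ptr = newest;  /* walks backward through the samples */
    float temp;
    size_t k;

    assert(n_taps > 0);

    /* First product starts the accumulation. */
    temp = *c_ptr++ * *x_ptr;

    /* Each pass does one multiply, one add and two pointer updates:
     * the work a DSP processor aims to fold into a single instruction. */
    for (k = 1; k < n_taps; k++) {
        temp += *c_ptr++ * *--x_ptr;
    }
    return temp;
}
```

With coefficients 1, 2, 3 and samples 4, 5, 6 (6 being the newest), the function returns 1*6 + 2*5 + 3*4 = 28.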
Bear in mind that we use DSP processors to do specialised jobs fast. If cost is no object, then it may be permissible to throw away processor power by inefficient coding; but in that case we would perhaps be better advised to choose an easier processor to program in the first place. A sensible reason to use a DSP processor is to perform DSP either at lowest cost or at highest speed. In either case, wasting processor power leads to a need for more hardware, which makes a more expensive system, which leads to a more expensive final product, which, in a sane world, would lose sales to a competing product that was better designed.
One example shows how essential it is to make sure a DSP processor is programmed efficiently:
The diagram shows a single assembler instruction from the Lucent DSP32C processor. This instruction does a lot of things at once:
All of these operations can be done in one instruction. This is how the processor can be made fast. But if we don't use any of these operations, we are throwing away the potential of the processor and may be slowing it down drastically. Consider how this instruction can be translated into MIPS or Mflops.
The processor runs with an 80 MHz clock. But to achieve four memory accesses per instruction, it uses a modified von Neumann memory architecture which requires it to divide the system clock by four, resulting in an instruction rate of 20 MIPS. If we go into manic marketing mode, we can have fun working out ever higher MIPS or MOPS ratings as follows:
20 MIPS = 20 MOPS
but 2 floating point operations per cycle = 40 MOPS
and four memory accesses per instruction = 80 MOPS
plus three pointer increments per instruction = 60 MOPS
plus one floating point register update = 20 MOPS
making a grand total MOPS rating of 200 MOPS
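The marketing arithmetic above can be reproduced in a few lines of C. This is a minimal sketch that uses only the figures quoted in the text; the struct and function names are hypothetical and chosen purely for illustration.

```c
#include <assert.h>

/* Operations counted per instruction, as tallied in the text.
 * The struct and its field names are illustrative assumptions. */
struct op_counts {
    int flops;           /* floating point operations */
    int mem_accesses;    /* memory accesses */
    int ptr_increments;  /* address pointer increments */
    int reg_updates;     /* floating point register updates */
};

/* Instruction rate in MIPS: clock in MHz divided by clocks per instruction. */
int instruction_rate_mips(int clock_mhz, int clocks_per_instruction)
{
    assert(clocks_per_instruction > 0);
    return clock_mhz / clocks_per_instruction;
}

/* "Marketing" MOPS: every countable operation in every instruction. */
int mops_rating(int mips, struct op_counts ops)
{
    return mips * (ops.flops + ops.mem_accesses +
                   ops.ptr_increments + ops.reg_updates);
}
```

With the figures from the text, instruction_rate_mips(80, 4) gives 20, and mops_rating with 20 MIPS and counts of 2, 4, 3 and 1 gives 40 + 80 + 60 + 20 = 200.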
Which exercise serves to illustrate three things:
Of course, we have left out of the MOPS rating something that some manufacturers do count: the possibility of DMA on the serial port and parallel port, with all the associated increments of DMA address pointers. And if we had multiple comm ports, each with DMA, we could go really wild...
Apart from a cheap laugh at the expense of marketing, there is a very serious lesson to be drawn from this exercise. Suppose we only did adds with this processor? Then the Mflops rating falls from a respectable 40 Mflops to a pitiful 20 Mflops. And if we don't use the memory accesses, or the pointer increments, then we can cut the MOPS rating from 200 MOPS to 20 MOPS.
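To put a number on the loss, here is a sketch of a percentage-of-peak calculation using the ratings quoted above. The function name is an assumption made for illustration, and integer arithmetic is used for simplicity.

```c
#include <assert.h>

/* Percentage of the peak rating that is actually achieved.
 * Both arguments must be in the same units (Mflops or MOPS). */
int percent_of_peak(int achieved, int peak)
{
    assert(peak > 0);
    return (100 * achieved) / peak;
}
```

Using the figures in the text, percent_of_peak(20, 40) is 50 for the adds-only Mflops case, and percent_of_peak(20, 200) is 10 for the MOPS case.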
It is very easy indeed to write very inefficient DSP code. Luckily it is also quite easy, with a little care, to write very efficient DSP code.