Speed tests of multiply for four different scenarios:

   mul: 22 12-bit pieces with reduction at the end
 mul16: 17 16-bit pieces with reduction during the multiply
 mul24: 11 24-bit pieces with reduction during the multiply
 mul32:  8 32-bit pieces with reduction during the multiply

mul16 uses 17 16-bit pieces instead of 16 so that it can handle the product of
two sums with no mod until the end, although the carries must still be reduced
during the multiply.
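
A minimal sketch of why the extra piece helps, assuming operands that fit in 16
pieces of 16 bits each, stored one piece per 32-bit word. The names N16 and
add16 are hypothetical and are not taken from timing.c. Adding two such values
can carry out of piece 15, and the 17th piece catches that carry so the sum can
go straight into a multiply without a mod first.

```c
#include <stdint.h>

#define N16 17   /* hypothetical: 16 data pieces plus one piece of headroom */

/* r = a + b, one 16-bit piece per uint32_t, carries propagated upward.
   Assumes every input piece is already below 2^16. */
void add16(uint32_t r[N16], const uint32_t a[N16], const uint32_t b[N16])
{
    uint32_t carry = 0;
    for (int i = 0; i < N16; i++) {
        uint32_t t = a[i] + b[i] + carry;  /* at most 2*65535 + 1, fits in 32 bits */
        r[i] = t & 0xFFFFu;
        carry = t >> 16;
    }
}
```

With only 16 pieces, the carry out of the top piece would be lost unless the
sum were reduced first.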

The "prod" options select product-scanning order (Comba) for the multiply
instead of operand-scanning order ("op").  If the "add" option is specified,
only mul and mul16 are tested, each computing the product of two sums.  mul and
mul16 use only 32-bit data regardless of the use64 option.
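
The two orders can be sketched as follows for the 22 x 12-bit layout. This is
an illustration under stated assumptions, not the code in timing.c: the names
NLIMBS, mul_operand_scan, mul_product_scan and check_orders_agree are
hypothetical, and the modular reduction is left out. With 12-bit pieces a whole
column of 22 products stays below 2^29, so 32-bit arithmetic is enough.

```c
#include <assert.h>
#include <stdint.h>

#define NLIMBS    22        /* 22 pieces, as in the "mul" case */
#define LIMB_BITS 12
#define LIMB_MASK 0xFFFu

/* Operand-scanning: for each piece of a, sweep across every piece of b
   and add the partial products into the result row by row. */
void mul_operand_scan(uint32_t r[2 * NLIMBS],
                      const uint32_t a[NLIMBS], const uint32_t b[NLIMBS])
{
    for (int i = 0; i < 2 * NLIMBS; i++)
        r[i] = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < NLIMBS; j++) {
            uint32_t t = r[i + j] + a[i] * b[j] + carry;
            r[i + j] = t & LIMB_MASK;
            carry = t >> LIMB_BITS;
        }
        r[i + NLIMBS] = carry;   /* this piece has not been written yet */
    }
}

/* Product-scanning (Comba): finish one result column at a time by summing
   every a[i] * b[k - i] that lands in column k. */
void mul_product_scan(uint32_t r[2 * NLIMBS],
                      const uint32_t a[NLIMBS], const uint32_t b[NLIMBS])
{
    uint32_t acc = 0;            /* running column sum, fits in 32 bits */
    for (int k = 0; k < 2 * NLIMBS - 1; k++) {
        int lo = (k < NLIMBS) ? 0 : k - NLIMBS + 1;
        int hi = (k < NLIMBS) ? k : NLIMBS - 1;
        for (int i = lo; i <= hi; i++)
            acc += a[i] * b[k - i];
        r[k] = acc & LIMB_MASK;
        acc >>= LIMB_BITS;       /* carry into the next column */
    }
    r[2 * NLIMBS - 1] = acc;
}

/* Sanity check: both orders must give the same unreduced product. */
void check_orders_agree(const uint32_t a[NLIMBS], const uint32_t b[NLIMBS])
{
    uint32_t r_op[2 * NLIMBS], r_prod[2 * NLIMBS];
    mul_operand_scan(r_op, a, b);
    mul_product_scan(r_prod, a, b);
    for (int i = 0; i < 2 * NLIMBS; i++)
        assert(r_op[i] == r_prod[i]);
}
```

Product scanning writes each result piece exactly once, while operand scanning
reads and rewrites the result array on every pass of the outer loop.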

timing.c is compiled with no optimization, so the results may be representative
of similar code on a 32-bit PLC.
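
For readers who want to reproduce this kind of measurement, a timing loop can
be sketched as below. This is an assumption about the general approach, not the
contents of timing.c: time_calls and bump_counter are hypothetical names, and
clock() from <time.h> is only one possible timer.

```c
#include <time.h>

/* Hypothetical stand-in workload: increments the counter it is handed. */
void bump_counter(void *p)
{
    (*(volatile long *)p)++;
}

/* Call fn(arg) `count` times and return the CPU time used, in seconds. */
double time_calls(void (*fn)(void *), void *arg, long count)
{
    clock_t start = clock();
    for (long i = 0; i < count; i++)
        fn(arg);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Building without optimization (for example with gcc -O0) keeps the compiler
from removing or restructuring the loop being measured.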

gcc 7.1.0 on a Dell PowerEdge 2650 with OS: Linux 2.6.32-696.1.1.el6.i686 i686
and processor: Intel(R) Xeon(TM) CPU 2.40GHz (32-bit)
address sizes: 36 bits physical, 32 bits virtual

count = 1000000, nrand = 100, seed = 12345, add = 0, use64 = 0, prod,16,24,32 = 0 0 0 0
  mul: 5.84811 secs
mul16: 4.87626 secs
mul24: 10.7014 secs
mul32: 9.61054 secs
count = 1000000, nrand = 100, seed = 12345, add = 0, use64 = 0, prod,16,24,32 = 1 1 1 1
  mul: 4.31134 secs
mul16: 4.79127 secs
mul24: 10.3384 secs
mul32: 8.85865 secs
count = 1000000, nrand = 100, seed = 12345, add = 1, use64 = 0, prod,16,24,32 = 0 0 0 0
  mul: 6.17506 secs
mul16: 5.2862 secs
count = 1000000, nrand = 100, seed = 12345, add = 1, use64 = 0, prod,16,24,32 = 1 1 1 1
  mul: 4.68029 secs
mul16: 5.16322 secs