Speed tests of multiply for four different scenarios:

  mul:   22 12-bit pieces with reduction at the end
  mul16: 17 16-bit pieces with reduction during the multiply
  mul24: 11 24-bit pieces with reduction during the multiply
  mul32:  8 32-bit pieces with reduction during the multiply

17 16-bit pieces are used instead of 16 pieces to handle the product of
two sums with no mod until the end, although the carries must be reduced
during the multiply.

The "prod" options specify using multiply product-scanning order (Comba)
vs. "op" for operand-scanning order.

If "add" option is specified, only mul and mul16 are tested, computing
the product of two sums.

mul and mul16 use only 32-bit data regardless of the use64 option.

timing.c is compiled with no optimization so the results may be
representative of similar code on a 32-bit PLC.

gcc 7.1.0 on a Dell PowerEdge 2650 with
  OS: Linux 2.6.32-696.1.1.el6.i686 i686
  processor: Intel(R) Xeon(TM) CPU 2.40GHz (32-bit)
  address sizes: 36 bits physical, 32 bits virtual

count = 1000000, nrand = 100, seed = 12345, add = 0, use64 = 0, prod,16,24,32 = 0 0 0 0
  mul:   5.84811 secs
  mul16: 4.87626 secs
  mul24: 10.7014 secs
  mul32: 9.61054 secs

count = 1000000, nrand = 100, seed = 12345, add = 0, use64 = 0, prod,16,24,32 = 1 1 1 1
  mul:   4.31134 secs
  mul16: 4.79127 secs
  mul24: 10.3384 secs
  mul32: 8.85865 secs

count = 1000000, nrand = 100, seed = 12345, add = 1, use64 = 0, prod,16,24,32 = 0 0 0 0
  mul:   6.17506 secs
  mul16: 5.2862 secs

count = 1000000, nrand = 100, seed = 12345, add = 1, use64 = 0, prod,16,24,32 = 1 1 1 1
  mul:   4.68029 secs
  mul16: 5.16322 secs
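
The difference between the "op" and "prod" orders for the 22-piece, 12-bit
layout can be illustrated with the minimal sketch below. It is not the code
from timing.c: the names NLIMB, LIMB_BITS, mul_op, mul_prod and carry are
hypothetical, and the modular reduction is left out because the modulus is
not stated here. It only shows why 12-bit pieces allow 32-bit accumulators
with no carry handling until the end: each column sums at most 22 products
of less than 2^24, which stays below 2^32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NLIMB     22                      /* assumed: 22 pieces, as in "mul" */
#define LIMB_BITS 12                      /* assumed: 12 bits per piece */
#define LIMB_MASK ((1u << LIMB_BITS) - 1u)

/* Operand scanning ("op"): for each piece of a, sweep over every piece of b
 * and add the partial product into the matching result column. */
static void mul_op(uint32_t r[2 * NLIMB],
                   const uint32_t a[NLIMB], const uint32_t b[NLIMB])
{
    int i, j;
    memset(r, 0, 2 * NLIMB * sizeof r[0]);
    for (i = 0; i < NLIMB; i++)
        for (j = 0; j < NLIMB; j++)
            r[i + j] += a[i] * b[j];
}

/* Product scanning ("prod", Comba): finish one result column at a time,
 * keeping the running sum in a local accumulator. */
static void mul_prod(uint32_t r[2 * NLIMB],
                     const uint32_t a[NLIMB], const uint32_t b[NLIMB])
{
    int i, k;
    for (k = 0; k < 2 * NLIMB - 1; k++) {
        int lo = (k < NLIMB) ? 0 : k - NLIMB + 1;
        int hi = (k < NLIMB) ? k : NLIMB - 1;
        uint32_t acc = 0;
        for (i = lo; i <= hi; i++)
            acc += a[i] * b[k - i];
        r[k] = acc;
    }
    r[2 * NLIMB - 1] = 0;                 /* top column receives carries only */
}

/* Deferred carry propagation: normalize every column back to 12 bits. */
static void carry(uint32_t r[2 * NLIMB])
{
    uint32_t c = 0;
    int k;
    for (k = 0; k < 2 * NLIMB; k++) {
        uint32_t t = r[k] + c;
        r[k] = t & LIMB_MASK;
        c = t >> LIMB_BITS;
    }
    assert(c == 0);                       /* a 264-bit square fits in 528 bits */
}
```

Both functions produce the same unnormalized columns, so a caller can run
either one and then call carry() once. The product-scanning version writes
each result word exactly once, which is one plausible reason it times faster
for mul in unoptimized builds.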