My target FPU for wm-FPU-emu is that described in the Intel486 Programmer's Reference Manual (1992 edition). Unfortunately, numerous facets of the functioning of the FPU are not well covered in the Reference Manual. The information in the manual has been supplemented with measurements on real 80486's. Even so, it is simply not possible to be sure that all of the peculiarities of the 80486 have been discovered, so there are always likely to be obscure differences between the detailed behaviour of the emulator and that of a real 80486.

wm-FPU-emu does not implement all of the behaviour of the 80486 FPU,
but is very close. See **Limitations**
for a list of some differences.

Please report bugs, etc to me at: billm@melbpc.org.au

Bill Metzenthen October 1996

- Add, subtract, and multiply. Nothing remarkable in these.
- Divide has been tuned to get reasonable performance. The algorithm is not the obvious one which most people seem to use, but is designed to take advantage of the characteristics of the 80386. I expect that it has been invented many times before I discovered it, but I have not seen it. It is based upon one of those ideas which one carries around for years without ever bothering to check it out.
- The sqrt function has been tuned to get good performance. It is based upon Newton's classic method. Performance was improved by capitalizing upon the properties of Newton's method, and the code is once again structured taking account of the 80386 characteristics.
- The trig, log, and exp functions are based in each case upon quasi-"optimal" polynomial approximations. My definition of "optimal" was based upon getting good accuracy with reasonable speed.
- The argument-reduction code for the trig functions effectively uses a value of pi which is accurate to more than 128 bits. As a consequence, the reduced argument is accurate to more than 64 bits for arguments up to a few pi, and for most larger arguments as well, even those approaching 2^63. This is far superior to an 80486, which uses an internal value of pi which is accurate to about 66 bits.
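
As a rough illustration of Newton's method applied to square roots, the sketch below computes an integer square root in Python. It is not the emulator's code, which is tuned for the 80386; the function name `newton_isqrt` and the choice of starting guess are assumptions made for this example.

```python
def newton_isqrt(n: int) -> int:
    """Return floor(sqrt(n)) for a non-negative integer n, by Newton's method."""
    if n < 0:
        raise ValueError("square root of a negative number")
    if n == 0:
        return 0
    # Start at a power of two that is >= sqrt(n), so the iterates only decrease.
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        # Newton step for f(x) = x*x - n, carried out in integer arithmetic.
        y = (x + n // x) // 2
        if y >= x:
            return x
        x = y


# sqrt(2) as a fixed-point number with 63 fraction bits, i.e. a 64-bit
# significand of the size an extended-precision result holds.
SQRT2_SIGNIFICAND = newton_isqrt(2 << 126)
```

Once the iterate is close to the root, each Newton step roughly doubles the number of correct bits, which is the property a tuned implementation can exploit by starting from a good first guess.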

The code of the emulator is complicated slightly by the need to account for a limited form of re-entrancy. Normally, the emulator will emulate each FPU instruction to completion without interruption. However, it may happen that when the emulator is accessing the user memory space, swapping may be needed. In this case the emulator may be temporarily suspended while disk i/o takes place. During this time another process may use the emulator, thereby perhaps changing static variables. The code which accesses user memory is confined to five files:

- fpu_entry.c
- reg_ld_str.c
- load_store.c
- get_address.c
- errors.c

As from version 1.12 of the emulator, no static variables are used (apart from those in the kernel's per-process tables). The emulator is therefore now fully re-entrant, rather than having just the restricted form of re-entrancy which is required by the Linux kernel.
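
The hazard described above can be mimicked with a toy model. In this sketch, Python generators stand in for processes, and a `yield` stands in for the point where the emulator is suspended during disk i/o. All names here (`emulate_with_static`, `emulate_reentrant`, `run_interleaved`) are hypothetical and do not correspond to anything in the emulator source.

```python
def emulate_with_static(shared, operand):
    """One 'instruction' that keeps its operand in shared (static-like) storage."""
    shared["operand"] = operand
    yield  # suspended here, e.g. while a page is swapped in
    return shared["operand"] * 2


def emulate_reentrant(operand):
    """The same 'instruction', keeping its operand in per-call state instead."""
    local = operand
    yield  # suspended here as well, but nothing shared can be overwritten
    return local * 2


def run_interleaved(task_a, task_b):
    """Start A, start B while A is suspended, then let both run to completion."""
    next(task_a)
    next(task_b)
    results = []
    for task in (task_a, task_b):
        try:
            next(task)
        except StopIteration as done:
            results.append(done.value)
    return results


shared = {}
clobbered = run_interleaved(emulate_with_static(shared, 1.5),
                            emulate_with_static(shared, 4.0))
safe = run_interleaved(emulate_reentrant(1.5), emulate_reentrant(4.0))
```

In the first run, task A resumes after task B has overwritten the shared operand, so both tasks compute from B's value; in the second run each task keeps its own operand.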

- the operands have a higher precision than the current setting of the precision control (PC) flags.
- the underflow exception is masked.
- the magnitude of the exact result (before rounding) is less than 2^-16382.
- the magnitude of the final result (after rounding) is exactly 2^-16382.
- the magnitude of the exact result would be exactly 2^-16382 if the operands were rounded to the current precision before the arithmetic operation was performed.

        movl %esp,[%ebx]
        fld1

The FPU instruction may be (usually will be) loaded into the pre-fetch queue of the cpu before the mov instruction is executed. If the destination of the 'movl' overlaps the FPU instruction then the bytes in the prefetch queue and memory will be inconsistent when the FPU instruction is executed. The emulator will be invoked but will not be able to find the instruction which caused the device-not-present exception. For this case, the emulator cannot emulate the behaviour of an 80486DX.

| function | Turbo C | djgpp 1.06 | WM-emu387 | wm-FPU-emu |
|----------|---------|------------|-----------|------------|
| + | 60.5 | 154.8 | 76.5 | 139.4 |
| - | 61.1-65.5 | 157.3-160.8 | 76.2-79.5 | 142.9-144.7 |
| * | 71.0 | 190.8 | 79.6 | 146.6 |
| / | 61.2-75.0 | 261.4-266.9 | 75.3-91.6 | 142.2-158.1 |
| sin() | 310.8 | 4692.0 | 319.0 | 398.5 |
| cos() | 284.4 | 4855.2 | 308.0 | 388.7 |
| tan() | 495.0 | 8807.1 | 394.9 | 504.7 |
| atan() | 328.9 | 4866.4 | 601.1 | 419.5-491.9 |
| sqrt() | 128.7 | crashed | 145.2 | 227.0 |
| log() | 413.1-419.1 | 5103.4-5354.21 | 254.7-282.2 | 409.4-437.1 |
| exp() | 479.1 | 6619.2 | 469.1 | 850.8 |

The performance under Linux is improved by the use of look-ahead code. The following results show the improvement which is obtained under Linux due to the look-ahead code. Also given are the times for the original Linux emulator with the 4.1 'soft' lib.

[ Linus' note: I changed look-ahead to be the default under linux, as there was no reason not to use it after I had edited it to be disabled during tracing ]

| function | wm-FPU-emu w look-ahead | original w 'soft' lib |
|----------|-------------------------|-----------------------|
| + | 106.4 | 190.2 |
| - | 108.6-111.6 | 192.4-216.2 |
| * | 113.4 | 193.1 |
| / | 108.8-124.4 | 700.1-706.2 |
| sin() | 390.5 | 2642.0 |
| cos() | 381.5 | 2767.4 |
| tan() | 496.5 | 3153.3 |
| atan() | 367.2-435.5 | 2439.4-3396.8 |
| sqrt() | 195.1 | 4732.5 |
| log() | 358.0-387.5 | 3359.2-3390.3 |
| exp() | 619.3 | 4046.4 |

These figures are now somewhat out-of-date. The emulator has become progressively slower for most functions as more of the 80486 features have been implemented.

The results of the basic arithmetic functions (+,-,*,/), and fsqrt match those of an 80486 FPU. They are the best possible; the error for these never exceeds 1/2 an lsb. The fprem and fprem1 instructions return exact results; they have no error.

The following table compares the emulator accuracy for the sqrt(), trig and log functions against the Turbo C "emulator". For this table, each function was tested at about 400 points. Ideal worst-case results would be 64 bits. The reduced Turbo C accuracy of cos() and tan() for arguments greater than pi/4 can be thought of as being related to the precision of the argument x; e.g. an argument of pi/2-(1e-10) which is accurate to 64 bits can result in a relative accuracy in cos() of about 64 + log2(cos(x)) = 31 bits.

| Function | Tested x range | Worst result (relative bits) | Turbo C |
|----------|----------------|------------------------------|---------|
| sqrt(x) | 1 .. 2 | 64.1 | 63.2 |
| atan(x) | 1e-10 .. 200 | 64.2 | 62.8 |
| cos(x) | 0 .. pi/2-(1e-10) | 64.4 (x <= pi/4) | 62.4 |
| | | 64.1 (x = pi/2-(1e-10)) | 31.9 |
| sin(x) | 1e-10 .. pi/2 | 64.0 | 62.8 |
| tan(x) | 1e-10 .. pi/2-(1e-10) | 64.0 (x <= pi/4) | 62.1 |
| | | 64.1 (x = pi/2-(1e-10)) | 31.9 |
| exp(x) | 0 .. 1 | 63.1 ** | 62.9 |
| log(x) | 1+1e-6 .. 2 | 63.8 ** | 62.1 |

** The accuracy for exp() and log() is low because two FPU operations are required to compute them (they are not directly supported).
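
The figure of about 31 bits quoted for cos() near pi/2 can be reproduced with a few lines of arithmetic. The sketch below simply evaluates the expression 64 + log2(cos(x)) given above; the helper name `relative_bits_near_half_pi` is made up for this example.

```python
import math


def relative_bits_near_half_pi(delta: float, argument_bits: int = 64) -> float:
    """Bits of relative accuracy left in cos(pi/2 - delta) when the argument
    carries `argument_bits` bits, using argument_bits + log2(cos(x))."""
    # cos(pi/2 - delta) equals sin(delta). Using sin avoids forming
    # pi/2 - delta in double precision, which would itself lose accuracy.
    return argument_bits + math.log2(math.sin(delta))


bits = relative_bits_near_half_pi(1e-10)
```

For delta = 1e-10 the result is a little under 31 bits, in line with the 31.9 bits measured for Turbo C at that argument.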

The emulator passes the
**'paranoia'**
tests (compiled with gcc 2.3.3 or
later) for 'float' variables (24 bit precision numbers) when precision
control is set to 24, 53 or 64 bits, and for 'double' variables (53
bit precision numbers) when precision control is set to 53 bits (a
properly performing FPU cannot pass the
**'paranoia'** tests for 'double'
variables when precision control is set to 64 bits).

The code for reducing the argument for the trig functions (fsin, fcos, fptan and fsincos) has been improved and now effectively uses a value for pi which is accurate to more than 128 bits precision. As a consequence, the accuracy of these functions for large arguments has been dramatically improved (and is now very much better than an 80486 FPU). There is also now no degradation of accuracy for fcos and fptan for operands close to pi/2. Measured results are (note that the definition of accuracy has changed slightly from that used for the above table):

| Function | Tested x range | Worst result (absolute bits) |
|----------|----------------|------------------------------|
| cos(x) | 0 .. 9.22e+18 | 62.0 |
| sin(x) | 1e-16 .. 9.22e+18 | 62.1 |
| tan(x) | 1e-16 .. 9.22e+18 | 61.8 |

It is possible with some effort to find very large arguments which give much degraded precision. For example, the integer number

    8227740058411162616.0

is within about 1e-7 of a multiple of pi. To find the tan (for example) of this number to 64 bits precision it would be necessary to have a value of pi which had about 150 bits precision. The FPU emulator computes the result to about 42.6 bits precision (the correct result is about -9.739715e-8). On the other hand, an 80486 FPU returns 0.01059, which in relative terms is hopelessly inaccurate.
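
The effect of the precision of pi on this argument can be checked with exact rational arithmetic. The following is a minimal sketch, not the emulator's reduction code: it reduces modulo pi rather than modulo pi/2 or pi/4, it assumes that a 66-bit value of pi means 2 integer bits plus 64 fraction bits, and the names `PI_HI`, `PI_66` and `reduce_mod_pi` are invented for the example.

```python
from fractions import Fraction

# pi to 50 decimal places (about 166 bits), ample for this illustration.
PI_HI = Fraction("3.14159265358979323846264338327950288419716939937510")
# pi rounded to 66 significant bits: 2 integer bits plus 64 fraction bits.
PI_66 = Fraction(round(PI_HI * 2**64), 2**64)


def reduce_mod_pi(x: int, pi: Fraction) -> Fraction:
    """Return x - k*pi exactly, with k the integer nearest to x/pi."""
    k = round(x / pi)
    return x - k * pi


X = 8227740058411162616
r_hi = float(reduce_mod_pi(X, PI_HI))  # reduced argument with high-precision pi
r_66 = float(reduce_mod_pi(X, PI_66))  # reduced argument with 66-bit pi
```

For arguments this small the tangent is very nearly equal to the reduced argument itself, so `r_hi` and `r_66` can be compared directly with the two results quoted above: the tiny error in the 66-bit pi is multiplied by a factor of roughly 2.6e18 during the reduction.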

For arguments close to critical angles (which occur at multiples of pi/2) the emulator is more accurate than an 80486 FPU. For very large arguments, the emulator is far more accurate.

Prior to version 1.20 of the emulator, the accuracy of the results for the transcendental functions (in their principal range) was not as good as the results from an 80486 FPU. From version 1.20, the accuracy has been considerably improved and these functions now give measured worst-case results which are better than the worst-case results given by an 80486 FPU.

The following table gives the measured results for the emulator. The number of randomly selected arguments in each case is about half a million. The group of three columns gives the frequency of the given accuracy in number of times per million, thus the second of these columns shows that an accuracy of between 63.80 and 63.89 bits was found at a rate of 133 times per one million measurements for fsin. The results show that the fsin, fcos and fptan instructions return results which are in error (i.e. less accurate than the best possible result (which is 64 bits)) for about one per cent of all arguments between -pi/2 and +pi/2. The other instructions have a lower frequency of results which are in error. The last two columns give the worst accuracy which was found (in bits) and the approximate value of the argument which produced it.

| instr | arg range | # tests | 63.7 bits (freq per M) | 63.8 bits (freq per M) | 63.9 bits (freq per M) | worst (bits) | at arg |
|-------|-----------|---------|------|------|-------|-------|----------|
| fsin | (0,pi/2) | 547756 | 0 | 133 | 10673 | 63.89 | 0.451317 |
| fcos | (0,pi/2) | 547563 | 0 | 126 | 10532 | 63.85 | 0.700801 |
| fptan | (0,pi/2) | 536274 | 11 | 267 | 10059 | 63.74 | 0.784876 |
| fpatan | 4 quadrants | 517087 | 0 | 8 | 1855 | 63.88 | 0.435121 (4q) |
| fyl2x | (0,20) | 541861 | 0 | 0 | 1323 | 63.94 | 1.40923 (x) |
| fyl2xp1 | (-.293,.414) | 520256 | 0 | 0 | 5678 | 63.93 | 0.408542 (x) |
| f2xm1 | (-1,1) | 538847 | 4 | 481 | 6488 | 63.79 | 0.167709 |

Tests performed on an 80486 FPU showed results of lower accuracy. The following table gives the results which were obtained with an AMD 486DX2/66 (other tests indicate that an Intel 486DX produces identical results). The tests were basically the same as those used to measure the emulator (the values, being random, were in general not the same). The total number of tests for each instruction is given at the end of the table; in each case about 100k tests were performed. Another line of figures at the end of the table shows that most of the instructions return results which are in error for more than 10 percent of the arguments tested.

The numbers in the body of the table give the approximate number of times a result of the given accuracy in bits (given in the left-most column) was obtained per one million arguments. For three of the instructions, two columns of results are given:

- The second column for f2xm1 gives the number of cases where the results of the first column were for a positive argument. This shows that this instruction gives better results for positive arguments than it does for negative.
- In the cases of fcos and fptan, the first column gives the results after all cases with arguments greater than 1.5 were removed from the results given in the second column. Unlike the emulator, an 80486 FPU returns results of relatively poor accuracy for these instructions when the argument approaches pi/2.

The table does not show those cases where the accuracy of the results was less than 62 bits, which occurs quite often for fcos and fptan when the argument approaches pi/2. This poor accuracy is discussed above in relation to the Turbo C "emulator", and the accuracy of the value of pi.

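| bits | f2xm1 (all) | f2xm1 (arg > 0) | fpatan | fcos (arg <= 1.5) | fcos (all) | fyl2x | fyl2xp1 | fsin | fptan (arg <= 1.5) | fptan (all) |
|------|-------|-------|--------|-------|-------|-------|---------|-------|-------|-------|
| 62.0 | 0 | 0 | 0 | 0 | 437 | 0 | 0 | 0 | 0 | 925 |
| 62.1 | 0 | 0 | 10 | 0 | 894 | 0 | 0 | 0 | 0 | 1023 |
| 62.2 | 14 | 0 | 0 | 0 | 1033 | 0 | 0 | 0 | 0 | 945 |
| 62.3 | 57 | 0 | 0 | 0 | 1202 | 0 | 0 | 0 | 0 | 1023 |
| 62.4 | 385 | 0 | 0 | 10 | 1292 | 0 | 23 | 0 | 0 | 1178 |
| 62.5 | 1140 | 0 | 0 | 119 | 1649 | 0 | 39 | 0 | 0 | 1149 |
| 62.6 | 2037 | 0 | 0 | 189 | 1620 | 0 | 16 | 0 | 0 | 1169 |
| 62.7 | 5086 | 14 | 0 | 646 | 2315 | 10 | 101 | 35 | 39 | 1402 |
| 62.8 | 8818 | 86 | 0 | 984 | 3050 | 59 | 287 | 131 | 224 | 2036 |
| 62.9 | 11340 | 1355 | 0 | 2126 | 4153 | 79 | 605 | 357 | 321 | 1948 |
| 63.0 | 15557 | 4750 | 0 | 3319 | 5376 | 246 | 1281 | 862 | 808 | 2688 |
| 63.1 | 20016 | 8288 | 0 | 4620 | 6628 | 511 | 2569 | 1723 | 1510 | 3302 |
| 63.2 | 24945 | 11127 | 10 | 6588 | 8098 | 1120 | 4470 | 2968 | 2990 | 4724 |
| 63.3 | 25686 | 12382 | 69 | 8774 | 10682 | 1906 | 6775 | 4482 | 5474 | 7236 |
| 63.4 | 29219 | 14722 | 79 | 11109 | 12311 | 3094 | 9414 | 7259 | 8912 | 10587 |
| 63.5 | 30458 | 14936 | 393 | 13802 | 15014 | 5874 | 12666 | 9609 | 13762 | 15262 |
| 63.6 | 32439 | 16448 | 1277 | 17945 | 19028 | 10226 | 15537 | 14657 | 19158 | 20346 |
| 63.7 | 35031 | 16805 | 4067 | 23003 | 23947 | 18910 | 20116 | 21333 | 25001 | 26209 |
| 63.8 | 33251 | 15820 | 7673 | 24781 | 25675 | 24617 | 25354 | 24440 | 29433 | 30329 |
| 63.9 | 33293 | 16833 | 18529 | 28318 | 29233 | 31267 | 31470 | 27748 | 29676 | 30601 |
| Per cent with error | 30.9 | | 3.2 | | 18.5 | 9.8 | 13.1 | 11.6 | | 17.4 |
| Total arguments tested | 70194 | 70099 | 101784 | 100641 | 100641 | 101799 | 128853 | 114893 | 102675 | 102675 |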
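
The "Per cent with error" line can be cross-checked against the body of the table by adding up a column of per-million frequencies. The sketch below does this for three columns copied from the 80486 table (the first f2xm1 column, fpatan and fyl2x); the names `PER_MILLION` and `percent_in_error` are hypothetical.

```python
# Frequencies per million for accuracies 62.0, 62.1, ... 63.9 bits,
# copied from the 80486 table above.
PER_MILLION = {
    "f2xm1": [0, 0, 14, 57, 385, 1140, 2037, 5086, 8818, 11340,
              15557, 20016, 24945, 25686, 29219, 30458, 32439, 35031,
              33251, 33293],
    "fpatan": [0, 10, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 10, 69, 79, 393, 1277, 4067, 7673, 18529],
    "fyl2x": [0, 0, 0, 0, 0, 0, 0, 10, 59, 79,
              246, 511, 1120, 1906, 3094, 5874, 10226, 18910, 24617, 31267],
}


def percent_in_error(per_million):
    """Percentage of results that fell short of the best possible 64 bits."""
    return sum(per_million) / 1_000_000 * 100


summary = {name: round(percent_in_error(column), 1)
           for name, column in PER_MILLION.items()}
```

The sums agree with the quoted 30.9, 3.2 and 9.8 per cent. The same check on fcos or fptan falls short of the quoted figure, because results worse than 62 bits are not listed in the table.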

- Linus Torvalds
- Tommy.Thorn@daimi.aau.dk
- Andrew.Tridgell@anu.edu.au
- Nick Holloway, alfie@dcs.warwick.ac.uk
- Hermano Moura, moura@dcs.gla.ac.uk
- Jon Jagger, J.Jagger@scp.ac.uk
- Lennart Benschop
- Brian Gallew, geek+@CMU.EDU
- Thomas Staniszewski, ts3v+@andrew.cmu.edu
- Martin Howell, mph@plasma.apana.org.au
- M Saggaf, alsaggaf@athena.mit.edu
- Peter Barker, PETER@socpsy.sci.fau.edu
- tom@vlsivie.tuwien.ac.at
- Dan Russel, russed@rpi.edu
- Daniel Carosone, danielce@ee.mu.oz.au
- cae@jpmorgan.com
- Hamish Coleman, t933093@minyos.xx.rmit.oz.au
- Bruce Evans, bde@kralizec.zeta.org.au
- Timo Korvola, Timo.Korvola@hut.fi
- Rick Lyons, rick@razorback.brisnet.org.au
- Rick, jrs@world.std.com

...and numerous others who responded to my request for help with a real 80486.