Okay, so I decided to pick a random bug and work on it, and so I selected this one. I wrote a test case and ran it:
IOQuake3 BoxOnPlaneSide(): avg 184 milliseconds (1024 tests of 10M calls)
Proposed BoxOnPlaneSize(): avg 179 milliseconds (1024 tests of 10M calls)
This difference is small and basically negligible. Keep in mind this is the C version, not the x86 assembly version. If the assembly version is even marginally faster than the C version, this will drop this difference into epsilon-squared.
My guess:
The IOQuake3 BoxOnPlaneSide() uses a switch() statement. If the case is 7, then the first 6 cases must be checked first. The proposed version can be done as an unrolled loop. Otherwise, the fp-add/fp-mul is exactly the same...
I'm sure if you got GCC to generate a jump table the results would be basically the same.
SO WHAT IF THE DIFFERENCE WAS 500 MSEC?
==============================================
Let's put this into perspective. That's 500 milliseconds OVER 10M tests. That means for every 100,000 calls, you save 5 msec. Since there aren't even 100K calls to this function, probably 1K-5K at most, you'd be talking about saving microseconds per frame.
BUT
===
But the difference isn't 500, it's 5, so were talking tens to hundreds of nanoseconds per frame. This is basically no speed up. Light travels one meter in about 2.2 nanoseconds.
MY IDEAS
========
As for deleting the assembly, I don't think it is likely to be dramatically faster, and even if it was dramatically faster, I don't think it would be a big deal. It did exist for a reason when it was first written, so while a Phenom doesn't care whether the code is hand-optimized asm or just whatever GCC spits out, an older machine might. Replacing the C version more or less doesn't matter.
Patrick
I think the really nice thing about this patch is that it replaces 559
lines of unreadable x86 FPU assembly code with _nine_ lines of C that
are at least as fast, and maybe trivially faster. I think it's a no
brainer.
See for yourself. Save the patch and run
egrep -c '^\-[^-]' the.patch
to see how many lines of code were removed, and
egrep -c '^\+[^+]' the.patch
to see how many lines were added.
(In reply to comment #5)
> I think the really nice thing about this patch is that it replaces 559
> lines of unreadable x86 FPU assembly code with _nine_ lines of C that
> are at least as fast, and maybe trivially faster.
Correct me if I'm wrong, but it looks to me like the performance tests are comparing the old C version with the new C version, not the old x86 with the new C version? Assuming this is the case, it may not be faster.
In any case, this doesn't really matter; the cleaner code is probably justification in itself. Having said that, let me echo Ryan's comments; changes to fundamental math functions need to be definitely correct. Even trivially small differences in output can result in problems (prediction misses) as we must cooperate with legacy clients/servers which may use the old code.
So rather than tests for performance, tests for correctness are really what is needed here.
Two things:
1) Yes, Tim is correct, my test case compares the C version. The x87 FPU assembly is supposedly faster (I think it is imported from Quake[1]).
2) It is easy to show that the new code fragment is mathematically equivalent to the previous one. The crux of this algorithm is this formula:
dot(Plane.Normal, Point) + Plane.Dist = D, (signed) distance from plane to point. If you remove the "+ Plane.Dist" from the equation, you get D', the distance from point to a plane on the origin. When it happens that the point isn't a far out along the plane's normal as the plane is (i.e. D' < Plane.dist), then point is behind the plane, not in front of it.
Mathematically, a box has 8 points, and using the sign bits of the plane, selects 1 of the eight points which is furthest in the direction of the plane's normal, and then the opposite corner (can be furthest) from the plane.
In 2D, it is easier to see.
v0 v1
o====\==o
| \/|
| \|
| \
o=======o
v2 v3
In this, you can see the plane has a normal with +X and +Y components. Therefore it chooses V1, which is the vector {emaxs[0], emaxs[1]} and then V2 which is the vector {emins[0],emins[1]}.
The proposed code use a for() loop from i = 1 to 3 to perform the dot product, while id's code explicitly maps out every permutation of sign bits and expands the dot product.
In fact, there is no reason one couldn't do this like (inefficient!)
vec3_t P;
P1[0] = plane->signbits & 1 ? emins[0] : emaxs[0]; //check bit 0
P1[1] = plane->signbits & 2 ? emins[1] : emaxs[1]; //check bit 1
P1[2] = plane->signbits & 4 ? emins[2] : emaxs[2]; //check bit 2
dist1 = P1[0]*plane->normal[0] + P1[1]*plane->normal[1] + P1[2]*plane->normal[2]
//etc...
In that form, I think it is easier to understand the above mathematics.
Patrick
I modified Patricks Benchmark code to use the assembly code. On my notebook (Pentium m 1,6GHz throttled to 800MHz, 2GB RAM, debian Lenny with stock kernel) I got the following benchmark results (NR_TESTS = 1M, NR_REPEAT = 1024):
new C: avg of 68ms, 69ms, 67ms (3 test runs)
old ASM: avg of 76ms, 76ms, 76ms (3 test runs)
old C: avg of 84ms, 83ms, 84ms (3 test runs)
At least on my machine the new C code is faster than the old ASM code.
I also checked the correctness of the code (compared each case of the switch-case-statement in the old code with what would happen in the new code) and I'm quite certain that the proposed implementation is correct.
I attached a text file with some comments to the non-trivial code-snippets of each implementation.
Cheers,
Daniel
Right, so the difference is 16msec when run with 1M tests, so that means each function call saves 1.6e-3 x 1e-6 = 1.6e-9 or 1.6 nanoseconds, or less than 1.3 clock cycles at 800MHz. This is basically the episilon-squared I was referring to in the original post. =\
Most assuredly, the only things worth considering is a) readability of the code and b) support of older processors for which the code is optimized for (P6 arch).
Again, I vote in favor of whatever appears to be the most clear and understandable approach, or if no decision can be made, then just closing this bug already.
Patrick
I forgot one case: the default case in that switch statement. I don't know if its important (the comments suggest it was just added to shut the compiler up).
However, if the new implementation should behave *exactly* like the old one, there has to be made a minor change:
Add "if(p->type < 8)" just before the for()-loop.
This made the implementation slightly slower (73-74ms instead of 67-69ms), but it's still faster than the ASM version (76ms) or the old C version (83-84ms).
As Patrick said, this is no groundbreaking improvement.
But the new code is shorter than the old C code, slightly faster than the old ASM code, about 1.13 times as fast as the old C code (good for non-i386) and you could get rid of >700lines of assembly code.
So I'd suggest to accept the patch, possibly with my suggested fix for the default case (that probably shouldn't happen).
Cheers,
- Daniel
(In reply to comment #12)
> I forgot one case: the default case in that switch statement. I don't know if
> its important (the comments suggest it was just added to shut the compiler up).
> However, if the new implementation should behave *exactly* like the old one,
> there has to be made a minor change:
> Add "if(p->type < 8)" just before the for()-loop.
> This made the implementation slightly slower (73-74ms instead of 67-69ms), but
> it's still faster than the ASM version (76ms) or the old C version (83-84ms).
It should have been "if(p->signbits < 8)" before the for()-loop of course.. shouldn't have tested that idea at 6am :-/
Of course I've also benched that version and it even seems like it's a bit faster than the wrong version (~70ms, instead of 73-74ms with "if(p->type <8)").
I also altered the Benchmarking Code again to check correctness. It runs both implementations and compares the results (see attachement).
Cheers,
- Daniel
Created attachment 1755 [details] New implementation, deleting assembler versions.
Created attachment 2125 [details] Benchmark for proposed code. Benchmark (UNIX-based systems) for proposed code. Instructions included at top of file.
Created attachment 2129 [details] Benchmark for the proposed BoxOnPlaneSide() VS i386 ASM Version
Created attachment 2130 [details] the important code-snippets of each implementation with additional comments
Created attachment 2131 [details] Altered Benchmark-Code that now checks correctness