Opened 11 years ago

Last modified 4 years ago

#3281 new enhancement

ffs() arch optimization

Reported by: tqh Owned by: tqh
Priority: low Milestone: R1
Component: System/libroot.so Version:
Keywords: Cc: fredrik.holmqvist@…
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

Our ffs can be optimized at least for x86. Providing a test-program based on FreeBsd's ffs. No real implementation as I don't know how you want to do arch optimizations and fallbacks.

On newer x86, it can even be done in two ops with cmov, but that would require runtime configuration which might be overkill. Other platforms tries to inline function as well.

Attachments (3)

ffs_arch_x86_opt.patch (6.8 KB ) - added by tqh 10 years ago.
x86 ffs optimization
ffs_posix_x86_opt.patch (3.0 KB ) - added by tqh 10 years ago.
Final version?
test_ffs.cpp (3.4 KB ) - added by tqh 9 years ago.
Updated test with new example, and also with builtin version.

Download all attachments as: .zip

Change History (15)

comment:1 by axeld, 10 years ago

Priority: normallow
Type: bugenhancement

by tqh, 10 years ago

Attachment: ffs_arch_x86_opt.patch added

x86 ffs optimization

comment:2 by tqh, 10 years ago

Learning by doing :) Might have missed something in the patch, but it works fine here.

comment:3 by tqh, 10 years ago

Cc: fredrik.holmqvist@… added

comment:4 by tqh, 10 years ago

This patch may be overkill though and the function just defined in the arch_string.S, I just wanted to learn to be able to go all the way to runtime cpu-optimized functions.

by tqh, 10 years ago

Attachment: ffs_posix_x86_opt.patch added

Final version?

comment:5 by axeld, 10 years ago

Owner: changed from axeld to tqh
Version: R1/pre-alpha1

I guess that's better on your plate now :-)

comment:6 by jackburton, 10 years ago

Also check out this one: http://osdir.com/ml/solaris.opensolaris.performance/2006-01/msg00000.html

Hi,

Not sure if this can be of any interest, but I noticed the ffs function is
implemented using a loop.

A faster implementation exists using the "pcount" assembler instruction (where
available and fast - this does not seem to be the case on all processors), or
through another algorithm using only a few instructions.

Check out: http://supertech.csail.mit.edu/papers/debruijn.pdf

For a 32-bit word, the optimized function could be written this way:

int
ffs(int field)
{
static const int index[] = { 1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21,
16, 26, 18, 5, 9, 32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10 };
unsigned int w = field;
if (w == 0)
return (0);
w &= -w;
w *= 125613361U;
w >>= 27;
return index[w];
}

Regards,
Olivier.
This message posted from opensolaris.org

comment:7 by tqh, 9 years ago

It's faster as long as the result is 9 or higher.

comment:8 by korli, 9 years ago

I'd prefer to let GCC use its built-in ffs (I mean at least on GCC 4).

comment:9 by tqh, 9 years ago

Sounds like a plan. Other OS'es seem to try to inline functions like these, it would be nice to try to reduce function overhead on all these small helper functions. Any ideas in this regard or experiments to perform?

by tqh, 9 years ago

Attachment: test_ffs.cpp added

Updated test with new example, and also with builtin version.

comment:10 by tqh, 9 years ago

For gcc 4 builtin_ffs is always at least twice as fast. It should probably be ok on gcc2.95 as well. Should we just change the header to #define ffs(x) builtin_ffs(x) ?
Since we use gcc on other archs it should work there as well.

comment:11 by waddlesplash, 4 years ago

Milestone: R1Unscheduled

comment:12 by waddlesplash, 4 years ago

Milestone: UnscheduledR1

Reverting earlier milestone change

Note: See TracTickets for help on using tickets.