ENH: Perf improvements to np.sort, np.argsort, np.partition and np.argpartition #24924
r-devulap wants to merge 8 commits into numpy:main
|
@r-devulap the test failures here do look real, at least in the form of compiler warnings. EDIT: If you should want to backport the DOWNFALL fix, that probably would need a more targeted diff. OTOH, I am not sure it's worth the trouble since it "only" restores performance. |
|
@seberg Indeed, still working on fixing them. And I don't think we need to worry about backporting the DOWNFALL fix. |
|
Update: We have changed API to use |
So that would technically be broken, or need a guard, until we do gh-24888? Although I guess that the use of that code implies we are not on a niche platform. |
I don't think that's necessary.
Let me know if my assumptions are wrong. |
|
hmm, the macOS x86_64 failure seems hard to figure out. It's using x86_64-apple-darwin13.4.0-clang++ (clang 15.0.7 "clang version 15.0.7") but I don't see this build fail with clang++-15.0.7 locally. Fails with: |
|
yay, finally. Decided to explicitly instantiate to |
|
Friendly ping :) |
|
Changes in NumPy here are pretty minor in this PR. Perhaps we can close this and merge it into a single PR #25045. Please reopen if you disagree. |
Updating x86-simd-sort to latest commit. Includes 2 major updates:
1. Speeds up np.sort by up to 2x for 32-bit and up to 1.5x for 64-bit data. Ref: x86-simd-sort#83 ("Various performance improvements"), which adds optimal sorting networks and minor improvements to vectorized partitioning. This also speeds up np.partition by up to 1.3x.
2. np.argsort and np.argpartition relied heavily on the gather instruction, which unfortunately is terrible for performance because of the mitigation for a new vulnerability, DOWNFALL (see https://www.phoronix.com/review/downfall). We reverted to scalar emulation of gather (see x86-simd-sort#65, "Use scalar emulation of gather instruction for arg methods"), which improves performance by about 1.6x for both np.argsort and np.argpartition.

Benchmarks for random data:
Detailed benchmarks can be seen here: https://gist.github.com/r-devulap/21dc3afdbab47c7aa08087c1445954b7
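
For readers unfamiliar with the term, here is a rough sketch of what "scalar emulation of gather" means for the arg methods. This is not the x86-simd-sort code: `gather_scalar` and `load_keys` are hypothetical names made up for illustration, and the real library works on SIMD registers rather than plain arrays. The idea is only that an argsort-style routine permutes indices while comparing the keys those indices point to, so it has to fetch `keys[idx[i]]` for a block of indices. That fetch can be done with a hardware gather or, as below, one element at a time.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the x86-simd-sort API): scalar emulation of a
// gather. Reads keys[idx[i]] one element at a time into a contiguous buffer
// instead of issuing a single hardware gather instruction.
template <typename T>
void gather_scalar(const T* keys, const std::int64_t* idx, T* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = keys[idx[i]];
    }
}

// Hypothetical usage: collect the keys addressed by a block of indices.
// Assumes every index in idx is a valid position in keys.
template <typename T>
std::vector<T> load_keys(const std::vector<T>& keys,
                         const std::vector<std::int64_t>& idx) {
    std::vector<T> out(idx.size());
    gather_scalar(keys.data(), idx.data(), out.data(), idx.size());
    return out;
}
```

Calling `load_keys` with keys `{3.5, 1.25, 9.0, -2.0}` and indices `{3, 1, 0, 2}` yields the keys in index order: `-2.0, 1.25, 3.5, 9.0`. In a vectorized implementation the contiguous buffer would presumably then be loaded into a vector register with an ordinary load, which is not affected by the gather slowdown.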