ENH: Perf improvements to np.sort, np.argsort, np.partition and np.argpartition by r-devulap · Pull Request #24924 · numpy/numpy

r-devulap · 2023-10-13T21:06:38Z

Updating x86-simd-sort to latest commit. Includes 2 major updates:

Speed up np.sort by up-to 2x for 32-bit and up-to 1.5x for 64-bit data. Ref: Various performance improvements x86-simd-sort#83 adds optimal sorting networks and minor improvements to vectorized partitioning. This also speeds up np.partition by up-to 1.3x.
np.argsort and np.argparition relied heavily on the gather instruction which unfortunately is terrible for performance because of a new vulnerability DOWNFALL (see https://www.phoronix.com/review/downfall). We reverted back to using scalar emulation of gather (see Use scalar emulation of gather instruction for arg methods x86-simd-sort#65) which then improves performance by about 1.6x for both np.argsort and np.argpartition.

Benchmarks for random data:

| Change   | Before [67539a40] <main>   | After [da55324f] <xss-perf>   |   Ratio | Benchmark (Parameter)                                                                    |
|----------|----------------------------|-------------------------------|---------|------------------------------------------------------------------------------------------|
| -        | 124±2μs                    | 118±0.9μs                     |    0.95 | bench_function_base.Partition.time_partition('int64', ('random',), 10)                   |
| -        | 124±0.8μs                  | 118±0.3μs                     |    0.95 | bench_function_base.Partition.time_partition('int64', ('random',), 1000)                 |
| -        | 124±0.3μs                  | 107±1μs                       |    0.87 | bench_function_base.Partition.time_partition('float32', ('random',), 10)                 |
| -        | 123±0.4μs                  | 105±0.4μs                     |    0.86 | bench_function_base.Partition.time_partition('float32', ('random',), 1000)               |
| -        | 123±0.4μs                  | 105±0.2μs                     |    0.85 | bench_function_base.Partition.time_partition('float32', ('random',), 100)                |
| -        | 56.5±0.5μs                 | 44.7±1μs                      |    0.79 | bench_function_base.Partition.time_partition('int32', ('random',), 10)                   |
| -        | 57.4±0.4μs                 | 43.6±1μs                      |    0.76 | bench_function_base.Partition.time_partition('int32', ('random',), 100)                  |
| -        | 58.1±0.8μs                 | 43.4±0.8μs                    |    0.75 | bench_function_base.Partition.time_partition('int32', ('random',), 1000)                 |
| -        | 73.2±0.06μs                | 50.5±0.02μs                   |    0.69 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                        |
| -        | 210±0.2μs                  | 142±0.2μs                     |    0.68 | bench_function_base.Sort.time_argsort('quick', 'int64', ('random',))                     |
| -        | 60.0±0.1μs                 | 40.2±0.2μs                    |    0.67 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))                      |
| -        | 447±0.6μs                  | 295±2μs                       |    0.66 | bench_function_base.Partition.time_argpartition('float64', ('random',), 10)              |
| -        | 449±1μs                    | 298±1μs                       |    0.66 | bench_function_base.Partition.time_argpartition('float64', ('random',), 100)             |
| -        | 445±1μs                    | 294±1μs                       |    0.66 | bench_function_base.Partition.time_argpartition('float64', ('random',), 1000)            |
| -        | 34.4±0.05μs                | 22.7±0.03μs                   |    0.66 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                        |
| -        | 218±0.5μs                  | 142±0.2μs                     |    0.65 | bench_function_base.Sort.time_argsort('quick', 'float64', ('random',))                   |
| -        | 211±0.7μs                  | 136±1μs                       |    0.64 | bench_function_base.Sort.time_argsort('quick', 'float32', ('random',))                   |
| -        | 35.7±0.1μs                 | 22.8±0.05μs                   |    0.64 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))                       |
| -        | 418±3μs                    | 264±1μs                       |    0.63 | bench_function_base.Partition.time_argpartition('float32', ('random',), 10)              |
| -        | 419±3μs                    | 264±2μs                       |    0.63 | bench_function_base.Partition.time_argpartition('float32', ('random',), 100)             |
| -        | 416±3μs                    | 261±1μs                       |    0.63 | bench_function_base.Partition.time_argpartition('float32', ('random',), 1000)            |
| -        | 424±0.7μs                  | 264±1μs                       |    0.62 | bench_function_base.Partition.time_argpartition('int64', ('random',), 10)                |
| -        | 427±1μs                    | 266±0.5μs                     |    0.62 | bench_function_base.Partition.time_argpartition('int64', ('random',), 100)               |
| -        | 423±0.7μs                  | 264±1μs                       |    0.62 | bench_function_base.Partition.time_argpartition('int64', ('random',), 1000)              |
| -        | 410±3μs                    | 251±2μs                       |    0.61 | bench_function_base.Partition.time_argpartition('int32', ('random',), 100)               |
| -        | 408±3μs                    | 248±2μs                       |    0.61 | bench_function_base.Partition.time_argpartition('int32', ('random',), 1000)              |
| -        | 190±0.7μs                  | 116±0.2μs                     |    0.61 | bench_function_base.Sort.time_argsort('quick', 'int32', ('random',))                     |
| -        | 190±0.6μs                  | 115±0.2μs                     |    0.61 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('random',))                    |
| -        | 412±3μs                    | 249±2μs                       |    0.6  | bench_function_base.Partition.time_argpartition('int32', ('random',), 10)                |
| -        | 44.5±0.07μs                | 22.6±0.04μs                   |    0.51 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))

Detailed benchmarks can be seen here: https://gist.github.com/r-devulap/21dc3afdbab47c7aa08087c1445954b7

seberg · 2023-10-17T08:31:03Z

@r-devulap the test failures here do look real, at least in the form of compiler warnings.

EDIT: If you should want to backport the DOWNFALL fix, that probably would need a more targeted diff. OTOH, I am not sure it's worth the trouble since it "only" restores performance.

r-devulap · 2023-10-17T20:54:34Z

@seberg Indeed, still working on fixing them. And I don't think we need to worry about backporting the DOWNFALL fix.

This reverts commit 76d5534.

r-devulap · 2023-10-18T04:23:40Z

Update: We have changed API to use size_t for all array indices and the calls to avx512 routines are updated accordingly. That made it easy for us to vectorize np.argpartition and np.argsort on 32-bit platforms (see PR numpy/x86-simd-sort#88).

seberg · 2023-10-18T11:29:14Z

We have changed API to use size_t for all array indices

So that is technically be broken or need a guard until we would do gh-24888? Although I guess that the use of that code implies we are not on a niche platform.

r-devulap · 2023-10-18T15:15:29Z

We have changed API to use size_t for all array indices

So that is technically be broken or need a guard until we would do gh-24888? Although I guess that the use of that code implies we are not on a niche platform.

I don't think that's necessary.

For the sort and partition methods npy_intp will get cast to size_t. Array size passed to sort/partition won't be negative and so this implicit cast isn't a problem.
For argsort, we need to reinterpret_cast npy_intp* to size_t*. Whether npy_intp is Py_intptr_t or ssize_t shouldn't make a difference to this cast.

let me know if my assumptions are wrong.

r-devulap · 2023-10-18T21:27:07Z

hmm, the macOS x86_64 failure seems hard to figure out. It's using x86_64-apple-darwin13.4.0-clang++ (clang 15.0.7 "clang version 15.0.7") but I don't see this build fail with clang++-15.0.7 locally.

Fails with: error: implicit instantiation of undefined template 'zmm_vector<unsigned long>'. With long as 8 bytes, unsigned long should just default to uin64_t and it should find the template definition like all other cases.

r-devulap · 2023-10-19T18:33:06Z

yay, finally. Decided to explicitly instantiate to zmm_vector<uint64_t> on 64-bit macOS.

r-devulap · 2023-11-02T22:41:09Z

Friendly ping :)

r-devulap · 2023-11-27T19:13:51Z

Changes in NumPy here are pretty minor in this PR. Perhaps we can close this and merge it into a single PR #25045. Please reopen if you disagree.

github-actions bot added the 01 - Enhancement label Oct 13, 2023

r-devulap mentioned this pull request Oct 17, 2023

Fix NumPy CI failures numpy/x86-simd-sort#86

Merged

Raghuveer Devulapalli added 3 commits October 17, 2023 10:14

Update x86-simd-sort to latest

40c202f

Dont include arg methods on 32-bit platforms

76d5534

Update x86-simd-sort to latest

7f28d3b

r-devulap force-pushed the xss-perf branch from 2c7e029 to 7f28d3b Compare October 17, 2023 17:16

Raghuveer Devulapalli added 3 commits October 17, 2023 20:47

Revert "Dont include arg methods on 32-bit platforms"

4c1e4a7

This reverts commit 76d5534.

Enable np.partition and np.argpartition on 32-bit

aad69d8

Update x86-simd-sort to latest

6f88670

Update x86-simd-sort to latest

53a7cd3

Update x86-simd-sort to latest

fd0e377

r-devulap mentioned this pull request Oct 31, 2023

ENH: Vectorize np.sort and np.partition with AVX2 #25045

Merged

seiko2plus mentioned this pull request Nov 20, 2023

ENH: Use Highway's VQSort on AArch64 #24018

Merged

r-devulap closed this Nov 27, 2023

r-devulap mentioned this pull request Dec 22, 2023

BUG: Fix build issues on SPR #25475

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ENH: Perf improvements to np.sort, np.argsort, np.partition and np.argpartition #24924

ENH: Perf improvements to np.sort, np.argsort, np.partition and np.argpartition #24924
r-devulap wants to merge 8 commits intonumpy:mainfrom
r-devulap:xss-perf

r-devulap commented Oct 13, 2023 •

edited

Loading

Uh oh!

seberg commented Oct 17, 2023 •

edited

Loading

Uh oh!

r-devulap commented Oct 17, 2023

Uh oh!

r-devulap commented Oct 18, 2023

Uh oh!

seberg commented Oct 18, 2023

Uh oh!

r-devulap commented Oct 18, 2023

Uh oh!

r-devulap commented Oct 18, 2023

Uh oh!

r-devulap commented Oct 19, 2023

Uh oh!

r-devulap commented Nov 2, 2023

Uh oh!

r-devulap commented Nov 27, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Uh oh!

Conversation

r-devulap commented Oct 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seberg commented Oct 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

r-devulap commented Oct 17, 2023

Uh oh!

r-devulap commented Oct 18, 2023

Uh oh!

seberg commented Oct 18, 2023

Uh oh!

r-devulap commented Oct 18, 2023

Uh oh!

r-devulap commented Oct 18, 2023

Uh oh!

r-devulap commented Oct 19, 2023

Uh oh!

r-devulap commented Nov 2, 2023

Uh oh!

r-devulap commented Nov 27, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

r-devulap commented Oct 13, 2023 •

edited

Loading

seberg commented Oct 17, 2023 •

edited

Loading