ENH: Enable SVE detection for Highway VQSort#25247
Conversation
|
@rdevulap don't panic; this was just so easy to do that I wanted to see if it would work 😸 I'll wait for your patches to be in and rebase on top of them. @seiko2plus SVE features! Woo! |
0d56dbb to
20b5f2d
Compare
seiko2plus
left a comment
There was a problem hiding this comment.
LGTM, Thank you. Just a few changes more.
This pr doesn't change any C/C++ code so it should be fine to merge it. |
4c05be1 to
027cce0
Compare
This will conflict with https://github.com/numpy/numpy/pull/25045/files#diff-1df2f2cfde074ed7de97ebae8a7ef48ad000241256ea3f480f81b2598d03b4f8R784-R788 |
027cce0 to
f0b936d
Compare
Leveraging the meson infrastructure to selectively enable SVE specifically for Highway, which already supports SVE. ``` | Change | Before [94bc564] <main> | After [1ffedb85] <sve-sort> | Ratio | Benchmark (Parameter) | |----------|----------------------------|-------------------------------|---------|----------------------------------------------------------------------------------| | + | 551±0.8μs | 654±5μs | 1.19 | bench_function_base.Sort.time_sort('merge', 'float32', ('random',)) | | + | 72.9±0.1μs | 80.5±0.08μs | 1.11 | bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',)) | | + | 553±1μs | 606±2μs | 1.1 | bench_function_base.Sort.time_sort('merge', 'float64', ('random',)) | | + | 41.9±0.3μs | 45.6±0.04μs | 1.09 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000)) | | + | 493±1μs | 532±2μs | 1.08 | bench_function_base.Sort.time_sort('merge', 'float16', ('random',)) | | + | 73.7±0.2μs | 78.8±0.04μs | 1.07 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100)) | | + | 565±1μs | 600±1μs | 1.06 | bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',)) | | + | 465±1μs | 491±1μs | 1.06 | bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',)) | | + | 696±3μs | 739±2μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',)) | | + | 645±3μs | 684±3μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10)) | | + | 651±1μs | 691±1μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100)) | | + | 627±3μs | 665±1μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000)) | | + | 467±1μs | 494±0.7μs | 1.06 | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',)) | | + | 166±0.1μs | 174±1μs | 1.05 | bench_function_base.Sort.time_sort('merge', 'float32', ('sorted_block', 10)) | | - | 77.4±0.08μs | 73.2±0.2μs | 0.95 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100)) | | - | 379±1μs | 359±0.1μs | 0.95 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',)) | | - | 341±0.5μs | 324±0.5μs | 0.95 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000)) | | - | 590±0.5μs | 554±1μs | 0.94 | bench_function_base.Sort.time_argsort('heap', 'float32', ('ordered',)) | | - | 239±5μs | 226±0.6μs | 0.94 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 10)) | | - | 195±2μs | 184±1μs | 0.94 | bench_function_base.Sort.time_argsort('merge', 'float32', ('sorted_block', 10)) | | - | 692±2μs | 637±2μs | 0.92 | bench_function_base.Sort.time_argsort('merge', 'float32', ('random',)) | | - | 45.5±0.03μs | 42.0±0.2μs | 0.92 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000)) | | - | 80.5±0.07μs | 73.0±0.1μs | 0.91 | bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',)) | | - | 78.9±0.2μs | 71.7±0.2μs | 0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',)) | | - | 79.2±0.1μs | 72.1±0.2μs | 0.91 | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',)) | | - | 131±2μs | 118±0.8μs | 0.9 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 100)) | | - | 82.8±0.2μs | 73.7±0.3μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',)) | | - | 83.4±0.07μs | 74.1±0.2μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',)) | | - | 78.6±0.2μs | 70.3±0.2μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',)) | | - | 79.2±0.09μs | 70.8±0.08μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',)) | | - | 3.22±0.02μs | 2.86±0μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',)) | | - | 3.26±0.04μs | 2.84±0μs | 0.87 | bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',)) | | - | 82.6±0.06μs | 71.1±0.08μs | 0.86 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10)) | | - | 4.91±0.01μs | 4.22±0μs | 0.86 | bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',)) | | - | 79.0±0.2μs | 66.8±0.05μs | 0.85 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 1000)) | | - | 78.8±0.05μs | 67.0±0.2μs | 0.85 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10)) | | - | 84.2±0.07μs | 70.8±0.1μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',)) | | - | 89.4±0.1μs | 75.5±0.05μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000)) | | - | 78.9±0.04μs | 65.9±0.1μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10)) | | - | 85.4±0.06μs | 71.9±0.05μs | 0.84 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000)) | | - | 85.3±0.08μs | 70.5±0.1μs | 0.83 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000)) | | - | 80.5±0.03μs | 66.4±0.1μs | 0.83 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',)) | | - | 87.5±0.05μs | 71.6±0.1μs | 0.82 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100)) | | - | 80.4±0.05μs | 65.4±0.07μs | 0.81 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',)) | | - | 83.6±0.05μs | 66.9±0.1μs | 0.8 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100)) | | - | 83.5±0.05μs | 65.8±0.08μs | 0.79 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100)) | | - | 6.87±0.01μs | 5.13±0.08μs | 0.75 | bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',)) | | - | 12.2±0.02μs | 8.79±0.1μs | 0.72 | bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',)) | | - | 193±0.5μs | 124±0.5μs | 0.65 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',)) | | - | 27.7±0.2ms | 18.0±0.2ms | 0.65 | bench_function_base.Sort.time_sort_worst | | - | 192±0.4μs | 123±0.2μs | 0.64 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',)) | | - | 202±0.2μs | 128±0.04μs | 0.63 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000)) | | - | 203±0.5μs | 125±0.09μs | 0.62 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100)) | | - | 199±0.4μs | 122±0.07μs | 0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',)) | | - | 195±0.4μs | 120±0.09μs | 0.61 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10)) | | - | 215±0.3μs | 121±0.3μs | 0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',)) | | - | 216±0.3μs | 121±0.7μs | 0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',)) | | - | 225±0.3μs | 126±0.3μs | 0.56 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000)) | | - | 223±0.2μs | 119±0.09μs | 0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',)) | | - | 219±0.06μs | 118±0.08μs | 0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10)) | | - | 227±0.3μs | 123±0.2μs | 0.54 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100)) | ```
f0b936d to
682d33c
Compare
|
@seiko2plus I've rebased this on top of @rdevulap's changes, I think this is good to go? |
Leveraging the meson infrastructure to selectively enable SVE specifically for Highway, which already supports SVE.