
ENH: Add timsort to npysort #12418

Merged
mattip merged 4 commits into numpy:master from liwt31:timsort-dev
Jan 31, 2019

Conversation

@liwt31 (Contributor) commented Nov 19, 2018

This is part of #12186.
This PR is still under development: only numeric timsort is implemented. For argsort and the sorting algorithms for string and generic types, a dummy mergesort is pasted as a placeholder. I wish to hear more feedback before moving on, since implementing the sibling sorting algorithms appears to require a lot of copy and paste.

Overview

What I've done so far:

  1. Implement timsort for numeric sort. More on this below.
  2. Add some lines to .gitignore to ignore generated files.
  3. Add another option, 't' (timsort), for the kind argument of the sort function. This is primarily for benchmarking convenience.
  4. Add the timsort algorithm to the test suite.
  5. Run some benchmarks comparing the four sorting algorithms (quick, merge, heap, tim). The benchmark is hosted in another repo.

The algorithm

The implementation is primarily based on Tim's little article, with some simplifications:

  • The insertion sort used when building runs is plain insertion sort, not binary-insertion sort
  • There is no galloping mode in the merge step

The reason for these simplifications is their large overhead. Tim's algorithm was designed specifically for CPython, which has very expensive comparison functions, so his optimizations aim to reduce the number of comparisons wherever possible. That is not necessary for most NumPy use cases. These further optimizations could perhaps be included for the generic sort, but IMHO that should be done in another PR.
I haven't prepared organized material showing the "large overhead" I observed in my experiments. If anyone is interested, I'd be happy to elaborate.
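To make the simplifications above concrete, here is a minimal Python sketch of the approach (illustration only — the PR itself is C code in npysort, and all names here are mine): runs are detected, reversed if strictly descending, extended to a minimum length with plain insertion sort, and merged without a galloping mode. The run-stack handling is also simplified compared to real timsort, which maintains stack invariants to balance merges.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place with plain (non-binary) insertion sort."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and x < a[j]:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def count_run(a, lo, hi, minrun):
    """Find a natural run starting at lo, reverse it if strictly
    descending, then extend it to at least minrun elements."""
    run_hi = lo + 1
    if run_hi < hi:
        if a[run_hi] < a[lo]:  # strictly descending: safe to reverse
            while run_hi < hi and a[run_hi] < a[run_hi - 1]:
                run_hi += 1
            a[lo:run_hi] = a[lo:run_hi][::-1]
        else:                  # non-descending
            while run_hi < hi and a[run_hi] >= a[run_hi - 1]:
                run_hi += 1
    run_hi = min(max(run_hi, lo + minrun), hi)
    insertion_sort(a, lo, run_hi)  # cheap: the prefix is already sorted
    return run_hi


def merge(a, lo, mid, hi):
    """Stably merge the adjacent sorted runs a[lo:mid] and a[mid:hi]."""
    left = a[lo:mid]           # copy the left run to scratch space
    i, j, k = 0, mid, lo
    while i < len(left) and j < hi:
        if a[j] < left[i]:     # strict < keeps equal elements stable
            a[k] = a[j]
            j += 1
        else:
            a[k] = left[i]
            i += 1
        k += 1
    a[k:k + len(left) - i] = left[i:]  # leftover left-run elements


def timsort(a, minrun=32):
    """Simplified timsort: no galloping, naive run-stack collapse."""
    runs = []
    lo, n = 0, len(a)
    while lo < n:
        hi = count_run(a, lo, n, minrun)
        runs.append((lo, hi))
        lo = hi
    while len(runs) > 1:       # real timsort merges by stack invariants
        lo2, hi2 = runs.pop()
        lo1, _ = runs.pop()
        merge(a, lo1, lo2, hi2)
        runs.append((lo1, hi2))
    return a
```

Forcing a small minrun (e.g. timsort(data, minrun=4)) makes multiple runs visible; the naive last-two-runs collapse is correct but does not give timsort's merge-balance guarantees.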

Benchmark

The benchmark includes a benchmark script, a Jupyter notebook that shows the shapes of the arrays involved (for an intuitive picture), and a readme with benchmark results and comments. Simply put, timsort performs well on "nearly sorted" data and shows minimal overhead on random data. Besides, I noticed the following two interesting phenomena:

  • Quicksort is slower than mergesort on random data, which suggests something may be wrong with the current quicksort implementation.
  • The standard deviation of the time cost is very large, even for a deterministic, already-sorted array.

I haven't had time to dig into these, and I'm more than happy to hear from you if they are actually perfectly normal.
Timsort is designed to be stable and I've tried to make my implementation so, but I haven't benchmarked or tested stability yet, because only numeric sort is implemented so far.
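For what it's worth, a stability test is easy to express against CPython's built-in sorted(), which is itself the reference timsort: sort records by key only, then check that equal keys keep their original relative order. This is only a sketch of what such a test could look like, not code from the PR:

```python
import random

# Tag each key with its original position, then sort by key only.
random.seed(42)
records = [(random.randrange(5), i) for i in range(1000)]
out = sorted(records, key=lambda r: r[0])  # CPython's sorted() is timsort

# A stable sort must keep equal keys in original (increasing-index) order.
for key in range(5):
    positions = [i for (k, i) in out if k == key]
    assert positions == sorted(positions)
```

The same structure would work for a NumPy stable-sort test by argsorting the keys and checking that indices of equal keys come out in increasing order.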

Feedback I wish to hear

  • Most importantly, is the timsort implementation OK? If so I can start porting it to argsort and so on.
  • Should timsort replace mergesort? (According to the benchmark results, probably yes.)
  • Is there any need to improve the NumPy benchmark suite for sorting algorithms? (BENCH: add benchmark suite for mergesort #12368)
  • All other comments, advice and suggestions.

Todo

  • Implement argsort, string sort and generic sort.
  • Tests and docs

@charris (Member) commented Nov 19, 2018

Quicksort is slower than mergesort for random data,

That's strange, historically it has been about 15-30% faster, improving with recent hardware. I wonder if something has changed.

In [1]: a = numpy.random.random(10**6)

In [2]: %timeit b = a.copy()
512 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit b = a.copy()
509 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit b = a.copy(); b.sort()
69.2 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit b = a.copy(); b.sort()
69.2 ms ± 89.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit b = a.copy(); b.sort(kind='mergesort')
79.6 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit b = a.copy(); b.sort(kind='mergesort')
79.6 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit b = a.copy(); b.sort(kind='heapsort')
125 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit b = a.copy(); b.sort(kind='heapsort')
123 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Seems about the same as before. What hardware are you running on?

@liwt31 (Author) commented Nov 20, 2018

@charris Thanks for the response. I can reproduce both your result and mine, on the development branch and on stable 1.15. We set up the arrays differently:

>>> a = np.random.rand(10**5)  # your way, random float64

>>> %timeit b=a.copy(); b.sort()
8.25 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
10.2 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> a = np.arange(10**5)   # my way, ordered int64 then shuffle

>>> np.random.shuffle(a)

>>> %timeit b=a.copy(); b.sort()
7.46 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
5.71 ms ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Shuffling a twice before sorting doesn't change anything, so it's unlikely to be related to the random number pattern. Meanwhile, changing the data type does the trick:

>>> a = np.arange(10**5)

>>> np.random.shuffle(a)

>>> a = np.float64(a)

>>> %timeit b=a.copy(); b.sort()
8.49 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
9.88 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> a = np.random.rand(10**5) * 10**5

>>> a = np.int64(a)

>>> %timeit b=a.copy(); b.sort()
7.43 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
5.95 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

After some digging in the assembly code, I think the reason for this is heavy compiler optimization of mergesort for int64. For integers, the following core mergesort loop

while (pj < pi && pm < pr) {
    if (@TYPE@_LT(*pm, *pj)) {
        *pk++ = *pm++;
    }
    else {
        *pk++ = *pj++;
    }
}

is translated to:

0x00007ffff08bf7d0 <+240>:   cmp    %rbp,%rcx
0x00007ffff08bf7d3 <+243>:   jbe    0x7ffff08bf808 <mergesort0_long+296>
0x00007ffff08bf7d5 <+245>:   mov    0x0(%rbp),%rax
0x00007ffff08bf7d9 <+249>:   mov    (%r12),%rdx
0x00007ffff08bf7dd <+253>:   lea    0x8(%rbp),%rdi
0x00007ffff08bf7e1 <+257>:   lea    0x8(%r12),%rsi
0x00007ffff08bf7e6 <+262>:   cmp    %rdx,%rax
0x00007ffff08bf7e9 <+265>:   cmovle %rdi,%rbp
0x00007ffff08bf7ed <+269>:   add    $0x8,%rbx
0x00007ffff08bf7f1 <+273>:   cmp    %rdx,%rax
0x00007ffff08bf7f4 <+276>:   cmovg  %rsi,%r12
0x00007ffff08bf7f8 <+280>:   cmp    %rax,%rdx
0x00007ffff08bf7fb <+283>:   cmovle %rdx,%rax
0x00007ffff08bf7ff <+287>:   cmp    %r12,%r13
0x00007ffff08bf802 <+290>:   mov    %rax,-0x8(%rbx)
0x00007ffff08bf806 <+294>:   ja     0x7ffff08bf7d0 <mergesort0_long+240>

Thanks to the low cost of integer comparison (the cmp between %rdx and %rax appears three times), the whole loop contains only two jump instructions, both of which are predictable. This keeps the CPU pipeline very efficient. When sorting floating-point numbers, however, the comparison is expensive, so more branches are introduced. Also, none of the quicksort assembly shows this pattern.
So for integer types, mergesort should be faster than quicksort? That is something new to me.

I didn't expect the numeric data type to have such a huge impact on sorting efficiency... Benchmarking becomes more and more difficult 😭.

BTW, if you are interested, I'm running on a Linux virtual machine on Windows. CPU info:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 94
model name  : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping    : 3
microcode   : 0x33
cpu MHz     : 2592.001
cache size  : 6144 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch epb fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm rdseed adx smap xsaveopt dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bogomips    : 5184.00
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

EDIT: I wrote a blog post on these interesting phenomena, "Mergesort is faster than Quicksort", including my tests on multiple CPUs and compilers.

@charris (Member) commented Dec 6, 2018

@liwt31 I'll get back to this once the 1.16 rc is out.

@liwt31 (Author) commented Dec 8, 2018

It's all right 😄. Take your time.

@charris (Member) commented Dec 31, 2018

I'm back. Your blog post was very interesting.

@charris (Member) commented Jan 1, 2019

The benchmarks are impressive; timsort certainly looks like a good replacement for mergesort, and in many cases for quicksort as well. I guess there might be some discussion as to whether it should silently replace mergesort in the same way that introsort replaced quicksort, but however that goes, it looks worth pursuing. The sorting benchmarks need updating per your issue on the matter, so that might be the next step here.

I find it surprising that an algorithm so complex does so well.

@charris (Member) commented Jan 1, 2019

There is another PR for radix sorting, #12586. It might be helpful to coordinate, at least for the benchmarks and tests.

@charris mentioned this pull request Jan 1, 2019
@liwt31 (Author) commented Jan 3, 2019

@charris Thanks for your reply! So it seems I can move on to the sibling algorithms then. This can probably be done in 1 or 2 weeks.

There is another PR for radix sorting, #12586. It might be helpful to coordinate, at least for the benchmarks and tests.

I see that he has updated the NumPy sorting benchmark and has noticed my little benchmark work here. If he finds any parts of my work interesting he can probably include these in his PR. I think the PR can also close #12368 which proposes to add benchmark for mergesort.

I find it surprising that an algorithm so complex does so well.

Well, the idea behind timsort isn't very complex (just like other great ideas), and I've removed some "over-complex" parts of it (primarily gallop mode during merging) for the NumPy use case.

@charris (Member) commented Jan 3, 2019

So it seems I can move on to the sibling algorithms then.

Yes, please do.

@charris (Member) commented Jan 3, 2019

PS, you might look to publish your blog post somewhere noticeable.

@liwt31 (Author) commented Jan 3, 2019

PS, you might look to publish your blog post somewhere noticeable.

Glad you like the post! I'd also like to have this piece of "new" knowledge been heard, but I can't think of any proper place. Do you have any suggestions?

@liwt31 (Author) commented Jan 8, 2019

Finished numeric argsort, string sorts and generic sort. New benchmarks here. I think the next thing to decide is whether to replace mergesort with timsort. Once that's decided, I can move on to adding docs for the new algorithm.

EDIT: Sorry for the force-push, a typo in the commit message.

@mattip (Member) commented Jan 18, 2019

This needs documentation in the 1.17 release note, in the docstrings, and in doc/source/*. You can look at what has been done for radix sort #12586.

Should the benchmarks as suggested in #12368 be merged as a condition for this PR?

@liwt31 (Author) commented Jan 21, 2019

@mattip I'm not sure whether timsort should replace mergesort. That decision is important for the docs.

OTOH, I can't add the docstring without rebasing, because when I forked there was no 1.17 release note file. I suppose it's OK for me to rebase?

Should the benchmarks as suggested in #12368 be merged as a condition for this PR?

#12368 suggests at least two ways to address the issue, and as far as I know we haven't decided yet, but this PR can probably be merged safely without updating the benchmark suite, because the code is already benchmarked (although elsewhere) and the issue can be solved in another PR.

@mattip (Member) commented Jan 21, 2019

Rebasing is fine. This is not a backport candidate to 1.16 as it is a new feature.

I don't think we should replace mergesort as the default at this time, until we have a chance to harden the code a bit, but others may feel differently.

@liwt31 (Author) commented Jan 21, 2019

Rebased and added docs, assuming mergesort is not replaced. If I missed something, or if mergesort should be replaced after all, please let me know.

@mattip (Member) commented Jan 21, 2019

@hameerabbasi any thoughts?

@hameerabbasi (Contributor):

I've been through this a few times rather quickly. I agree with @charris that we should make timsort replace mergesort... Also, this will have the side benefit of speeding up quicksort, as that falls back to mergesort as well.

Another (kind of selfish) factor is that my PR will need fewer rebase steps.

@mattip (Member) commented Jan 29, 2019

In our last weekly developer conversation, we decided that timsort should become the stable sort. I added a few comments where things need updating.

@mattip (Member) commented Jan 30, 2019

Why are the Azure builds (32-bit and macOS) failing? Rerunning them failed again, so it is not a one-off.

@liwt31 (Author) commented Jan 30, 2019

To me, it seems the "Run 32-bit Ubuntu Docker Build / Tests" job runs on a Windows host. The previous job #20190130.1 hit the same problem. Very weird...

@charris (Member) commented Jan 31, 2019

Azure is doing that to everyone at the moment; it is an Azure problem.

@mattip merged commit b5c855a into numpy:master Jan 31, 2019
@mattip (Member) commented Jan 31, 2019

Thanks @liwt31

@liwt31 (Author) commented Jan 31, 2019

Thank you all for your comments and advice.

@charris (Member) commented Jan 31, 2019

Hmm, I just noticed that this is going to break forward compatibility of extensions, because PyArray_ArrFuncs contains function pointer arrays whose size depends on NPY_NSORTS. If/when we release a NumPy 2.0 we should fix that, but I think the current upshot is that we just replace mergesort. @hameerabbasi That is going to make putting radix sort in more complicated, but I think the way forward is to make yet a third sort function that in turn calls either radixsort or timsort depending on the data type.
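The dispatch being proposed here could look roughly like the following Python sketch. This is an illustration of the idea only, with toy stand-ins: radix_sort and timsort below are hypothetical placeholders, not NumPy's C routines.

```python
def radix_sort(a, bits=16):
    """Stand-in LSD radix sort for non-negative integers < 2**bits.
    Stable, because each pass appends in input order."""
    for shift in range(0, bits, 8):
        buckets = [[] for _ in range(256)]
        for x in a:
            buckets[(x >> shift) & 0xFF].append(x)
        a = [x for b in buckets for x in b]
    return a


def timsort(a):
    """Stand-in: CPython's sorted() is itself a timsort."""
    return sorted(a)


def stable_sort(a, dtype_kind):
    """Sketch of the proposed third sort function: one 'stable' slot
    that picks radix sort or timsort based on the dtype kind."""
    if dtype_kind in ("int", "uint", "bool"):
        return radix_sort(a)
    return timsort(a)
```

In the real C code, the choice would be made once per dtype when the sort-function slot is filled in, rather than per call.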

@charris (Member) commented Jan 31, 2019

Let's call the third function stable_sort if we go that way.

EDIT: Actually, we just need to change the initialization of the pointers depending on type. The various sorts should still be independent for functions that do not call through the function pointers.

@hameerabbasi (Contributor):

Since this is most likely going into 1.17, which breaks 2.7 compatibility either way, wouldn't it be okay to push it in?

@hameerabbasi (Contributor):

Anyway, I'd be okay with replacing mergesort with radix/timsort based on the dtype as well.

@mattip (Member) commented Feb 1, 2019

wouldn't it be okay to push it in

No, since it would move the offsets of all the following functions, which would break too many things. We should have a test for the offset of the last element in each of the exported structs, to make sure this kind of change is spotted earlier...

@liwt31 (Author) commented Feb 3, 2019

Not sure I understand how and why a change in the size of PyArray_ArrFuncs affects forward compatibility. Can I simply take it that mergesort should be replaced with timsort, to keep the total number of sorting algorithms at 3?

@charris (Member) commented Feb 3, 2019

@liwt31 The offsets of the elements in the structure change, so code compiled against the old structure will access the wrong elements in the new structure. The idea of forward compatibility is that C extension modules compiled against an older NumPy will run with newer NumPy versions; what we don't guarantee is that extension modules compiled against a new NumPy run on an old NumPy. If new members are added at the end of a structure there is no problem, because old code will never access them and the offsets of the old elements don't change. Unfortunately, the arrays containing the sort function pointers are in the middle of the PyArray_ArrFuncs structure, so if they change size, all the elements after them are shifted.

The upshot is that we can have lots of sort functions, but any given data type can only use six of them: three sorts and three argsorts. I think the way to do that is to have more enum types, but fix NPY_NSORTS at three (with a comment). We cannot change the name of that macro, because that would force downstream projects to rewrite their code.
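The offset shift is easy to demonstrate with a toy struct via ctypes. To be clear, this is a hypothetical layout for illustration only, not the real PyArray_ArrFuncs: growing a mid-struct array moves every field that follows it.

```python
import ctypes

class OldFuncs(ctypes.Structure):
    # Hypothetical stand-in for a struct with NPY_NSORTS == 3 slots.
    _fields_ = [
        ("cast", ctypes.c_void_p),
        ("sort", ctypes.c_void_p * 3),
        ("compare", ctypes.c_void_p),
    ]

class NewFuncs(ctypes.Structure):
    # The same struct after naively adding a fourth sort kind.
    _fields_ = [
        ("cast", ctypes.c_void_p),
        ("sort", ctypes.c_void_p * 4),
        ("compare", ctypes.c_void_p),
    ]

# Every field after the grown array shifts by one pointer width, so an
# extension compiled against OldFuncs would read garbage from NewFuncs.
shift = NewFuncs.compare.offset - OldFuncs.compare.offset
assert shift == ctypes.sizeof(ctypes.c_void_p)
```

Appending new function tables at the end of the structure avoids this, because the offsets of all existing fields are unchanged.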

@charris (Member) commented Feb 3, 2019

Note that the PyArray_ArrFuncs structures are initialized in arraytypes.c.src.

@charris (Member) commented Feb 3, 2019

~~I think the way to do that is to have more enum types~~

Changed my mind. We should leave the enum types as they were, maybe with some added aliases, and just change the structure initializations.

@charris (Member) commented Feb 3, 2019

@liwt31 Do you want to do this? I was going to do it myself but would be happy if you wanted to pursue it.

@charris (Member) commented Feb 3, 2019

Another option is to add additional function tables to the end of the structure.

@liwt31 (Author) commented Feb 4, 2019 via email

@mattip (Member) commented Feb 7, 2019

Looking at this again, I am not sure exactly where the change manifests itself. We use PyArray_ArrFuncs in dtype construction, so any users who build their own dtypes will have a wrong-sized dtype. But:

  • Won't they need to do so in any case? How will we detect that PyArray_DESCR(arr) is missing the new functions?
  • How many of these dtypes exist in the wild? Checking SciPy and pandas, I did not find any use of PyArray_ArrFuncs.

Note that pickling or np.save/np.load across numpy versions will not be problematic since the dtypes are rebuilt (or the internal singletons reused).

@mattip (Member) commented Feb 7, 2019

xref PR #12945

@rgommers changed the title from "WIP, ENH: Add timsort to npysort" to "ENH: Add timsort to npysort" on Mar 22, 2019