
ENH: Add timsort to npysort #12418

Merged
mattip merged 4 commits into numpy:master from liwt31:timsort-dev
Jan 31, 2019

Conversation

@liwt31 (Contributor) commented Nov 19, 2018

This is part of #12186.
This PR is still under development: only numeric timsort is implemented. For argsort and the sorting algorithms for string and generic types, a dummy mergesort is pasted as a placeholder. I wish to hear more feedback before moving on, since implementing the sibling sorting algorithms appears to require a lot of copy and paste.

Overview

What I've done so far:

  1. Implement timsort for numeric sort. More on this below.
  2. Add some lines to .gitignore to ignore generated files.
  3. Add another option, 't' (timsort), for the kind argument of the sort function. This is primarily for benchmarking convenience.
  4. Add the timsort algorithm to the test suite.
  5. Run some benchmarks comparing the four sorting algorithms (quick, merge, heap, tim). The benchmark is hosted in another repo.

The algorithm

The implementation is primarily based on Tim's little article, with some simplifications:

  • The insertion sort used when building runs is plain insertion sort, not binary-insertion sort
  • There is no galloping mode in the merge step

The reason for these simplifications is their large overhead. Tim's algorithm was designed specifically for CPython, which has very expensive comparison functions, so his optimizations aim to reduce the number of comparisons wherever possible. That is not necessary for most NumPy use cases. These further optimizations could perhaps be included for the generic sort, but IMHO that should be done in another PR.
I haven't prepared organized material showing the "large overhead" I observed in my experiments. If anyone is interested, I'd be happy to elaborate.
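To make the simplifications above concrete, here is a minimal Python sketch of the approach (illustration only — the PR itself is C code in npysort, and all names here are mine): runs are detected, reversed if strictly descending, extended to a minimum length with plain insertion sort, and merged without a galloping mode. The run-stack handling is also simplified compared to real timsort, which maintains stack invariants to balance merges.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place with plain (non-binary) insertion sort."""
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and x < a[j]:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def count_run(a, lo, hi, minrun):
    """Find a natural run starting at lo, reverse it if strictly
    descending, then extend it to at least minrun elements."""
    run_hi = lo + 1
    if run_hi < hi:
        if a[run_hi] < a[lo]:  # strictly descending: safe to reverse
            while run_hi < hi and a[run_hi] < a[run_hi - 1]:
                run_hi += 1
            a[lo:run_hi] = a[lo:run_hi][::-1]
        else:                  # non-descending
            while run_hi < hi and a[run_hi] >= a[run_hi - 1]:
                run_hi += 1
    run_hi = min(max(run_hi, lo + minrun), hi)
    insertion_sort(a, lo, run_hi)  # cheap: the prefix is already sorted
    return run_hi


def merge(a, lo, mid, hi):
    """Stably merge the adjacent sorted runs a[lo:mid] and a[mid:hi]."""
    left = a[lo:mid]           # copy the left run to scratch space
    i, j, k = 0, mid, lo
    while i < len(left) and j < hi:
        if a[j] < left[i]:     # strict < keeps equal elements stable
            a[k] = a[j]
            j += 1
        else:
            a[k] = left[i]
            i += 1
        k += 1
    a[k:k + len(left) - i] = left[i:]  # leftover left-run elements


def timsort(a, minrun=32):
    """Simplified timsort: no galloping, naive run-stack collapse."""
    runs = []
    lo, n = 0, len(a)
    while lo < n:
        hi = count_run(a, lo, n, minrun)
        runs.append((lo, hi))
        lo = hi
    while len(runs) > 1:       # real timsort merges by stack invariants
        lo2, hi2 = runs.pop()
        lo1, _ = runs.pop()
        merge(a, lo1, lo2, hi2)
        runs.append((lo1, hi2))
    return a
```

Forcing a small minrun (e.g. timsort(data, minrun=4)) makes multiple runs visible; the naive last-two-runs collapse is correct but does not give timsort's merge-balance guarantees.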

Benchmark

The benchmark includes a benchmark script, a Jupyter notebook that shows the shapes of the arrays involved (for an intuitive picture), and a readme with benchmark results and comments. Simply put, timsort performs well on "nearly sorted" data and shows minimal overhead on random data. Besides, I noticed the following two interesting phenomena:

  • Quicksort is slower than mergesort on random data, which suggests something may be wrong with the current quicksort implementation.
  • The standard deviation of the time cost is very large, even for a deterministic, already-sorted array.

I haven't had time to dig into these, and I'm more than happy to hear from you if they are actually perfectly normal.
Timsort is designed to be stable and I've tried to make my implementation so, but I haven't benchmarked or tested stability yet, because only numeric sort is implemented so far.
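For what it's worth, a stability test is easy to express against CPython's built-in sorted(), which is itself the reference timsort: sort records by key only, then check that equal keys keep their original relative order. This is only a sketch of what such a test could look like, not code from the PR:

```python
import random

# Tag each key with its original position, then sort by key only.
random.seed(42)
records = [(random.randrange(5), i) for i in range(1000)]
out = sorted(records, key=lambda r: r[0])  # CPython's sorted() is timsort

# A stable sort must keep equal keys in original (increasing-index) order.
for key in range(5):
    positions = [i for (k, i) in out if k == key]
    assert positions == sorted(positions)
```

The same structure would work for a NumPy stable-sort test by argsorting the keys and checking that indices of equal keys come out in increasing order.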

Feedback I wish to hear

  • Most importantly, is the timsort implementation OK? If so I can start porting it to argsort and so on.
  • Should timsort replace mergesort? (According to the benchmark results, probably yes.)
  • Is there any need to improve the NumPy benchmark suite for sorting algorithms? (BENCH: add benchmark suite for mergesort #12368)
  • All other comments, advice and suggestions.

Todo

  • Implement argsort, string sort and generic sort.
  • Tests and docs

@charris (Member) commented Nov 19, 2018

Quicksort is slower than mergesort for random data,

That's strange, historically it has been about 15-30% faster, improving with recent hardware. I wonder if something has changed.

In [1]: a = numpy.random.random(10**6)

In [2]: %timeit b = a.copy()
512 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit b = a.copy()
509 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit b = a.copy(); b.sort()
69.2 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit b = a.copy(); b.sort()
69.2 ms ± 89.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit b = a.copy(); b.sort(kind='mergesort')
79.6 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit b = a.copy(); b.sort(kind='mergesort')
79.6 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit b = a.copy(); b.sort(kind='heapsort')
125 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit b = a.copy(); b.sort(kind='heapsort')
123 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Seems about the same as before. What hardware are you running on?

@liwt31 (Author) commented Nov 20, 2018

@charris Thanks for the response. I can reproduce both your result and mine, on the development branch and on stable 1.15. We set up the arrays differently:

>>> a = np.random.rand(10**5)  # your way, random float64

>>> %timeit b=a.copy(); b.sort()
8.25 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
10.2 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> a = np.arange(10**5)   # my way, ordered int64 then shuffle

>>> np.random.shuffle(a)

>>> %timeit b=a.copy(); b.sort()
7.46 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
5.71 ms ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Shuffling a twice before sorting doesn't change anything, so it's unlikely to be related to the random number pattern. Meanwhile, changing the data type does the trick:

>>> a = np.arange(10**5)

>>> np.random.shuffle(a)

>>> a = np.float64(a)

>>> %timeit b=a.copy(); b.sort()
8.49 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
9.88 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> a = np.random.rand(10**5) * 10**5

>>> a = np.int64(a)

>>> %timeit b=a.copy(); b.sort()
7.43 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit b=a.copy(); b.sort(kind='m')
5.95 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

After some digging in the assembly code, I think the reason for this is heavy compiler optimization of mergesort for int64. For integers, the following core mergesort loop

while (pj < pi && pm < pr) {
    if (@TYPE@_LT(*pm, *pj)) {
        *pk++ = *pm++;
    }
    else {
        *pk++ = *pj++;
    }
}

is translated to:

0x00007ffff08bf7d0 <+240>:   cmp    %rbp,%rcx
0x00007ffff08bf7d3 <+243>:   jbe    0x7ffff08bf808 <mergesort0_long+296>
0x00007ffff08bf7d5 <+245>:   mov    0x0(%rbp),%rax
0x00007ffff08bf7d9 <+249>:   mov    (%r12),%rdx
0x00007ffff08bf7dd <+253>:   lea    0x8(%rbp),%rdi
0x00007ffff08bf7e1 <+257>:   lea    0x8(%r12),%rsi
0x00007ffff08bf7e6 <+262>:   cmp    %rdx,%rax
0x00007ffff08bf7e9 <+265>:   cmovle %rdi,%rbp
0x00007ffff08bf7ed <+269>:   add    $0x8,%rbx
0x00007ffff08bf7f1 <+273>:   cmp    %rdx,%rax
0x00007ffff08bf7f4 <+276>:   cmovg  %rsi,%r12
0x00007ffff08bf7f8 <+280>:   cmp    %rax,%rdx
0x00007ffff08bf7fb <+283>:   cmovle %rdx,%rax
0x00007ffff08bf7ff <+287>:   cmp    %r12,%r13
0x00007ffff08bf802 <+290>:   mov    %rax,-0x8(%rbx)
0x00007ffff08bf806 <+294>:   ja     0x7ffff08bf7d0 <mergesort0_long+240>

Thanks to the low cost of integer comparison (the cmp between %rdx and %rax appears three times), the whole loop contains only two jump instructions, both of which are predictable. This keeps the CPU pipeline very efficient. When sorting floating-point numbers, however, the comparison is expensive, so more branches are introduced. Also, none of the quicksort assembly shows this pattern.
So for integer types, mergesort should be faster than quicksort? That is something new to me.

I didn't expect the numeric data type to have such a huge impact on sorting efficiency... Benchmarking becomes more and more difficult 😭.

BTW, if you are interested, I'm running on a Linux virtual machine on Windows. CPU info:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 94
model name  : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
stepping    : 3
microcode   : 0x33
cpu MHz     : 2592.001
cache size  : 6144 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch epb fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm rdseed adx smap xsaveopt dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bogomips    : 5184.00
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

EDIT: I wrote a blog post on these interesting phenomena, "Mergesort is faster than Quicksort", including my tests on multiple CPUs and compilers.

@charris (Member) commented Dec 6, 2018

@liwt31 I'll get back to this once the 1.16 rc is out.

@liwt31 (Author) commented Dec 8, 2018

It's all right 😄. Take your time.

@charris (Member) commented Dec 31, 2018

I'm back. Your blog post was very interesting.

@charris (Member) commented Jan 1, 2019

The benchmarks are impressive; timsort certainly looks like a good replacement for mergesort, and in many cases for quicksort as well. I guess there might be some discussion as to whether it should silently replace mergesort in the same way that introsort replaced quicksort, but however that goes, it looks worth pursuing. The sorting benchmarks need updating per your issue on the matter, so that might be the next step here.

I find it surprising that an algorithm so complex does so well.

@charris (Member) commented Jan 1, 2019

There is another PR for radix sorting, #12586. It might be helpful to coordinate, at least for the benchmarks and tests.

@charris mentioned this pull request Jan 1, 2019
@liwt31 (Author) commented Jan 3, 2019

@charris Thanks for your reply! So it seems I can move on to the sibling algorithms then. This can probably be done in 1 or 2 weeks.

There is another PR for radix sorting, #12586. It might be helpful to coordinate, at least for the benchmarks and tests.

I see that he has updated the NumPy sorting benchmark and has noticed my little benchmark work here. If he finds any parts of my work interesting he can probably include these in his PR. I think the PR can also close #12368 which proposes to add benchmark for mergesort.

I find it surprising that an algorithm so complex does so well.

Well, the idea behind timsort isn't very complex (just like other great ideas), and I've removed some "over-complex" parts of it (primarily gallop mode during merging) for the NumPy use case.

@charris (Member) commented Jan 3, 2019

So it seems I can move on to the sibling algorithms then.

Yes, please do.

@charris (Member) commented Jan 3, 2019

PS, you might look to publish your blog post somewhere noticeable.

@liwt31 (Author) commented Jan 3, 2019

PS, you might look to publish your blog post somewhere noticeable.

Glad you like the post! I'd also like to have this piece of "new" knowledge been heard, but I can't think of any proper place. Do you have any suggestions?

@liwt31 (Author) commented Jan 8, 2019

Finished numeric argsort, string sorts and generic sort. New benchmarks here. I think the next thing to decide is whether to replace mergesort with timsort. Once that's decided, I can move on to adding docs for the new algorithm.

EDIT: Sorry for the force-push, a typo in the commit message.

@mattip (Member) commented Jan 18, 2019

This needs documentation in the 1.17 release note, in the docstrings, and in doc/source/*. You can look at what has been done for radix sort #12586.

Should the benchmarks as suggested in #12368 be merged as a condition for this PR?

@liwt31 (Author) commented Jan 21, 2019

@mattip I'm not sure whether timsort should replace mergesort. That decision is important for the docs.

OTOH, I can't add the docstring without rebasing, because when I forked there was no 1.17 release note file. I suppose it's OK for me to rebase?

Should the benchmarks as suggested in #12368 be merged as a condition for this PR?

#12368 suggests at least two ways to address the issue, and as far as I know we haven't decided yet, but this PR can probably be merged safely without updating the benchmark suite, because the code is already benchmarked (although elsewhere) and the issue can be solved in another PR.

@mattip (Member) commented Jan 21, 2019

Rebasing is fine. This is not a backport candidate to 1.16 as it is a new feature.

I don't think we should replace mergesort as the default at this time, until we have a chance to harden the code a bit, but others may feel differently.

@liwt31 (Author) commented Jan 21, 2019

Rebased and added docs, assuming mergesort is not replaced. If I missed something, or if mergesort should be replaced after all, please let me know.

@mattip (Member) commented Jan 21, 2019

@hameerabbasi any thoughts?

@hameerabbasi (Contributor):

I've been through this a few times rather quickly. I agree with @charris that we should make timsort replace mergesort... Also, this will have the side benefit of speeding up quicksort, as that falls back to mergesort as well.

Another (kind of selfish) factor is that my PR will need fewer rebase steps.

@mattip (Member) commented Jan 29, 2019

In our last weekly developer conversation, we decided that timsort should become the stable sort. I added a few comments where things need updating.

@mattip (Member) commented Jan 30, 2019

Why are the Azure builds (32-bit and macOS) failing? Rerunning them failed again, so it is not a one-off.

@liwt31 (Author) commented Jan 30, 2019

To me, it seems the "Run 32-bit Ubuntu Docker Build / Tests" job runs on a Windows host. The previous job #20190130.1 hit the same problem. Very weird...

@charris (Member) commented Jan 31, 2019

Azure is doing that to everyone at the moment; it is an Azure problem.

@mattip merged commit b5c855a into numpy:master Jan 31, 2019
@mattip (Member) commented Jan 31, 2019

Thanks @liwt31

@liwt31 (Author) commented Jan 31, 2019

Thank you all for your comments and advice.

@charris (Member) commented Jan 31, 2019

Hmm, I just noticed that this is going to break forward compatibility of extensions, because PyArray_ArrFuncs contains function pointer arrays whose size depends on NPY_NSORTS. If/when we release a NumPy 2.0 we should fix that, but I think the current upshot is that we just replace mergesort. @hameerabbasi That is going to make putting radix sort in more complicated, but I think the way forward is to make yet a third sort function that in turn calls either radixsort or timsort depending on the data type.
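The dispatch being proposed here could look roughly like the following Python sketch. This is an illustration of the idea only, with toy stand-ins: radix_sort and timsort below are hypothetical placeholders, not NumPy's C routines.

```python
def radix_sort(a, bits=16):
    """Stand-in LSD radix sort for non-negative integers < 2**bits.
    Stable, because each pass appends in input order."""
    for shift in range(0, bits, 8):
        buckets = [[] for _ in range(256)]
        for x in a:
            buckets[(x >> shift) & 0xFF].append(x)
        a = [x for b in buckets for x in b]
    return a


def timsort(a):
    """Stand-in: CPython's sorted() is itself a timsort."""
    return sorted(a)


def stable_sort(a, dtype_kind):
    """Sketch of the proposed third sort function: one 'stable' slot
    that picks radix sort or timsort based on the dtype kind."""
    if dtype_kind in ("int", "uint", "bool"):
        return radix_sort(a)
    return timsort(a)
```

In the real C code, the choice would be made once per dtype when the sort-function slot is filled in, rather than per call.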

@charris (Member) commented Jan 31, 2019

Let's call the third function stable_sort if we go that way.

EDIT: Actually, we just need to change the initialization of the pointers depending on type. The various sorts should still be independent for functions that do not call through the function pointers.

@hameerabbasi (Contributor):

Since this is most likely going into 1.17, which breaks 2.7 compatibility either way, wouldn't it be okay to push it in?

@hameerabbasi (Contributor):

Anyway, I'd be okay with replacing mergesort with radix/timsort based on the dtype as well.

@mattip (Member) commented Feb 1, 2019

wouldn't it be okay to push it in

No, since it would move the offsets of all the following functions, which would break too many things. We should have a test for the offset of the last element in each of the exported structs, to make sure this kind of change is spotted earlier...

@liwt31 (Author) commented Feb 3, 2019

Not sure I understand how and why a change in the size of PyArray_ArrFuncs affects forward compatibility. Can I simply take it that mergesort should be replaced with timsort, to keep the total number of sorting algorithms at 3?

@charris (Member) commented Feb 3, 2019

@liwt31 The offsets of the elements in the structure change, so code compiled against the old structure will access the wrong elements in the new structure. The idea of forward compatibility is that C extension modules compiled against an older NumPy will run with newer NumPy versions; what we don't guarantee is that extension modules compiled against a new NumPy run on an old NumPy. If new members are added at the end of a structure there is no problem, because old code will never access them and the offsets of the old elements don't change. Unfortunately, the arrays containing the sort function pointers are in the middle of the PyArray_ArrFuncs structure, so if they change size, all the elements after them are shifted.

The upshot is that we can have lots of sort functions, but any given data type can only use six of them: three sorts and three argsorts. I think the way to do that is to have more enum types, but fix NPY_NSORTS at three (with a comment). We cannot change the name of that macro, because that would force downstream projects to rewrite their code.
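The offset shift is easy to demonstrate with a toy struct via ctypes. To be clear, this is a hypothetical layout for illustration only, not the real PyArray_ArrFuncs: growing a mid-struct array moves every field that follows it.

```python
import ctypes

class OldFuncs(ctypes.Structure):
    # Hypothetical stand-in for a struct with NPY_NSORTS == 3 slots.
    _fields_ = [
        ("cast", ctypes.c_void_p),
        ("sort", ctypes.c_void_p * 3),
        ("compare", ctypes.c_void_p),
    ]

class NewFuncs(ctypes.Structure):
    # The same struct after naively adding a fourth sort kind.
    _fields_ = [
        ("cast", ctypes.c_void_p),
        ("sort", ctypes.c_void_p * 4),
        ("compare", ctypes.c_void_p),
    ]

# Every field after the grown array shifts by one pointer width, so an
# extension compiled against OldFuncs would read garbage from NewFuncs.
shift = NewFuncs.compare.offset - OldFuncs.compare.offset
assert shift == ctypes.sizeof(ctypes.c_void_p)
```

Appending new function tables at the end of the structure avoids this, because the offsets of all existing fields are unchanged.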

@charris (Member) commented Feb 3, 2019

Note that the PyArray_ArrFuncs structures are initialized in arraytypes.c.src.

@charris (Member) commented Feb 3, 2019

~~I think the way to do that is to have more enum types~~

Changed my mind. We should leave the enum types as they were, maybe with some added aliases, and just change the structure initializations.

@charris (Member) commented Feb 3, 2019

@liwt31 Do you want to do this? I was going to do it myself but would be happy if you wanted to pursue it.

@charris (Member) commented Feb 3, 2019

Another option is to add additional function tables to the end of the structure.

@liwt31 (Author) commented Feb 4, 2019 via email

@mattip (Member) commented Feb 7, 2019

Looking at this again, I am not sure exactly where the change manifests itself. We use PyArray_ArrFuncs in dtype construction, so any users who build their own dtypes will have a wrong-sized dtype. But:

  • Won't they need to do so in any case? How will we detect that PyArray_DESCR(arr) is missing the new functions?
  • How many of these dtypes exist in the wild? Checking SciPy and pandas, I did not find any use of PyArray_ArrFuncs.

Note that pickling or np.save/np.load across numpy versions will not be problematic since the dtypes are rebuilt (or the internal singletons reused).

@mattip (Member) commented Feb 7, 2019

xref PR #12945

@rgommers changed the title from "WIP, ENH: Add timsort to npysort" to "ENH: Add timsort to npysort" on Mar 22, 2019