Skip to content

BUG: Add timsort without breaking the API.#12945

Merged
mattip merged 2 commits intonumpy:masterfrom
charris:fix-timsort-compat
Feb 13, 2019
Merged

BUG: Add timsort without breaking the API.#12945
mattip merged 2 commits intonumpy:masterfrom
charris:fix-timsort-compat

Conversation

@charris
Copy link
Member

@charris charris commented Feb 7, 2019

In order to maintain forward compatibility it is necessary to keep the
size of PyArray_ArrFuncs struct fixed. The usual trick of adding new
elements to the end of the structure is not available in this case
because the struct may be instanciated by user types and we have no way
to know whether the new or old struct is in play.

The solution adopted here is the reuse the (a)mergesort slots for stable
sorts of all kinds, with the actual kind set when the struct is
initialized. The '(a)mergesort' option thus becomes an alias for
'stable', but we keep it for backwards compatibility.

In order to maintain forward compatibility it is necessary to keep the
size of PyArray_ArrFuncs struct fixed. The usual trick of adding new
elements to the end of the structure is not available in this case
because the struct may be instanciated by user types and we have no way
to know whether the new or old struct is in play.

The solution adopted here is the reuse the (a)mergesort slots for stable
sorts of all kinds, with the actual kind set when the struct is
initialized. The '(a)mergesort' option thus becomes an alias for
'stable', but we keep it for backwards compatibility.
@charris
Copy link
Member Author

charris commented Feb 7, 2019

There will be another commit with documentation updates, but I thought it would get this up for discussion of the approach taken to maintain forward compatibiity.

@charris
Copy link
Member Author

charris commented Feb 7, 2019

Github testing seems to be down.

EDIT: Looks like a clock problem.

@mattip mattip mentioned this pull request Feb 7, 2019
2 tasks
@mattip
Copy link
Member

mattip commented Feb 7, 2019

Repeating comment from #12418

Looking at this again, I am not sure exactly where the change manifests itself. We use PyArray_ArrFuncs in dtype construction, so any users who build their own dtypes will have the wrong sized dtype. But are there any such dtypes in the wild, other than quaternions? Checking Scipy and Pandas I did not find any use of PyArray_ArrFuncs.

Can we perhaps increment NPY_USERDEF in such a way as to indicate which verison of numpy was used to create this dtype, and when checking the dtype->typenum for user defined dtypes warn or error on use of incompatible old dtypes?

Note that pickling or np.save/np.load across numpy versions will not be problematic since the dtypes are rebuilt (or the internal singletons reused).

@mattip
Copy link
Member

mattip commented Feb 7, 2019

Note the same problem exists in PR #12585 where we change the PyArray_Descr struct

@charris
Copy link
Member Author

charris commented Feb 7, 2019

checking the dtype->typenum for user defined dtypes warn or error on use of incompatible old dtypes?

It is possible user types could have either the old or new versions, so that doesn't really tell you what version is in use. Best would be if the structure was versioned, but it isn't. We do need to keep this problem in mind when designing the new types, we might want to make sure everything is versioned or has spare slots if we stick to anything close to the current implementation. Either that, or have the user supply a table of getter functions rather than a table of functions. Any way, needs thought.

I did start a PR adding pointers to function tables to the end of the PyArray_ArrFuncs, and if numpy were the sole supplier of instances that would work just fine. The give away was the PyArray_InitArrFuncs in usertypes.c, a completely useless function to my mind, but in the NumPy API along with PyArray_RegisterDataType, that also expects the new datatype to come with its own instance of PyArray_ArrFuncs.

I think this PR does supply a usable workaround that doesn't break the API, but it is here for discussion :)

@charris
Copy link
Member Author

charris commented Feb 7, 2019

Another possibility is to have Numpy supply the instances to the user through a function call and the user could then fill them in, that would be like the current setup and new elements could be added to the end of the structures as we currently do. The trick here is to ensure that NumPy is the sole supplier of instances.

EDIT:

That could be a safe way forward, perhaps starting with a deprecation of the current functions after an alternative is available. Or we could just figure there aren't enough folks implementing user types to worry about, and just change things. In any case, I think we should set things up so we can modify the structures when we need to.

@charris
Copy link
Member Author

charris commented Feb 8, 2019

I did not find any use of PyArray_ArrFuncs.

_rational_tests.c.src uses them, Intel asked about implementing them, and I don't know who else. We could just change things -- and we might as well make the NumPy 2.0.0 -- with the proviso that only user types will be affected and will not be forward compatible, but will need to be recompiled with the new version.

Might be interesting to see which of the packages in Anaconda have user types.

Update the sorting documentation to reflect the reality that 'mergesort'
and 'stable' are aliases and may refer to any of several stable sorting
algorithms depending on data type.

[ci skip]
@charris charris changed the title WIP, BUG: Add timsort without breaking the API. BUG: Add timsort without breaking the API. Feb 9, 2019
@mattip
Copy link
Member

mattip commented Feb 12, 2019

I tried a version of quaternion compiled against numpy.1.13 and a version of numpy with the changed PyArray_ArrFuncs. There is indeed a problem with PyArray_InitArrFuncs and PyArray_RegisterDataType. I found additional uses of those functions (beyond quaternion). These examples do not override any functionality of dtypes, and might be better served by using the approach taken by pyybind11, which wraps structs via the buffer protocol using the format string

  • xdress which generates dtypes from c/c++ objects, last updated 5 years ago
  • PyTAPS, Python bindings for ITAPS interfaces, last updated 6 years ago

There may be more hiding in private reops somewhere.

... have Numpy supply the instances to the user through a function call and the user could then fill them in, that would be like the current setup and new elements could be added to the end of the structures as we currently do. The trick here is to ensure that NumPy is the sole supplier of instances.

This seems like good practice. Could we

  • Leave the current PyArray_InitArrFuncs which will now zero out a struct the size of the old PyArray_ArrFuncs
  • Leave the current PyArray_RegisterDataType which would copy the appropriate functions to a newly allocated structure.
  • Add a PyArrDescr_From( ???) which would act like the PyUFunc_From* family of functions. We would still need to define the arguments to this new function. It would already register the descr and assign a user typenum.

We already hide the cast functions behind PyArray_RegisterCastFunc and PyArray_RegisterCanCast, perhaps we should hide the other PyArray_Descr fields behind similar functions?

@charris
Copy link
Member Author

charris commented Feb 12, 2019

Let's move this discussion somewhere else, maybe to the dtype NEP and on the list. I'd really appreciate it if some of the CS types would participate.

Anyway, I think the upshot is that we need this PR, which also allows for the addition of radixsort, and if we want all the different sorts user selectable, that can be done later when the machinery is there. The current sorts can be accessed by users, they are in a library.

@mattip mattip merged commit dea8580 into numpy:master Feb 13, 2019
@mattip
Copy link
Member

mattip commented Feb 13, 2019

Thanks Chuck.

@charris charris deleted the fix-timsort-compat branch February 16, 2019 15:52
@charris charris mentioned this pull request Apr 9, 2019
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments