Conversation
|
Thanks for working on this. Unfortunately, I'm not really convinced with the approach.
What's the use-case for this? Maybe I'm biased by how NumPy does things, but when I think of a histogram, I generally think of the bins as subdividing a space by splitting each axis at specific edges. So, in the 2-D case, I would expect to divide the space like this: but generally not this: or overlapping bins. I suppose I can think of some plausible uses for weird bins, but I think it makes more sense to make the common case (rectangular bins defined by splitting the axes at edges) fast and simple to use. In other words, I like NumPy's edge-based representation. So, for example, for the 3-D case, we'd need edges for three axes, so we could represent this as I like this NumPy-style representation better than
So, what I'd suggest is something like this: /// Wrapper around `Array1` that makes sure the elements are in ascending order.
struct Edges<A: Ord> {
edges: Array1<A>,
}
impl<A: Ord> From<Array1<A>> for Edges<A> {
fn from(mut edges: Array1<A>) -> Self {
// sort the array in-place
Edges { edges }
}
}
impl<A: Ord> Edges<A> {
fn view(&self) -> ArrayView1<A> {
self.edges.view()
}
/// Returns the index of the bin containing the given value,
/// or `None` if none of the bins contain the value.
fn bin_index(&self, value: &A) -> Option<usize> {
// binary search for the correct bin
}
/// Returns the range of the bin containing the given value.
fn bin_range(&self, value: &A) -> Option<Range<A>>
where
A: Clone,
{
let i = self.bin_index(value);
Range { start: self.edges[i].clone(), end: self.edges[i + 1].clone() }
}
}
struct HistogramCounts<A: Ord> {
counts: ArrayD<usize>,
edges: Vec<Edges<A>>,
}
struct HistogramDensity<A: Ord> {
density: ArrayD<A>,
edges: Vec<Edges<A>>,
}
impl<A: Ord> HistogramCounts<A> {
pub fn new(edges: Vec<Edges<A>>) -> Self {
let counts = ArrayD::zeros(edges.iter().map(|e| e.len() - 1).collect::<Vec<_>>());
HistogramCounts { counts, edges }
}
pub fn add_observation(observation: ArrayView1<A>) -> Result<(), NotFound> {
let bin = observation
.iter()
.zip(&self.edges)
.map(|(v, e)| e.bin_index(v).ok_or(NotFound))
.collect::<Result<Vec<_>, _>>()?;
self.counts[bin] += 1;
}
}A few possible modifications:
What are your thoughts? |
|
Well, I think is an understatement to say that I didn't think it through and that your comment makes total sense 😅 I wanted to allow arbitrary meshes, but we should definitely prioritize memory-consumption and performance with respect to the most common use cases. I'll start over from your skeleton! |
It took me a while, but I managed to pull together a good candidate for histogram-related functionalities (
histogram,histogram2dandhistogramddin Numpy).Bins are formalized in their own submodule,
bins:Bin1d, basic building block, reusesRange*structs from the standard library;Bins1d, collection ofBin1d;BinNd, n-dimensional bin. It supports any region in an n-dimensional space that can be expressed as a product of intervals;BinsNd, collection of n-dimensional bins.With
BinsNdwe can express collections of bins that are more general than the ones supported by NumPy.Two top-level traits are provided to implement an
histogrammethod, forArrayBaseand one-dimensionalArrayBase. Right now the computational complexity is the same, apart from a slightly different API (pointbeingArrayBase<Data<Elem=T>>or a&T, but I plan to enhanceBin1d::containsto enable an implementation ofBins1d::findthat uses binary search. This will probably require to enforce non-overlapping bins at creation time forBins1d.I am now going to implement all the different string options for
binsin np.histogram_bin_edges as builders, but I wanted to get the PR out to get your feedback on the direction @jturner314