Added density reducer to bin/group/hexbin #2047

Open · wants to merge 2 commits into main
5 changes: 3 additions & 2 deletions docs/transforms/bin.md
@@ -192,7 +192,7 @@ Plot.plot({
```
:::

The bin transform works with Plot’s [faceting system](../features/facets.md), partitioning bins by facet. Below, we compare the weight distributions of athletes within each sport using the *proportion-facet* reducer. Sports are sorted by median weight: gymnasts tend to be the lightest, and basketball players the heaviest.
The bin transform works with Plot’s [faceting system](../features/facets.md), partitioning bins by facet. Below, we compare the weight distributions of athletes within each sport using the *density* reducer. Sports are sorted by median weight: gymnasts tend to be the lightest, and basketball players the heaviest.

:::plot defer
```js-vue
@@ -202,7 +202,7 @@ Plot.plot({
x: {grid: true},
fy: {domain: d3.groupSort(olympians.filter((d) => d.weight), (g) => d3.median(g, (d) => d.weight), (d) => d.sport)},
color: {scheme: "{{$dark ? "turbo" : "YlGnBu"}}"},
marks: [Plot.rect(olympians, Plot.binX({fill: "proportion-facet"}, {x: "weight", fy: "sport", inset: 0.5}))]
marks: [Plot.rect(olympians, Plot.binX({fill: "density"}, {x: "weight", fy: "sport", inset: 0.5}))]
})
```
:::
@@ -253,6 +253,7 @@ The following named reducers are supported:
* *first* - the first value, in input order
* *last* - the last value, in input order
* *count* - the number of elements (frequency)
* *density* - the number of elements normalized so that the total area is 1
* *distinct* - the number of distinct values
* *sum* - the sum of values
* *proportion* - the sum proportional to the overall total (weighted frequency)
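A quick usage sketch (mine, not part of the diff; olympians is the dataset used in the docs excerpt above): with this branch applied, the new reducer can drive a normalized histogram anywhere a named reducer is accepted.

Plot.rectY(olympians, Plot.binX({y: "density"}, {x: "weight"})).plot()

Each bin's count is divided by the group total and by the bin width, so the bars trace an estimate of the probability density of weight.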
1 change: 1 addition & 0 deletions docs/transforms/group.md
@@ -354,6 +354,7 @@ The following named reducers are supported:
* *first* - the first value, in input order
* *last* - the last value, in input order
* *count* - the number of elements (frequency)
* *density* - the number of elements normalized so that the total area is 1
* *sum* - the sum of values
* *proportion* - the sum proportional to the overall total (weighted frequency)
* *proportion-facet* - the sum proportional to the facet total
1 change: 1 addition & 0 deletions docs/transforms/hexbin.md
@@ -183,6 +183,7 @@ The following named reducers are supported:
* *first* - the first value, in input order
* *last* - the last value, in input order
* *count* - the number of elements (frequency)
* *density* - the number of elements normalized so that the total area is 1
* *distinct* - the number of distinct values
* *sum* - the sum of values
* *proportion* - the sum proportional to the overall total (weighted frequency)
2 changes: 2 additions & 0 deletions src/reducer.d.ts
@@ -13,6 +13,7 @@ export type ReducerPercentile =
* - *first* - the first value, in input order
* - *last* - the last value, in input order
* - *count* - the number of elements (frequency)
* - *density* - the number of elements normalized so that the total area is 1
* - *distinct* - the number of distinct values
* - *sum* - the sum of values
* - *proportion* - the sum proportional to the overall total (weighted frequency)
@@ -36,6 +37,7 @@ export type ReducerName =
| "last"
| "identity"
| "count"
| "density"
| "distinct"
| "sum"
| "proportion"
1 change: 1 addition & 0 deletions src/transforms/bin.js
@@ -179,6 +179,7 @@ function binn(
if (sort) sort.scope("facet", facet);
if (filter) filter.scope("facet", facet);
for (const [f, I] of maybeGroup(facet, G)) {
for (const o of outputs) o.scope("group", I);
for (const [k, g] of maybeGroup(I, K)) {
for (const [b, extent] of bin(g)) {
if (G) extent.z = f;
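For context on the added o.scope("group", I) call: it mirrors the facet-level scope calls just above it. The sketch below is a hypothetical paraphrase of how an output channel's scope and reduce methods might cooperate (not Plot's actual source): scope() caches the reducer's value over the scoped index, and reduce() hands it back as context, which is why the reduceDensity reducer further down returns I.length when context is undefined.

function makeOutput(reducer, values) {
  let context; // set by scope(), consumed by reduce()
  return {
    scope(scope, index) {
      // cache the reducer's value over the whole facet or group
      if (scope === reducer.scope) context = reducer.reduceIndex(index, values);
    },
    reduce(index, extent) {
      // per-bin call: the cached group total arrives as the third argument
      return reducer.reduceIndex(index, values, context, extent);
    }
  };
}

Under that assumption, the group-level call computes the group's total count, and each per-bin call divides by it.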
15 changes: 15 additions & 0 deletions src/transforms/group.js
@@ -131,6 +131,7 @@ function groupn(
if (sort) sort.scope("facet", facet);
if (filter) filter.scope("facet", facet);
for (const [f, I] of maybeGroup(facet, G)) {
for (const o of outputs) o.scope("group", I);
for (const [y, gg] of maybeGroup(I, Y)) {
for (const [x, g] of maybeGroup(gg, X)) {
const extent = {data};
@@ -248,6 +249,8 @@ export function maybeReduce(reduce, value, fallback = invalidReduce) {
return reduceIdentity;
case "count":
return reduceCount;
case "density":
return reduceDensity;
case "distinct":
return reduceDistinct;
case "sum":
@@ -405,6 +408,18 @@ export const reduceCount = {
}
};

export const reduceDensity = {
label: "Density",
scope: "group",
reduceIndex(I, V, context, extent) {
if (context === undefined) return I.length;
var proportion = I.length / context;
if ("y2" in extent) proportion /= extent.y2 - extent.y1;
if ("x2" in extent) proportion /= extent.x2 - extent.x1;
Contributor

Suggested change:
-if ("y2" in extent) proportion /= extent.y2 - extent.y1;
-if ("x2" in extent) proportion /= extent.x2 - extent.x1;
+if ("y2" in extent && !("x2" in extent)) proportion /= extent.y2 - extent.y1;
+else if ("x2" in extent && !("y2" in extent)) proportion /= extent.x2 - extent.x1;

Contributor

We also need to check whether the value is null (as in reduceProportion) and, if it is not, use a weighted formula.
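A minimal sketch of what that might look like (my own naming; it follows the shape of reduceDensity below and assumes V is the reducer's value channel, as in reduceProportion):

import {sum} from "d3";

// Sketch only: when a value channel V is present, weight each element by its value
// instead of counting it (d3.sum skips null and NaN values). The group-level scope
// call then caches the weighted total, which later arrives here as context.
const reduceDensityWeighted = {
  label: "Density",
  scope: "group",
  reduceIndex(I, V, context, extent) {
    const total = V ? sum(I, (i) => V[i]) : I.length;
    if (context === undefined) return total;
    let density = total / context;
    if ("y2" in extent) density /= extent.y2 - extent.y1;
    if ("x2" in extent) density /= extent.x2 - extent.x1;
    return density;
  }
};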

Author

I think when binning on both x and y it can also make sense to normalise to total area 1; see the example notebook, where the fill is the density, which is compared to a Gaussian distribution. Without this normalisation, the density goes to zero as the bin size shrinks. For this reason I think it makes more sense to do

if ("y2" in extent) proportion /= extent.y2 - extent.y1;
if ("x2" in extent) proportion /= extent.x2 - extent.x1;

Contributor (@Fil, Apr 16, 2024)

can you share the example?

If we want this to be consistent, then we should also normalize by hexagon area (binWidth**2) when doing hexbin?

Author

Sorry, it should be public now.

Regarding hexbin: that would probably be consistent, although the right thing to divide by would be the area of the bin measured in x/y data units. However, binWidth is in pixels, right?

Author

Doing it consistently for hexbin would require somehow exposing the x- and y-radii of the hexagonal cells. Basically, this would amount to changing hbin in src/transforms/hexbin.js to the following (additions highlighted with //NEW):

function hbin(data, I, X, Y, dx, sx, sy) {
    // ...
    if (bin === undefined) {
      const x = (pi + (pj & 1) / 2) * dx + ox,
        y = pj * dy + oy;
      bin = {
        index: [],
        extent: {
          data,
          x,
          y,
          rx: sx.invert(x + dx / 2) - sx.invert(x), //NEW
          ry: sy.invert(y - dy / 2) - sy.invert(y) //NEW
        }
      };
      bins.set(key, bin);
    }
    bin.index.push(i);
  }
  return bins.values();
}

and calling it with hbin(data, I, X, Y, binWidth, scales.x, scales.y).

Then the correct normalisation in reduceDensity would be

if ("rx" in extent) proportion /= 3 * extent.rx * extent.ry;

according to the area of a [hexagon](https://en.wikipedia.org/wiki/Hexagon).

However, this adds some computational overhead, since rx and ry are computed for each hexagonal cell even when they are not needed. For linear scales rx and ry are the same for every cell, but I am not sure whether the scale functions expose the fact that they are linear.

Contributor

Thanks for researching this. Note that in fact we can't guarantee anything about the x and y scales (they might not be invertible, and they might even not exist, for example when we use a projection).

This makes me think that the normalization by x2 - x1 that we do here does not in fact guarantee that the scaling is correct: it will only work if x is linear (or non-existent, since identity is linear). That's the main use case we wanted to solve, but it doesn't generalize: if your histogram's base axis uses a log scale, the results will be inconsistent.

Author

Hmm, I think the issue is that this density reduction makes sense before the scales are applied, but probably not afterwards. As far as I can see, this differs between bin and hexbin.

What I mean is that for bin the density reducer as a concept also makes sense for, e.g., logarithmic scales; see an example here: https://observablehq.com/d/0ef7cf5eae234601 (sorry for the dots, I don't know how to produce bars on a log scale). Here the reducer is applied before the scaling; the scale only influences the final appearance of the plot. Hence the reducer guarantees that the area under the curve is 1 before the scaling, not after it. I would argue that this is the desired behaviour, because the goal is to compare the binned density with some real normalised density, irrespective of the plot scale.
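In symbols (my paraphrase of the argument): with N points in the group, bin counts n_b, and bin widths Δx_b = x2 - x1 measured in data coordinates, the reducer returns n_b / (N Δx_b), so the total area in data coordinates is

$$\sum_b \frac{n_b}{N\,\Delta x_b}\,\Delta x_b \;=\; \frac{1}{N}\sum_b n_b \;=\; 1,$$

whatever scale is later applied to the x axis.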

Contributor

I would argue the opposite :) If my data warrants using a log scale, it's because the thing that I am interested in analyzing is the log of the value. Scaling is not meant to make things pretty, but to connect to a deeper “truth” about the data.

This is what we do with the linear regression mark: if you use log scales, you get a linear model of the log of the values, which is equivalent to a log regression (or log-log regression if both scales are logarithmic).

To a certain extent, this is also what we do with hexbins: these hexagons are on the scaled values, and do not represent anything meaningful in data space (unless you use similar scales, for example with an aspectRatio of 1).

Linear binning is very different, as it operates in data space: the x/y bins are "rectangles", not squares. And if you want homogeneous rectangles you need to manually adapt the bins so that they are equally spaced (e.g. 1, 2, 4, 8, 16, 32, 64…); this is not provided by the defaults.

And, once you do that, the division by the extent of each bin should end up with the correct result? Here's what it would look like:

Plot.plot({
  marks: [
    Plot.rectY(
      pts,
      Plot.binX(
        { y: "density" },
        { x: "value", fill: "lambda", thresholds: d3.ticks(-3, 10, 20).map(d => 2**d), opacity: 0.5 }
      )
    ),
    Plot.line(density, { x: "x", y: "rho", stroke: "lambda" })
  ],
  x: { type: "log" },
  title: "Log Scale",
  grid: true,
  color: { legend: true, type: "categorical" }
})

Author

I agree that the log scale is used to reveal some truth about the data, like exponential scaling.

What I meant to argue is that the reducer I am suggesting, i.e. "density of values normalised to have total area 1", should be normalised such that the area measured before the scaling, not after it, is equal to 1.

In my notebook I now have the same chart in three coordinate systems. All three charts use the "density" reducer as implemented in my pull request, i.e. with proportion /= extent.x2 - extent.x1, which is applied before the log scale. Note that the areas look different in size: in the first plot yellow has a larger area than blue, while in the second plot it is the other way round. However, in the space of real coordinates all areas have exactly the same size of 1, and this is what I would like to ensure.

In the applications I have in mind I plot a histogram of values (in various scalings) and compare it to the PDF of some probability measure (like exponential, Gaussian, Weibull). For this to make sense it is important that the "area under the curve" measured in the "real" coordinate system (not, e.g., the log coordinate system) is equal to 1. This is, for instance, also how it is done in matplotlib with plt.hist(Y, bins=100, density=True, log=True).

I understand that this becomes a bit strange for hexbin, because hexbin does the binning on the scaled values, unlike bin, which (by default) bins the raw values (while allowing custom thresholds that can be chosen to match the scale). For a density reducer to make sense for hexbin, there would have to be a canonical way of determining the extent of each bin in real coordinate space.

return proportion;
}
};

const reduceDistinct = {
label: "Distinct",
reduceIndex(I, X) {
1 change: 1 addition & 0 deletions src/transforms/hexbin.js
@@ -64,6 +64,7 @@ export function hexbin(outputs = {fill: "count"}, {binWidth, ...options} = {}) {
const binFacet = [];
for (const o of outputs) o.scope("facet", facet);
for (const [f, I] of maybeGroup(facet, G)) {
for (const o of outputs) o.scope("group", I);
for (const {index: b, extent} of hbin(data, I, X, Y, binWidth)) {
binFacet.push(++i);
BX.push(extent.x);
178 changes: 178 additions & 0 deletions test/output/densityReducer.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion test/plots/athletes-sport-weight.ts
@@ -8,7 +8,7 @@ export async function athletesSportWeight() {
grid: true,
color: {scheme: "YlGnBu", zero: true},
marks: [
Plot.barX(athletes, Plot.binX({fill: "proportion-facet"}, {x: "weight", fy: "sport", thresholds: 60})),
Plot.barX(athletes, Plot.binX({fill: "density"}, {x: "weight", fy: "sport", thresholds: 60})),
Plot.frame({anchor: "bottom", facetAnchor: "bottom"})
]
});
46 changes: 46 additions & 0 deletions test/plots/density-reducer.ts
@@ -0,0 +1,46 @@
import * as Plot from "@observablehq/plot";
import * as d3 from "d3";

const pdf_normal = (x, mu = 0, sigma = 1) =>
Math.exp(-0.5 * Math.pow((x - mu) / sigma, 2)) / (sigma * Math.sqrt(2 * Math.PI));

const densities = d3
.range(-6, 10, 0.1)
.map((x) => [0, 3].map((mu) => [1, 2].map((sigma) => ({x, mu, sigma, rho: pdf_normal(x, mu, sigma)}))))
.flat(3);

const n_pts = 100000;

const mus = Array.from({length: n_pts}, d3.randomBernoulli.source(d3.randomLcg(42))(0.2)).map((x) => 3 * x);
const sigmas = Array.from({length: n_pts}, d3.randomBernoulli.source(d3.randomLcg(43))(0.3)).map((x) => 1 + x);
const standardNormals = Array.from({length: n_pts}, d3.randomNormal.source(d3.randomLcg(44))(0, 1)).map(
(x, i) => x * sigmas[i] + mus[i]
);

const pts = standardNormals.map((value, i) => ({mu: mus[i], sigma: sigmas[i], value}));

export async function densityReducer() {
return Plot.plot({
marks: [
Plot.areaY(
pts,
Plot.binX(
{y2: "density"},
{
x: "value",
fill: (x) => `μ = ${x.mu}`,
opacity: 0.5,
fy: (x) => `σ = ${x.sigma}`,
interval: 0.2,
curve: "step"
}
)
),
Plot.line(densities, {x: "x", y: "rho", stroke: (x) => `μ = ${x.mu}`, fy: (x) => `σ = ${x.sigma}`}),
Plot.ruleY([0])
],
fy: {label: null},
color: {legend: true, type: "categorical"},
grid: true
});
}
2 changes: 1 addition & 1 deletion test/plots/hexbin-r.ts
@@ -17,7 +17,7 @@ export async function hexbinR() {
marks: [
Plot.frame(),
Plot.hexgrid(),
Plot.dot(penguins, Plot.hexbin({title: "count", r: "count", fill: "proportion-facet"}, xy))
Plot.dot(penguins, Plot.hexbin({title: "count", r: "count", fill: "density"}, xy))
]
});
}
1 change: 1 addition & 0 deletions test/plots/index.ts
@@ -65,6 +65,7 @@ export * from "./d3-survey-2015-comfort.js";
export * from "./d3-survey-2015-why.js";
export * from "./darker-dodge.js";
export * from "./decathlon.js";
export * from "./density-reducer.js";
export * from "./diamonds-boxplot.js";
export * from "./diamonds-carat-price-dots.js";
export * from "./diamonds-carat-price.js";