docs: Improved docs on Transforms #2655

Open
wants to merge 17 commits into
base: main
37 changes: 3 additions & 34 deletions doc/user_guide/encodings/index.rst
@@ -250,7 +250,7 @@ Encoding Shorthands
For convenience, Altair allows the specification of the variable name along
with the aggregate and type within a simple shorthand string syntax.
This makes use of the type shorthand codes listed in :ref:`encoding-data-types`
as well as the aggregate names listed in :ref:`encoding-aggregates`.
as well as the aggregate names listed in :ref:`agg-func-table`.
The following table shows examples of the shorthand specification alongside
the long-form equivalent:

@@ -369,38 +369,7 @@ represents the mean of a third quantity, such as acceleration:
        color='mean(Acceleration):Q'
    )

Aggregation Functions
^^^^^^^^^^^^^^^^^^^^^

In addition to ``count`` and ``mean``, there are a large number of available
aggregation functions built into Altair:

========= =========================================================================== =====================================
Aggregate Description                                                                 Example
========= =========================================================================== =====================================
argmin    An input data object containing the minimum field value.                    N/A
argmax    An input data object containing the maximum field value.                    :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.                          :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.                               :ref:`gallery_simple_heatmap`
distinct  The count of distinct field values.                                         N/A
max       The maximum field value.                                                    :ref:`gallery_boxplot`
mean      The mean (average) field value.                                             :ref:`gallery_scatter_with_layered_histogram`
median    The median field value                                                      :ref:`gallery_boxplot`
min       The minimum field value.                                                    :ref:`gallery_boxplot`
missing   The count of null or undefined field values.                                N/A
q1        The lower quartile boundary of values.                                      :ref:`gallery_boxplot`
q3        The upper quartile boundary of values.                                      :ref:`gallery_boxplot`
ci0       The lower boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
ci1       The upper boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
stderr    The standard error of the field values.                                     N/A
stdev     The sample standard deviation of field values.                              N/A
stdevp    The population standard deviation of field values.                          N/A
sum       The sum of field values.                                                    :ref:`gallery_streamgraph`
valid     The count of field values that are not null or undefined.                   N/A
values    A list of data objects in the group.                                        N/A
variance  The sample variance of field values.                                        N/A
variancep The population variance of field values.                                    N/A
========= =========================================================================== =====================================
For a full list of available aggregates, see :ref:`agg-func-table`.


Sort Option
@@ -486,7 +455,7 @@ x-axis, using the barley dataset:
    )

The last two charts are the same because the default aggregation
(see :ref:`encoding-aggregates`) is ``mean``. To highlight the
(see :doc:`transform/aggregate`) is ``mean``. To highlight the
difference between sorting via channel and sorting via field consider the
following example where we don't aggregate the data
and use the `op` parameter to specify a different aggregation than `mean`
136 changes: 130 additions & 6 deletions doc/user_guide/transform/aggregate.rst
@@ -8,11 +8,11 @@ There are two ways to aggregate data within Altair: within the encoding itself,
or using a top level aggregate transform.

The aggregate property of a field definition can be used to compute aggregate
summary statistics (e.g., median, min, max) over groups of data.
summary statistics (e.g., :code:`median`, :code:`min`, :code:`max`) over groups of data.
Member

I do think these should have some markup, but since they aren't functions - median etc seems like the wrong choice.

Something like "median(...)" would link more closely to how you'd use it


If at least one fields in the specified encoding channels contain aggregate,
If any field in the specified encoding channels contains an aggregate,
the resulting visualization will show aggregate data. In this case, all
fields without aggregation function specified are treated as group-by fields
fields without a specified aggregation function are treated as group-by fields
in the aggregation process.
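
As a rough illustration of the group-by behaviour described above, the following plain-Python sketch mimics what happens when one field carries an aggregate and another does not. It is not Altair's implementation: the ``rows`` data and the ``aggregate_mean`` helper are hypothetical and exist only for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the cars dataset (values are made up).
rows = [
    {"Cylinders": 4, "Acceleration": 16.0},
    {"Cylinders": 4, "Acceleration": 18.0},
    {"Cylinders": 8, "Acceleration": 11.0},
    {"Cylinders": 8, "Acceleration": 13.0},
]


def aggregate_mean(rows, groupby, field):
    """Group rows by the non-aggregated field, then average the aggregated one."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row[field])
    return {key: mean(values) for key, values in groups.items()}


# "Cylinders" has no aggregate, so it acts as the group-by key.
mean_acc = aggregate_mean(rows, groupby="Cylinders", field="Acceleration")
print(mean_acc)
```

Each distinct value of the non-aggregated field produces one output record, which is why the resulting chart shows one bar per group.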

For example, the following bar chart aggregates mean of ``acceleration``,
@@ -43,9 +43,9 @@ is made available for convenience, and is equivalent to the longer form::
# ...

For more information on shorthand encodings specifications, see
:ref:`encoding-aggregates`.
:ref:`shorthand-description`.
dangotbanned marked this conversation as resolved.

The same plot can be shown using an explicitly computed aggregation, using the
The same plot can be shown via an explicitly computed aggregation, using the
:meth:`~Chart.transform_aggregate` method:

.. altair-plot::
@@ -58,7 +58,95 @@ The same plot can be shown using an explicitly computed aggregation, using the
        groupby=["Cylinders"]
    )

For a list of available aggregates, see :ref:`encoding-aggregates`.
The alternative to using aggregate functions is to preprocess the data with
Pandas, and then plot the resulting DataFrame:

.. altair-plot::

    cars_df = data.cars()
    source = (
        cars_df.groupby('Cylinders')
        .Acceleration
        .mean()
        .reset_index()
        .rename(columns={'Acceleration': 'mean_acc'})
    )

    alt.Chart(source).mark_bar().encode(
        y='Cylinders:O',
        x='mean_acc:Q'
    )

**Note:** As mentioned in :doc:`../data`, this approach of transforming the
data with Pandas is preferable if we already have the DataFrame at hand.
Comment on lines +80 to +81
Contributor

Consider 1) being more explicit about what exactly is meant by the term "at hand" and 2) being upfront in this sentence about the reason or reasons for Pandas transformations being preferable when the DataFrame is "at hand" (automatic type inference? something else also?)

Also, this suggests that data.html discusses these benefits of when a Pandas transformation is preferable, but it wasn't immediately obvious which part of this section of the docs it is referring to.

Member

Also, this suggests that data.html discusses these benefits of when a Pandas transformation is preferable, but it wasn't immediately obvious which part of this section of the docs it is referring to.

I think it should be referencing data-transformations


Because :code:`Cylinders` is of type :code:`int64` in the :code:`source`
DataFrame, Altair would have treated it as a :code:`quantitative` type (instead
of :code:`ordinal`) had we not specified it. Making the type of data
explicit is important since it affects the resulting plot; see
:ref:`type-legend-scale` and :ref:`type-axis-scale` for two illustrated
examples. As a rule of thumb, it is better to make the data type explicit,
instead of relying on an implicit type conversion.

Functions Without Arguments
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Aggregate functions can be used without arguments.
In such cases, the function will automatically aggregate
the data from the column specified in the other axis.

The following chart demonstrates this by counting the number of cars with
respect to their country of origin.

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        y='Origin:N',
        # shorthand form of alt.Y(aggregate='count')
        x='count()'
    )
Comment on lines +103 to +107
Member

The comment seems like it meant alt.X(aggregate='count'); but I think we can do without

Suggested change
    alt.Chart(cars).mark_bar().encode(
        y='Origin:N',
        # shorthand form of alt.Y(aggregate='count')
        x='count()'
    )
    alt.Chart(cars).mark_bar().encode(
        x='count()',
        y='Origin:N'
    )


**Note:** The :code:`count` aggregate function is of type
:code:`quantitative` by default, it does not matter if the source data is a
DataFrame, URL pointer, CSV file or JSON file.
Comment on lines +109 to +111
Member

Suggested change
**Note:** The :code:`count` aggregate function is of type
:code:`quantitative` by default, it does not matter if the source data is a
DataFrame, URL pointer, CSV file or JSON file.
.. note::
   The :code:`count` aggregate function is of type :code:`quantitative` by default,
   it does not matter if the source data is a DataFrame, URL pointer, CSV file or JSON file.


Functions that handle categorical data (such as :code:`count`,
:code:`missing`, :code:`distinct` and :code:`valid`) are the ones that get
the most out of this feature.
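
To make the argument-free case concrete, here is a small plain-Python sketch of what ``count()`` and ``missing(...)`` conceptually compute per group. It is only an approximation under stated assumptions, not Altair or Vega-Lite code, and the ``rows`` data is invented for the example.

```python
from collections import Counter

# Hypothetical rows; the values are invented for illustration only.
rows = [
    {"Origin": "USA", "Horsepower": 130},
    {"Origin": "USA", "Horsepower": None},
    {"Origin": "Japan", "Horsepower": 95},
]

# count(): one tally per row in each group; no field is consulted.
count = Counter(row["Origin"] for row in rows)

# missing(Horsepower): rows in each group whose field value is null.
missing = Counter(row["Origin"] for row in rows if row["Horsepower"] is None)

print(dict(count))
print({origin: missing[origin] for origin in count})
```

The ``count`` tally never looks at a field, so it needs no argument, whereas ``missing`` has to inspect a specific field to decide what to count.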

Argmin and Argmax Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :code:`argmin` and :code:`argmax` functions help you find values from
one field that correspond to the minimum or maximum values in another
field. For example, you might want to find the production budget of
movies that earned the highest gross revenue in each genre.

These functions must be used with the :meth:`~Chart.transform_aggregate`
method rather than their shorthand notations. They return objects that act
as selectors for values in other columns, rather than returning values
directly. You can think of the returned object as a dictionary where the
column serves as a key to retrieve corresponding values.


To illustrate this, let's compare the weights of cars with the highest
horsepower across different regions of origin:

.. altair-plot::

    alt.Chart(cars).mark_bar().encode(
        x='greatest_hp[Weight_in_lbs]:Q',
        y='Origin:N'
    ).transform_aggregate(
        greatest_hp='argmax(Horsepower)',
        groupby=['Origin']
    )

This visualization reveals an interesting contrast: among cars with the
highest horsepower in their respective regions, Japanese cars are notably
lighter, while American cars are substantially heavier.

See :ref:`gallery_line_chart_with_custom_legend` for another example that uses
:code:`argmax`. The :code:`argmin` function works analogously.
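
The "returned object acts like a dictionary" idea can be sketched in plain Python. The snippet below is a hedged illustration, not the Vega-Lite implementation: the ``rows`` values are made up and ``argmax_by_group`` is a hypothetical helper. For each group it keeps the whole row with the largest value of one field, and a second field is then looked up on that row.

```python
# Hypothetical rows; the numbers are invented and are not the real cars data.
rows = [
    {"Origin": "USA", "Horsepower": 230, "Weight_in_lbs": 4278},
    {"Origin": "USA", "Horsepower": 150, "Weight_in_lbs": 3436},
    {"Origin": "Japan", "Horsepower": 132, "Weight_in_lbs": 2910},
    {"Origin": "Japan", "Horsepower": 88, "Weight_in_lbs": 2130},
]


def argmax_by_group(rows, groupby, field):
    """Return, for each group, the full row holding the maximum of `field`."""
    best = {}
    for row in rows:
        key = row[groupby]
        if key not in best or row[field] > best[key][field]:
            best[key] = row
    return best


# Comparable in spirit to greatest_hp='argmax(Horsepower)', groupby=['Origin'].
greatest_hp = argmax_by_group(rows, groupby="Origin", field="Horsepower")

# Comparable in spirit to the 'greatest_hp[Weight_in_lbs]' lookup in the encoding.
weights = {origin: row["Weight_in_lbs"] for origin, row in greatest_hp.items()}
print(weights)
```

Keeping the whole row rather than a single number is what allows a different column to be read from the result afterwards.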

Transform Options
^^^^^^^^^^^^^^^^^
@@ -70,3 +158,39 @@ class, which has the following options:
The :class:`~AggregatedFieldDef` objects have the following options:

.. altair-object-table:: altair.AggregatedFieldDef

.. _agg-func-table:

List of Aggregation Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to ``count`` and ``average``, there are a large number of available
aggregation functions built into Altair; they are listed in the following table:

========= =========================================================================== =====================================
Aggregate Description                                                                 Example
========= =========================================================================== =====================================
Comment on lines +170 to +172
Contributor

The vega-lite docs appear to list these in a more logical (if implicit) order, starting with count-related functions (including count, valid, values, missing, and distinct), moving to basic mathematical operations (sum, product), then to central tendency measures (mean/average, variance/variancep, stdev/stdevp, stderr, median), followed by distribution statistics (q1, q3, ci0, ci1), and finally ending with range functions (min/argmin, max/argmax). The ordering here appears to be in alphabetical order, though it's not strictly so (e.g. ci01). I would have a slight preference for the vega-lite-style functional organization scheme (and with explicit headings for the categories).

Member

I agree on changing the order.

I'd probably need to see the end result of adding categories though.
The naive approach of just adding a category field would add a lot of repetition

argmin    An input data object containing the minimum field value.                    N/A
argmax    An input data object containing the maximum field value.                    :ref:`gallery_line_chart_with_custom_legend`
average   The mean (average) field value. Identical to mean.                          :ref:`gallery_layer_line_color_rule`
count     The total count of data objects in the group.                               :ref:`gallery_simple_heatmap`
Contributor

Vega-Lite docs also state

Note: ‘count’ operates directly on the input objects and return the same value regardless of the provided field.

Just mentioning in case it's worth adding here as well?

Member

Vega-Lite docs also state

Note: ‘count’ operates directly on the input objects and return the same value regardless of the provided field.

Just mentioning in case it's worth adding here as well?

Maybe that phrasing could replace

"... in the other axis" (#2655 (comment))

distinct  The count of distinct field values.                                         N/A
max       The maximum field value.                                                    :ref:`gallery_boxplot`
mean      The mean (average) field value.                                             :ref:`gallery_scatter_with_layered_histogram`
median    The median field value.                                                     :ref:`gallery_boxplot`
min       The minimum field value.                                                    :ref:`gallery_boxplot`
missing   The count of null or undefined field values.                                N/A
q1        The lower quartile boundary of values.                                      :ref:`gallery_boxplot`
q3        The upper quartile boundary of values.                                      :ref:`gallery_boxplot`
ci0       The lower boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
ci1       The upper boundary of the bootstrapped 95% confidence interval of the mean. :ref:`gallery_sorted_error_bars_with_ci`
stderr    The standard error of the field values.                                     N/A
stdev     The sample standard deviation of field values.                              N/A
stdevp    The population standard deviation of field values.                          N/A
sum       The sum of field values.                                                    :ref:`gallery_streamgraph`
product   The product of field values.                                                N/A
valid     The count of field values that are not null or undefined.                   N/A
values    A list of data objects in the group.                                        N/A
variance  The sample variance of field values.                                        N/A
variancep The population variance of field values.                                    N/A
========= =========================================================================== =====================================
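
The sample and population variants in the table differ only in their divisor. The sketch below uses Python's standard ``statistics`` module on made-up numbers to show the distinction; it is an illustration of the definitions rather than of how Vega-Lite computes them. The ``stderr`` line assumes the usual definition (sample standard deviation divided by the square root of the number of values).

```python
import math
import statistics

# Made-up values chosen so the population statistics come out as round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(values),
    "variance": statistics.variance(values),    # sample: divides by n - 1
    "variancep": statistics.pvariance(values),  # population: divides by n
    "stdev": statistics.stdev(values),          # square root of the sample variance
    "stdevp": statistics.pstdev(values),        # square root of the population variance
}

# Assumed definition of the standard error: sample stdev / sqrt(n).
summary["stderr"] = summary["stdev"] / math.sqrt(len(values))

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With only eight values the two variances differ noticeably (32/7 versus 4); the gap shrinks as the number of values grows.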