ENH: Add support for executing UDF's using Bodo as the engine #60668

scott-routledge2 · 2025-01-06T22:17:55Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Applying User Defined Functions (UDFs) to a DataFrame can be very slow when evaluated using the default Python engine. Passing engine="numba" and leveraging Numba's Just-in-Time (JIT) compiler to transform the UDF application into an optimized binary can improve performance. However, the Numba UDF engine has several limitations, including:

Limited set of dtypes supported (only NumPy dtypes; ExtensionDtypes are not supported)
Parallel execution not supported (unless raw=True)
Difficulty troubleshooting issues due to lengthy stack traces and hard-to-read error messages.

Adding support for the Bodo engine would solve the above issues and provide a good complement to the capabilities of the currently supported engines (Python and Numba).

Bodo uses an auto-parallelizing JIT compiler to transform Python code into highly optimized, parallel binaries with an MPI backend, allowing it to scale to very large data sizes with minimal extra work required from the user (large speedups on both laptops and clusters). Bodo is also built for Pandas and supports DataFrame, Series and Array Extension types natively.

Feature Description

Allow passing the value "bodo" to the engine parameter in DataFrame.apply, and add an apply method for Bodo (apply_series_bodo in the snippet below) that accepts the user-defined function, creates a JIT function to perform the apply, and calls it. For example, in pandas/core/apply.py:

class FrameApply(NDFrameApply):
    ...
    def apply_series_bodo(self) -> DataFrame | Series:
        bodo = import_optional_dependency("bodo")

        engine_kwargs = bodo_get_jit_arguments(self.engine_kwargs)

        @bodo.jit(**engine_kwargs)
        def do_apply(obj, func, axis):
            return obj.apply(func, axis)

        result = do_apply(self.obj, self.func, self.axis)
        return result

This approach could also be applied to other APIs that accept a UDF and an engine argument.
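To make the dispatch idea concrete, here is a minimal standalone sketch of how an engine string could select an execution path. It is not pandas code: `apply_with_engine`, `import_optional`, and the `_ENGINES` registry are hypothetical names invented for illustration, and the Bodo branch simply mirrors the snippet above without verifying that Bodo can compile every input.

```python
import importlib
from typing import Any, Callable


def import_optional(name: str) -> Any:
    """Import an optional engine package, with a clear error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"engine={name!r} was requested but {name!r} is not installed"
        ) from err


def _apply_python(obj: Any, func: Callable, axis: int) -> Any:
    # Default path: defer to the object's own apply implementation.
    return obj.apply(func, axis=axis)


def _apply_bodo(obj: Any, func: Callable, axis: int) -> Any:
    # Mirrors the proposal: wrap the apply call in a Bodo JIT function.
    # Bodo is imported lazily so it stays an optional dependency.
    bodo = import_optional("bodo")

    @bodo.jit
    def do_apply(obj, func, axis):
        return obj.apply(func, axis)

    return do_apply(obj, func, axis)


# Hypothetical registry (an assumption of this sketch, not a pandas feature).
_ENGINES = {"python": _apply_python, "bodo": _apply_bodo}


def apply_with_engine(obj: Any, func: Callable, axis: int = 0, engine: str = "python") -> Any:
    """Route an apply call to the runner registered for the given engine."""
    try:
        runner = _ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {sorted(_ENGINES)}"
        ) from None
    return runner(obj, func, axis)
```

With a layout like this, switching engines is a one-argument change at the call site, and a missing optional package surfaces as an ImportError only when that engine is actually requested.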

Alternative Solutions

Users could execute their UDF using a Bodo JIT'd function. For example:

import bodo
import pandas as pd

def f(x):
    return x.A // x.B if x.B != 0 else 0

@bodo.jit
def apply_udf(df, func):
    return df.apply(func, axis=1)

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [0, 1, 2, 2, 2]})

result = apply_udf(df, f)

While this approach works, it has its downsides: it requires a larger code rewrite, which could make it more difficult to quickly experiment with different engines.
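As a rough illustration of what experimenting with different engines involves, the sketch below times several candidate ways of running the same UDF and checks that they agree. `compare_engines` is a hypothetical helper, not part of pandas or Bodo, and the demo uses plain Python objects in place of DataFrame rows so that it runs without either library.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable


def compare_engines(
    runners: dict[str, Callable[[], Any]],
    same: Callable[[Any, Any], bool] = lambda a, b: a == b,
) -> dict[str, dict[str, Any]]:
    """Run each candidate once, time it, and compare it to the first result."""
    report: dict[str, dict[str, Any]] = {}
    baseline = None
    for i, (name, run) in enumerate(runners.items()):
        start = time.perf_counter()
        result = run()
        elapsed = time.perf_counter() - start
        if i == 0:
            baseline = result  # the first runner is the reference
        report[name] = {
            "seconds": elapsed,
            "matches_baseline": same(baseline, result),
        }
    return report


def f(x):
    # Same UDF as in the issue: integer division guarded against zero.
    return x.A // x.B if x.B != 0 else 0


# Stand-in rows with the same values as the example DataFrame.
rows = [
    SimpleNamespace(A=a, B=b)
    for a, b in zip([1, 2, 3, 4, 5], [0, 1, 2, 2, 2])
]

report = compare_engines(
    {
        "loop": lambda: [f(r) for r in rows],
        "map": lambda: list(map(f, rows)),
    }
)
```

With pandas and Bodo installed, the runners could instead be `lambda: df.apply(f, axis=1)` and `lambda: apply_udf(df, f)`, with `same` set to something like `lambda a, b: a.equals(b)`. A single timed call to a JIT-compiled function generally includes compilation time, so a fair comparison would probably time a second call as well.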

Additional Context

Relevant links:
Bodo's documentation
Bodo's github repo
Proof-of-concept PR that adds support for engine="bodo" in df.apply.

jbrockmendel · 2025-01-07T05:23:59Z

I don’t see why this needs to live in pandas instead of something like bodo.apply(func, df). Same amount of changed code for users.

scott-routledge2 added the Enhancement and Needs Triage labels Jan 6, 2025

scott-routledge2 linked a pull request Jan 6, 2025 that will close this issue

[WIP] df.apply: add support for engine='bodo' #60622 (Draft)
