Update to use pandas 2.x #838
# Conflicts:
# conda-environments/activitysim-dev.yml
# conda-environments/github-actions-tests.yml
I've made these updates and all the regular CI tests pass (i.e. the results look correct), but I have discovered that the change to pandas 2.x incurs a significant runtime penalty when running without sharrow.
Non-sharrow test timings for pandas 1.x:
Non-sharrow test timings for pandas 2.x:
It will require some research to figure out why this is happening, and whether it can be solved relatively easily... or at all. Initial profiling suggests the problem is in
@@ -236,6 +236,8 @@ def vehicle_allocation(
logger.info("Running for occupancy = %d", occup)
# setting occup for access in spec expressions
locals_dict.update({"occup": occup})
if model_settings.sharrow_skip:
    locals_dict["disable_sharrow"] = True
My memory might be hazy here. Why are we allowing vehicle allocation to opt out of sharrow?
    t = pa.Table.from_pandas(df, preserve_index=True, columns=columns)
except (pa.ArrowTypeError, pa.ArrowInvalid):
    # if there are object columns, try to convert them to categories
    df = df.copy()
I saw your latest comment about significantly longer run time with this PR. I noticed you are calling copy() here. In pandas 2.0 copy() defaults to a deep copy. I wonder if this contributed to the run time?
I don't think this is causing the problem. This code only executes in the write_tables step at the end of the model run.
Addresses #794.
The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:
- `Index` objects are all one class with different datatypes, instead of being different classes (e.g. there is no more `Int64Index` class).
- The `read_csv` function by default now interprets "None" as a missing value (i.e. NaN) instead of keeping it as the string "None".
- The `groupby` operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in a different order of rows in outputs for some operations).
- `df.join()` also potentially sorts the resulting rows differently unless an explicit `sort` argument is given.
- `Index` objects can no longer be checked with `is_monotonic`; use `is_monotonic_increasing` instead.
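For anyone adapting downstream code, here is a minimal sketch of how each of the changes listed above might be handled. The data and variable names are made up for illustration and are not taken from the ActivitySim code base; only standard pandas calls are used, and the exact row order produced with `sort=False` may depend on the pandas version.

```python
import io

import pandas as pd

# 1. Index classes: check the dtype rather than isinstance(idx, pd.Int64Index).
idx = pd.Index([1, 2, 3], dtype="int64")
is_int_index = pd.api.types.is_integer_dtype(idx.dtype)

# 2. read_csv: keep the literal text "None" by turning off the default
#    missing-value markers and listing only the empty string as missing.
csv_text = "name,mode\na,None\nb,WALK\n"  # hypothetical input
literal_read = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

# 3. groupby on categorical data: state the sort behaviour explicitly
#    instead of relying on the version-dependent default.
trips = pd.DataFrame(
    {
        "mode": pd.Categorical(["b", "a", "b"], categories=["a", "b"]),
        "n": [1, 2, 3],
    }
)
sorted_sums = trips.groupby("mode", sort=True, observed=True)["n"].sum()
unsorted_sums = trips.groupby("mode", sort=False, observed=True)["n"].sum()

# 4. join: pass sort explicitly so the row order is a deliberate choice.
left = pd.DataFrame({"x": [10, 20]}, index=[2, 1])
right = pd.DataFrame({"y": [1.0, 2.0]}, index=[1, 2])
joined = left.join(right, how="left", sort=False)

# 5. Monotonicity: is_monotonic is gone in 2.x; this attribute exists
#    in both 1.x and 2.x.
is_sorted = idx.is_monotonic_increasing
```

Spelling out `sort=`, `observed=`, and the missing-value markers keeps results the same regardless of which pandas defaults are in effect, which seems safer than depending on either version's behaviour.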