
PicklingError: Could not serialize object: TypeError: can't pickle _abc_data objects #5

Open
saichaitanyamolabanti opened this issue May 4, 2022 · 8 comments


@saichaitanyamolabanti

I wanted to try out this package because it implements a PySpark version of Shapley value generation.
So I copy-pasted the "simple.ipynb" notebook into my environment to verify that the basics work, but the code breaks at input cell [32]. Attached are the screenshots; could anyone please look into them?
(screenshots of the PicklingError traceback attached)
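For context, here is a minimal stdlib sketch of the general failure class behind errors like this (not the exact `_abc_data` case from the traceback): the standard `pickle` module serializes many objects by reference, i.e. by module and qualified name, so objects whose qualified name cannot be looked up again fail to pickle. Libraries like cloudpickle serialize such objects by value instead, which is why Spark ships its own copy.

```python
import pickle

# Standard pickle stores functions by reference: it records the module
# and qualified name and re-imports them on load. A lambda has the
# qualname "<lambda>", which cannot be looked up again, so dumping it
# fails -- the same failure class as the _abc_data internals above.
square = lambda x: x * x

try:
    pickle.dumps(square)
    serialized = True
except (pickle.PicklingError, AttributeError):
    serialized = False

print(serialized)  # False: plain pickle cannot serialize this object
```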

@saichaitanyamolabanti
Author

@ijoseph @kevinwang @variablenix @prasad-kamat please help

@ijoseph
Contributor

ijoseph commented May 9, 2022

Wow, we really should have pinned (and pip-compiled) our requirements file below. Let me see if I can get something working and try to update it.
https://github.com/Affirm/shparkley/blob/master/examples/requirements.txt

@ijoseph
Contributor

ijoseph commented May 9, 2022

Alright, @saichaitanyamolabanti, can you please pull PR #7, then run `pip install -r examples/macos-py3.10-requirements.txt` if you happen to have macOS and an empty Python 3.10 environment, or `pip install -r examples/requirements.in` otherwise? That particular set of third-party requirements worked for me.

@saichaitanyamolabanti
Author

Hey @ijoseph, I installed the libraries mentioned in your comments, mainly installing and importing cloudpickle. Here are my observations; I can still see some errors, please help!

Scenario 1:

```python
import cloudpickle
# import pyspark.serializers
# pyspark.serializers.cloudpickle = cloudpickle
```

Then `row = dataset.filter(dataset.xxxx == '5').rdd.first()` works fine.

Scenario 2:

```python
import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle
```

Then `row = dataset.filter(dataset.xxxx == '5').rdd.first()` throws the below error:

(error screenshot attached)

@saichaitanyamolabanti
Author

I then tried moving the imports of cloudpickle and pyspark.serializers below the row under investigation, like:

```python
row = dataset.filter(dataset.xxxx == '5').rdd.first()
import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle
```

but I still see an error: cloudpickle doesn't have the method 'print_exec'.
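If I read this failure correctly, the monkey-patch replaces PySpark's bundled cloudpickle, which in PySpark versions of that era exposed a `print_exec` helper used internally by `pyspark.serializers`, with the external cloudpickle package, which does not define it. That internals claim is an assumption on my part, but the defensive idea can be sketched with the stdlib `pickle` module as a stand-in, since it likewise lacks that helper: check for every attribute the consumer needs before swapping the module in.

```python
import pickle  # stand-in for the module you intend to swap in

# Attributes pyspark.serializers is assumed to need from its bundled
# cloudpickle; "print_exec" is the one the error above complains about.
required = ["dumps", "loads", "print_exec"]

missing = [name for name in required if not hasattr(pickle, name)]
print(missing)  # a non-empty list means the swap would break callers
```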

@saichaitanyamolabanti
Author

@ijoseph Or you can consider this scenario:
I tried the same simple.ipynb example after installing the cloudpickle library, importing pyspark.serializers, and pointing it at cloudpickle like below:

```python
import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle
```

I'm getting this error, please help!

(error screenshot attached)

@saichaitanyamolabanti
Author

@ijoseph @kevinwang @variablenix @prasad-kamat any help ?

@m-aciek

m-aciek commented Jul 5, 2022

Isn't it this issue? It looks like it was solved in pyspark 3.0.0 (PR). So maybe it would be enough to set a lower bound for the pyspark dependency in setup.py?

```python
REQUIRED_PACKAGES = [
    …,
    'pyspark>=3.0.0',
]
```
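If pinning in setup.py is indeed the fix, the example notebook could also fail fast with a clearer message than a pickling traceback. A minimal sketch; the helper name is hypothetical, and the 3.0.0 bound is taken from the linked PR rather than verified here:

```python
def pyspark_is_new_enough(installed: str, minimum=(3, 0, 0)) -> bool:
    """Compare a 'major.minor.patch' version string against a lower bound."""
    parts = tuple(int(p) for p in installed.split(".")[:3])
    return parts >= minimum

print(pyspark_is_new_enough("2.4.8"))  # False: predates the fix
print(pyspark_is_new_enough("3.3.0"))  # True
```

In practice the installed version would come from `pyspark.__version__`; pre-release suffixes (e.g. "3.0.0.dev0") would need extra parsing.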
