[data] webdataset - expand JSON objects into individual samples #48673
0159ba0 to 74c3e17
@@ -342,6 +344,8 @@ def _read_stream(self, stream: "pyarrow.NativeFile", path: str):
        Yields:
            List[Dict[str, Any]]: List of sample (list of length 1).
        """
        import json
I think you can just import at top-level
ok
Missing tests - @Jay-ju could you please add a test?
bbcd0ba to 042df0b
@@ -362,4 +362,17 @@ def get_tar_file_iterator():
    for sample in samples:
        if self.decoder is not None:
            sample = _apply_list(self.decoder, sample, default=_default_decoder)
        if self.expand_json:
Will this always have a `json` field?
Yes. For example, a sample of the JSON column in WebDataset looks like this:
{
  "file_name": "00001.jpg",
  "height": 50
}
Ultimately, it will be expanded into something like this:
| file_name | height | json |
|---|---|---|
| 00001.jpg | 50 | {"file_name": "00001.jpg","height":50} |
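The expansion described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `expand_json_sample` is a hypothetical helper name, and the handling of `bytes`/`str` input via `json.loads` is an assumption, since the diff only shows the `dict` branch and the error branch.

```python
import json
from typing import Any, Dict


def expand_json_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: copy the keys of sample["json"] to the top level."""
    raw = sample["json"]
    # Assumption: the field may arrive undecoded (bytes/str) or already
    # decoded into a dict, depending on whether a decoder ran earlier.
    if isinstance(raw, (bytes, str)):
        parsed = json.loads(raw)
    elif isinstance(raw, dict):
        parsed = raw
    else:
        raise TypeError("Unsupported data type for sample['json']")

    # Keep the original columns (including "json") and add one column
    # per key of the parsed object, matching the table above.
    expanded = dict(sample)
    expanded.update(parsed)
    return expanded


row = expand_json_sample({"json": b'{"file_name": "00001.jpg", "height": 50}'})
```

Here `row` keeps the original `json` value and gains `file_name` and `height` as separate keys. Note that in this sketch, keys from the JSON object overwrite same-named keys already present in the sample.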
    elif isinstance(sample["json"], dict):
        parsed_json = sample["json"]
    else:
        raise TypeError("Unsupported data type for sample['json']")
Can you also print the type you got, and the object?
OK
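One way the requested error message could look is sketched below. This is only an illustration of the reviewer's suggestion: `unsupported_json_error` is a hypothetical helper, and the exact wording used in the merged code may differ.

```python
from typing import Any


def unsupported_json_error(value: Any) -> TypeError:
    """Hypothetical helper: build an error naming the offending type and value."""
    return TypeError(
        "Unsupported data type for sample['json']: "
        f"got {type(value).__name__}, value {value!r}"
    )


err = unsupported_json_error(42)
```

Including both the type name and the `repr` of the object makes it easier to tell whether the field was never decoded or was decoded into something unexpected. For very large objects, truncating the `repr` may be worth considering.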
f863a14 to bcc0409
Are the tests failing?
737d3c1 to 7211a6f
Signed-off-by: jukejian <[email protected]>
f68d0e8 to 77b7a6e
The failing test cases have been fixed. Could you please help merge this?
…project#48673) Signed-off-by: jukejian <[email protected]> Co-authored-by: srinathk10 <[email protected]>
…project#48673) Signed-off-by: jukejian <[email protected]> Co-authored-by: srinathk10 <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>