[Data] Add read_clickhouse API to read ClickHouse Dataset #48817
Conversation
columns: Optional[List[str]] = None,
filters: Optional[Dict[str, Tuple[str, Any]]] = None,
order_by: Optional[Tuple[List[str], bool]] = None,
client_settings: Optional[Dict[str, Any]] = None,
client_kwargs: Optional[Dict[str, Any]] = None,
Let's make all optional args kwargs
I attempted to address this in my latest commit.
entity: str,
dsn: str,
Can you please give an example of the DSN?
I added a DSN example and left a link to relevant ClickHouse documentation in my latest commit.
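For readers following along, a clickhouse-connect DSN generally takes a shape like the following; the scheme, credentials, host, port, and database here are illustrative placeholders, not values from this PR:

```python
# Illustrative only: HTTP-style ClickHouse DSN with placeholder credentials.
dsn = "clickhouse+http://user:password@localhost:8123/default"
```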
def __init__(
    self,
    entity: str,
nit: I'd suggest we employ a more common term like table (and in the py-doc expand that this could also be a view of one)
I made that change in my latest commit.
filters: Optional fields and values mapping to use to filter the data via
    WHERE clause. The value should be a tuple where the first element is
    one of ('is', 'not', 'less', 'greater') and the second element is the
    value to filter by. The default operator is 'is'. Only strings, ints,
    floats, booleans, and None are allowed as values.
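For illustration, a filters mapping under that contract could look like this (column names and values are made up):

```python
# Hypothetical columns and values, shown only to illustrate the documented tuple format.
filters = {
    "country": ("is", "US"),        # WHERE country = 'US'
    "status": ("not", None),        # AND status IS NOT NULL
    "latency_ms": ("less", 250),    # AND latency_ms < 250
}
```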
IIUC this requires the predicate in DNF format; let's call that out explicitly and add an example to help with understanding.
Also, let's add a link to the ClickHouse documentation page explaining these parameters in more detail.
I attempted to resolve this in my latest commit. I added an example, included a link to ClickHouse documentation, and went much deeper into the details.
One item I wanted to call out: filters currently only supports joining conditions with AND operators. My thinking was as follows:
- I'm assuming that for the vast majority of use-cases, the feature engineering work will be done in ClickHouse and the end user will simply want to bring in data from a view. I didn't see the need to build an extensive query builder without it being necessary.
- The main purpose of filters is to offer the end user a simple filtering mechanism to reduce the data being transferred into Ray.
I did leave myself a TODO to add support for filtering by datetime types in a future PR. Could I also add support for OR operators, along with defining a DNF format, in a future PR if necessary?
@jecsand838 yes, totally, we can make this a follow-up.
There are a few things I want to call out though:
- We're currently adding support for generic expressions (to enable future advanced optimizations powered by them) and therefore want to make sure that we consolidate all expression handling onto a single engine (@richardliaw is working on a PR as we speak)
- In the meantime, we also need to make sure we're not flip-flopping on APIs back and forth between releases
As such, I'd recommend we extract the filtering push-down into a separate PR (stacked on top of this one) so we can do one more iteration to consolidate the expression handling before we put it out for everyone to use.
Does that make sense?
@alexeykudinkin Makes complete sense, I'll take care of that.
@alexeykudinkin The filtering functionality has been extracted from this PR and placed here: jecsand838#1
f"Unsupported operator '{op}' for filter on '{column}'. " | ||
f"Defaulting to 'is'" | ||
) | ||
op = "is" |
Same as below
I attempted to resolve this using a ValueError in my latest commit.
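A minimal sketch of what that stricter validation could look like (the helper name mirrors the snippet below; the actual commit may differ):

```python
def validate_non_numeric_ops(column: str, op: str) -> str:
    # Raise instead of silently falling back to "is" for unsupported operators.
    if op not in ("is", "not"):
        raise ValueError(
            f"Unsupported operator '{op}' for filter on '{column}'. "
            f"Expected one of ('is', 'not')."
        )
    return op
```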
if value is None:
    operator = validate_non_numeric_ops(key, operator)
    if operator == "is":
        filter_conditions.append(f"{key} IS NULL")
    elif operator == "not":
        filter_conditions.append(f"{key} IS NOT NULL")
elif isinstance(value, str):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(f"{key} {ops[operator]} '{value}'")
elif isinstance(value, bool):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(
        f"{key} {ops[operator]} {str(value).lower()}"
    )
elif isinstance(value, (int, float)):
    filter_conditions.append(f"{key} {ops[operator]} {value}")
Let's split up value conversion from filter composition to avoid duplication
I attempted to resolve this in my latest commit.
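One possible shape for that split, as a sketch (the helper names are made up for illustration; `ops` is the operator mapping discussed below):

```python
def _render_value(value) -> str:
    # Single place that converts a Python literal to its SQL text form.
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check before int/float: bool subclasses int
        return str(value).lower()
    if isinstance(value, str):
        return f"'{value}'"
    return str(value)


def _render_condition(key: str, operator: str, value, ops: dict) -> str:
    # Compose the WHERE-clause fragment from the rendered value.
    if value is None:
        return f"{key} IS NULL" if operator == "is" else f"{key} IS NOT NULL"
    return f"{key} {ops[operator]} {_render_value(value)}"
```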
op = "is" | ||
return op | ||
|
||
ops = {"is": "=", "not": "!=", "less": "<", "greater": ">"} |
Let's use Python operators so that we're not reinventing the wheel here
I attempted to resolve this in my latest commit.
ops = {
    "==": {"types": ["*"]},
    "!=": {"types": ["*"]},
    "<": {"types": [int, float]},
    ">": {"types": [int, float]},
}
nit: This could be a module-level constant
I'll be sure to handle that in the follow-up PR if it's still needed.
self._columns = kwargs.get("columns")
self._filters = kwargs.get("filters")
self._order_by = kwargs.get("order_by")
self._client_settings = kwargs.get("client_settings")
self._client_kwargs = kwargs.get("client_kwargs")
Let's make all of these kwargs explicit and typed (adding them to the func signature)
Added that in.
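For reference, an explicit, typed version of that signature might look roughly like this (a sketch based on the parameters shown earlier in this review, with the entity-to-table rename applied):

```python
from typing import Any, Dict, List, Optional, Tuple


class ClickHouseDatasource:
    def __init__(
        self,
        table: str,
        dsn: str,
        *,
        columns: Optional[List[str]] = None,
        filters: Optional[Dict[str, Tuple[str, Any]]] = None,
        order_by: Optional[Tuple[List[str], bool]] = None,
        client_settings: Optional[Dict[str, Any]] = None,
        client_kwargs: Optional[Dict[str, Any]] = None,
    ):
        # Keyword-only, typed parameters replace the kwargs.get(...) lookups.
        self._table = table
        self._dsn = dsn
        self._columns = columns
        self._filters = filters
        self._order_by = order_by
        self._client_settings = client_settings
        self._client_kwargs = client_kwargs
```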
LGTM! Minor comments around the tests and we should be good to go!
@jecsand838 thank you very much for contributing this and patiently working through the review with us!
python/ray/data/read_api.py (outdated)

@@ -3249,6 +3250,77 @@ def read_lance(
    )


@PublicAPI
Let's annotate as @PublicAPI(stability="alpha") to make it clear this isn't a stable API yet
Done
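The resulting annotation, sketched with parameters elided (the import path is Ray's standard annotations module; the shown parameters are assumptions):

```python
from ray.util.annotations import PublicAPI


@PublicAPI(stability="alpha")
def read_clickhouse(table: str, dsn: str) -> "Dataset":
    ...
```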
(None, "SELECT * FROM default.table_name"), | ||
], | ||
) | ||
def test_generate_query_columns(self, datasource, columns, expected_query_part): |
Can we please also add a test generating the full query (so that we're certain the e2e flow works as expected)
Added!
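A full-query test in the spirit of the request might look like this sketch (the method and fixture names follow the partial test above but are assumptions about the final code):

```python
def test_generate_full_query(self, datasource):
    # Hypothetical end-to-end check: columns and ordering composed into one query.
    datasource._columns = ["field1", "field2"]
    datasource._order_by = (["field1"], False)
    query = datasource._generate_query()
    assert query.startswith("SELECT field1, field2 FROM default.table_name")
    assert "ORDER BY field1" in query
```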
@alexeykudinkin I had to rebuild my local development environment and it really messed this up. I'm going to close this PR and start fresh with the latest state of the changes. My apologies about this!
@alexeykudinkin #49060 is off a clean branch and the current state of the code addresses all of your last requests.
Why are these changes needed?

Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which provides a convenient way to read data from ClickHouse into Ray Datasets. The ClickHouseDatasource is particularly useful for users who are working with large datasets stored in ClickHouse and want to leverage Ray's distributed computing capabilities for AI and ML use-cases. We found this functionality useful while evaluating ML technologies and wanted to contribute this back.

Key Features and Benefits:
1. **Seamless Integration**: The ClickHouseDatasource allows for seamless integration of ClickHouse data into Ray workflows, enabling users to easily access their data and apply Ray's powerful parallel computation.
2. **Custom Query Support**: Users can specify custom columns and orderings, allowing for flexible query generation directly from the Ray interface, which helps in reading only the necessary data, thereby improving performance.
3. **User-Friendly API**: The connector abstracts the complexity of setting up and querying ClickHouse, providing a simple API that allows users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.

Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
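As a closing illustration, the call shape discussed in this review looks roughly like the following (the table name, DSN, and columns are placeholders; see python/ray/data/read_api.py for the merged signature):

```python
import ray

# Placeholder table, DSN, and columns; the call shape follows this review.
ds = ray.data.read_clickhouse(
    table="default.events",
    dsn="clickhouse+http://user:password@localhost:8123/default",
    columns=["event_id", "ts"],
)
print(ds.schema())
```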