
[Data] Add read_clickhouse API to read ClickHouse Dataset #48817

Closed (wants to merge 123 commits)
Conversation

@jecsand838 (Contributor) commented Nov 20, 2024

Why are these changes needed?

Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which provides a convenient way to read data from ClickHouse into Ray Datasets. The ClickHouseDatasource is particularly useful for users who are working with large datasets stored in ClickHouse and want to leverage Ray's distributed computing capabilities for AI and ML use-cases. We found this functionality useful while evaluating ML technologies and wanted to contribute this back.

Key Features and Benefits:

  1. Seamless Integration: The ClickHouseDatasource allows for seamless integration of ClickHouse data into Ray workflows, enabling users to easily access their data and apply Ray's powerful parallel computation.
  2. Custom Query Support: Users can specify custom columns and orderings, allowing for flexible query generation directly from the Ray interface, which helps read only the necessary data and thereby improves performance.
  3. User-Friendly API: The connector abstracts the complexity of setting up and querying ClickHouse, providing a simple API that allows users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.
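To make the parallel-read idea above concrete, here is a hedged sketch (not the PR's actual code, and the function name `plan_read_tasks` is hypothetical) of one common way a datasource splits a single table read into per-task queries by windowing the base query with LIMIT/OFFSET:

```python
# Hypothetical sketch: split one ClickHouse read into N parallel task queries.
# Whether this PR uses LIMIT/OFFSET windows is an assumption for illustration.
def plan_read_tasks(base_query: str, total_rows: int, parallelism: int):
    chunk = -(-total_rows // parallelism)  # ceiling division
    tasks = []
    for i in range(parallelism):
        offset = i * chunk
        if offset >= total_rows:
            break  # fewer tasks than requested when rows run out
        tasks.append(f"{base_query} LIMIT {chunk} OFFSET {offset}")
    return tasks

for q in plan_read_tasks("SELECT * FROM default.events", 10, 4):
    print(q)
```

Each resulting query can then back one Ray read task, so blocks materialize in parallel across the cluster.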

[Screenshot: read_clickhouse output, captured Nov 20, 2024]

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 22 to 29
columns: Optional[List[str]] = None,
filters: Optional[Dict[str, Tuple[str, Any]]] = None,
order_by: Optional[Tuple[List[str], bool]] = None,
client_settings: Optional[Dict[str, Any]] = None,
client_kwargs: Optional[Dict[str, Any]] = None,
Contributor:

Let's make all optional args keyword-only (kwargs).

Contributor Author:

I attempted to address this in my latest commit.
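A minimal sketch of what the reviewer is asking for: placing a bare `*` in the signature makes everything after it keyword-only, so callers cannot pass the optional args positionally. The class body here is a stand-in, not the PR's implementation:

```python
from typing import Any, Dict, List, Optional, Tuple

class ClickHouseDatasource:
    def __init__(
        self,
        table: str,
        dsn: str,
        *,  # everything after this marker must be passed by keyword
        columns: Optional[List[str]] = None,
        order_by: Optional[Tuple[List[str], bool]] = None,
        client_settings: Optional[Dict[str, Any]] = None,
        client_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._table = table
        self._dsn = dsn
        self._columns = columns
        self._order_by = order_by
        self._client_settings = client_settings or {}
        self._client_kwargs = client_kwargs or {}
```

With this shape, `ClickHouseDatasource("default.t", dsn, ["a"])` raises a TypeError, while `columns=["a"]` works.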

Comment on lines 20 to 25
entity: str,
dsn: str,
Contributor:

Can you please give an example of the DSN?

Contributor Author:

I added a DSN example and left a link to relevant ClickHouse documentation in my latest commit.
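For readers of this thread, one common DSN shape is `scheme://user:password@host:port/database`; the exact scheme accepted depends on the ClickHouse client library, so treat the string below as illustrative only. Standard-library `urlparse` can pick it apart:

```python
# Illustrative DSN only; check the client library's docs for the exact format.
from urllib.parse import urlparse

dsn = "clickhouse+http://default:secret@localhost:8123/default"
parts = urlparse(dsn)
print(parts.hostname, parts.port, parts.path.lstrip("/"))
```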


def __init__(
self,
entity: str,
Contributor:

nit: I'd suggest we employ a more common term like table (and expand in the py-doc that this could also be a view of one)

Contributor Author:

I made that change in my latest commit.

Comment on lines 37 to 42
filters: Optional fields and values mapping to use to filter the data via
WHERE clause. The value should be a tuple where the first element is
one of ('is', 'not', 'less', 'greater') and the second
element is the value to filter by. The default operator
is 'is'. Only strings, ints, floats, booleans,
and None are allowed as values.
Contributor:

IIUC this requires the predicate in DNF format; let's call that out explicitly and add an example to help with understanding.

Also, let's add a link to the ClickHouse documentation page explaining these parameters in more detail.

Contributor Author:

I attempted to resolve this in my latest commit. I added an example, included a link to ClickHouse documentation, and went much deeper into details.

One item I wanted to call out is that filters currently only supports joining conditions with AND. My thinking was as follows:

  1. I'm assuming for the vast majority of use-cases that feature engineering work will be done in ClickHouse and the end user would simply want to bring in data from a view. I didn't see the need to build an extensive query builder without it being necessary.
  2. The main purpose of the filters was to offer the end user a way to reduce the data being transferred into Ray via a simple filtering mechanism.

I left myself a TODO to add support for filtering by datetime types in a future PR. Could I also add support for OR operators, along with a defined DNF format, in a future PR if necessary?
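The filter format under discussion can be sketched as a mapping of column to an `(operator, value)` tuple, with all conditions joined by AND; the helper name `build_where` is illustrative, not the PR's:

```python
# Sketch of the discussed filter format: {column: (operator, value)},
# combined only with AND. Operator names follow the quoted docstring above.
OPS = {"is": "=", "not": "!=", "less": "<", "greater": ">"}

def build_where(filters):
    conds = []
    for col, (op, val) in filters.items():
        if val is None:
            conds.append(f"{col} IS NULL" if op == "is" else f"{col} IS NOT NULL")
        elif isinstance(val, str):
            conds.append(f"{col} {OPS[op]} '{val}'")
        else:
            conds.append(f"{col} {OPS[op]} {val}")
    return " AND ".join(conds)

print(build_where({"status": ("is", "active"), "age": ("greater", 21)}))
# status = 'active' AND age > 21
```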

Contributor:

@jecsand838 yes, totally we can make this a follow-up.

There are a few things I want to call out though:

  • We're currently adding support for generic expressions (to enable future advanced optimizations powered by it) and therefore want to make sure that we consolidate all expression handling onto a single engine (@richardliaw is working on a PR as we speak)
  • In the meantime, we also need to make sure we're not flip-flopping on APIs back and forth between releases

As such, I'd recommend we extract filtering push-down into a separate PR (stacked on top of this one) so we can do one more iteration to consolidate the expression handling before we put it out for everyone to use.

Does that make sense?

Contributor Author:

@alexeykudinkin Makes complete sense, I'll take care of that.

Contributor Author:

@alexeykudinkin The filtering functionality has been extracted from this PR and placed here: jecsand838#1

f"Unsupported operator '{op}' for filter on '{column}'. "
f"Defaulting to 'is'"
)
op = "is"
Contributor:

Same as below

Contributor Author:

I attempted to resolve this using a ValueError in my latest commit.
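The change being described, failing fast on an unknown operator instead of silently defaulting to `'is'`, can be sketched like this (helper name is illustrative):

```python
# Sketch: reject unsupported operators with a ValueError rather than
# warning and falling back to 'is', per the review comment.
SUPPORTED_OPS = ("is", "not", "less", "greater")

def validate_op(column: str, op: str) -> str:
    if op not in SUPPORTED_OPS:
        raise ValueError(
            f"Unsupported operator {op!r} for filter on {column!r}; "
            f"expected one of {SUPPORTED_OPS}"
        )
    return op
```

Raising makes bad filter specs visible at query-construction time instead of producing a query the user did not intend.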

Comment on lines 98 to 113
if value is None:
    operator = validate_non_numeric_ops(key, operator)
    if operator == "is":
        filter_conditions.append(f"{key} IS NULL")
    elif operator == "not":
        filter_conditions.append(f"{key} IS NOT NULL")
elif isinstance(value, str):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(f"{key} {ops[operator]} '{value}'")
elif isinstance(value, bool):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(
        f"{key} {ops[operator]} {str(value).lower()}"
    )
elif isinstance(value, (int, float)):
    filter_conditions.append(f"{key} {ops[operator]} {value}")
Contributor:

Let's split up value conversion from filter composition to avoid duplication

Contributor Author:

I attempted to resolve this in my latest commit.
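The refactor the reviewer suggests, separating "render the value as a SQL literal" from "compose the condition," can be sketched as follows (function names are illustrative, not the PR's):

```python
# Sketch of the split: one helper renders a Python value as a SQL literal,
# so the condition builder has a single code path instead of per-type branches.
def to_sql_literal(value):
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return str(value).lower()
    if isinstance(value, str):
        return f"'{value}'"
    if isinstance(value, (int, float)):
        return str(value)
    raise TypeError(f"Unsupported filter value type: {type(value).__name__}")

def build_condition(column, sql_op, value):
    if value is None:
        return f"{column} IS NULL" if sql_op == "=" else f"{column} IS NOT NULL"
    return f"{column} {sql_op} {to_sql_literal(value)}"

print(build_condition("active", "=", True))   # active = true
print(build_condition("name", "!=", "bob"))   # name != 'bob'
```

Note the bool check must precede the int check, since `isinstance(True, int)` is true in Python.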

op = "is"
return op

ops = {"is": "=", "not": "!=", "less": "<", "greater": ">"}
Contributor:

Let's use Python operators so that we're not reinventing the wheel here

Contributor Author:

I attempted to resolve this in my latest commit.
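The suggestion is to accept the familiar Python comparison tokens (`==`, `!=`, `<`, `>`) directly instead of inventing names like `'less'` and `'greater'`; a minimal sketch (helper name is illustrative, and passing `==` through to ClickHouse is an assumption to verify against its SQL dialect):

```python
# Sketch: take Python-style comparison tokens as the user-facing operators.
SUPPORTED = {"==", "!=", "<", ">"}

def render(column, op, value):
    if op not in SUPPORTED:
        raise ValueError(f"Unsupported operator {op!r}; expected one of {sorted(SUPPORTED)}")
    literal = f"'{value}'" if isinstance(value, str) else str(value)
    return f"{column} {op} {literal}"

print(render("age", ">", 21))  # age > 21
```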

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Dec 2, 2024
@jecsand838 jecsand838 requested a review from a team as a code owner December 3, 2024 17:33
Comment on lines 15 to 20
ops = {
    "==": {"types": ["*"]},
    "!=": {"types": ["*"]},
    "<": {"types": [int, float]},
    ">": {"types": [int, float]},
}
Contributor:

nit: This could be a module level constant

Contributor Author:

I'll be sure to handle that in the follow-up PR if it's still needed.
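For reference, the nit amounts to hoisting the operator/type table out of the function into a module-level constant so it is built once; names below are illustrative:

```python
# Sketch: operator/type table as a module-level constant, built once.
_FILTER_OPS = {
    "==": {"types": ["*"]},        # equality works for any supported type
    "!=": {"types": ["*"]},
    "<": {"types": [int, float]},  # ordering only for numerics
    ">": {"types": [int, float]},
}

def is_valid(op, value):
    spec = _FILTER_OPS.get(op)
    if spec is None:
        return False
    return spec["types"] == ["*"] or type(value) in spec["types"]

print(is_valid("<", 3), is_valid("<", "a"))  # True False
```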

Comment on lines 139 to 143
self._columns = kwargs.get("columns")
self._filters = kwargs.get("filters")
self._order_by = kwargs.get("order_by")
self._client_settings = kwargs.get("client_settings")
self._client_kwargs = kwargs.get("client_kwargs")
Contributor:

Let's make all of these kwargs explicit and typed (adding to the func signature)

Contributor Author:

Added that in.
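The before/after of this request, replacing opaque `**kwargs` plus `kwargs.get(...)` with explicit typed keyword-only parameters, can be sketched as below; the stand-in body just echoes the resolved configuration rather than reading anything:

```python
# Sketch: explicit, typed keyword-only parameters in place of **kwargs,
# so the signature documents itself and type checkers can help callers.
from typing import Any, Dict, List, Optional, Tuple

def read_clickhouse(
    table: str,
    dsn: str,
    *,
    columns: Optional[List[str]] = None,
    order_by: Optional[Tuple[List[str], bool]] = None,
    client_settings: Optional[Dict[str, Any]] = None,
    client_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Stand-in body: return the resolved configuration for inspection.
    return {
        "table": table,
        "dsn": dsn,
        "columns": columns,
        "order_by": order_by,
        "client_settings": client_settings or {},
        "client_kwargs": client_kwargs or {},
    }
```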

Contributor @alexeykudinkin left a comment:

LGTM! Minor comments around the tests and we should be good-to-go!

@jecsand838 thank you very much for contributing this and patiently working through the review with us!

@@ -3249,6 +3250,77 @@ def read_lance(
)


@PublicAPI
Contributor:

Let's annotate it as @PublicAPI(stability="alpha") to make clear this isn't a stable API yet

Contributor Author:

Done
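For readers unfamiliar with the annotation: Ray's decorator records an API's stability level as metadata. A minimal stand-in (not Ray's actual implementation) showing the idea:

```python
# Stand-in for an API-stability annotation like @PublicAPI(stability="alpha");
# Ray's real decorator lives in ray.util.annotations. This sketch only
# attaches the stability level as an attribute.
def public_api(stability="stable"):
    def wrap(obj):
        obj._api_stability = stability
        return obj
    return wrap

@public_api(stability="alpha")
def read_clickhouse_stub():
    """Placeholder for the new reader."""

print(read_clickhouse_stub._api_stability)  # alpha
```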

(None, "SELECT * FROM default.table_name"),
],
)
def test_generate_query_columns(self, datasource, columns, expected_query_part):
Contributor:

Can we please also add a test generating the full query (so that we're certain the e2e flow works as expected)?

Contributor Author:

Added!
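A full-query test of the kind requested can be sketched as below; `generate_query` is an illustrative stand-in for the datasource's query builder, not the PR's actual helper:

```python
# Sketch of an end-to-end query-generation test: assemble SELECT and
# ORDER BY together, then compare against the complete expected string.
def generate_query(table, columns=None, order_by=None):
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    if order_by:
        keys, desc = order_by
        query += " ORDER BY " + ", ".join(keys) + (" DESC" if desc else " ASC")
    return query

def test_generate_full_query():
    got = generate_query(
        "default.table_name",
        columns=["id", "ts"],
        order_by=(["ts"], True),
    )
    assert got == "SELECT id, ts FROM default.table_name ORDER BY ts DESC"

test_generate_full_query()
print("ok")
```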

matthewdeng and others added 15 commits December 3, 2024 19:40
…al (ray-project#48811)

Current test is failing due to spot instance unavailability.

Converting this test to manual right now.

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
…ject#47896)

Closes: ray-project#47895

---------

Signed-off-by: Superskyyy <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
HPU resource is already supported in Ray, and there are many examples to
guide users to use HPU device in Ray, so this PR adds some instructions
for HPU device to the Ray Serve related documents.

---------

Signed-off-by: KepingYan <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
…ng down (ray-project#48808)

Each compiled graph starts a monitor thread to tear down the DAG upon
detecting an error in one of the workers' task loops. Currently, during
driver shutdown, this thread can live past the lifetime of the C++
CoreWorker. This causes a silent process exit when the thread later
tries to call on the CoreWorker but it has already been destructed. To
prevent this from happening, this fix joins the monitor thread *before*
destructing the CoreWorker.

## Related issue number

Closes ray-project#48288.

---------

Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->
Currently in serve.run the logging_config is not passed to the controller.
This PR adds the argument to the function call so the logging_config
can be correctly specified for system-level logging.

## Related issue number
Closes ray-project#48652
<!-- For example: "Closes ray-project#1234" -->

### Example
```
logging_config = {"log_level": "DEBUG", "logs_dir": "./mimi_debug"}
handle: DeploymentHandle = serve.run(app, logging_config=logging_config)
```

### Before
controller logs aren't saved in the specified logs_dir


### After
controller logs are correctly configured


Signed-off-by: Mimi Liao <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
A small change to use `absl::SimpleAtoi` to avoid integer casting throwing an
exception; also avoids a double map lookup and ignores all invalid values
(i.e. negative values).

Signed-off-by: dentiny <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Two benefits for the util macro:
- Better branch prediction, better performance
- Focus on happy path in code implementation

Signed-off-by: dentiny <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
jecsand838 (Contributor Author):

@alexeykudinkin I had to rebuild my local development environment and it really messed this PR up. I'm going to close this PR and start fresh with the latest state of the changes. My apologies for this!

jecsand838 (Contributor Author):

@alexeykudinkin #49060 is off a clean branch and the current state of the code addresses all of your last requests.

bveeramani pushed a commit that referenced this pull request Dec 12, 2024
Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which
provides a convenient way to read data from ClickHouse into Ray
Datasets. The ClickHouseDatasource is particularly useful for users who
are working with large datasets stored in ClickHouse and want to
leverage Ray's distributed computing capabilities for AI and ML
use-cases. We found this functionality useful while evaluating ML
technologies and wanted to contribute this back.

Key Features and Benefits:
1. **Seamless Integration**: The ClickHouseDatasource allows for
seamless integration of ClickHouse data into Ray workflows, enabling
users to easily access their data and apply Ray's powerful parallel
computation.
2. **Custom Query Support**: Users can specify custom columns, and
orderings, allowing for flexible query generation directly from the Ray
interface, which helps in reading only the necessary data, thereby
improving performance.
3. **User-Friendly API**: The connector abstracts the complexity of
setting up and querying ClickHouse, providing a simple API that allows
users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.


PLEASE NOTE: This PR is a continuation of
#48817, which was closed without
merging.

---------

Signed-off-by: Connor Sanders <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Dec 12, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
Labels: data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)