
[Data] Add read_clickhouse API to read ClickHouse Dataset #48817

Closed (wants to merge 123 commits)
Conversation

@jecsand838 (Contributor) commented Nov 20, 2024

Why are these changes needed?

Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which provides a convenient way to read data from ClickHouse into Ray Datasets. The ClickHouseDatasource is particularly useful for users who are working with large datasets stored in ClickHouse and want to leverage Ray's distributed computing capabilities for AI and ML use-cases. We found this functionality useful while evaluating ML technologies and wanted to contribute this back.

Key Features and Benefits:

  1. Seamless Integration: The ClickHouseDatasource allows for seamless integration of ClickHouse data into Ray workflows, enabling users to easily access their data and apply Ray's powerful parallel computation.
  2. Custom Query Support: Users can specify custom columns and orderings, allowing for flexible query generation directly from the Ray interface, which helps read only the necessary data and thereby improves performance.
  3. User-Friendly API: The connector abstracts the complexity of setting up and querying ClickHouse, providing a simple API that allows users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.
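To make the parallel-read idea above concrete, here is a hedged sketch (not the PR's actual code, and the function name `plan_read_tasks` is hypothetical) of one common way a datasource splits a single table read into per-task queries by windowing the base query with LIMIT/OFFSET:

```python
# Hypothetical sketch: split one ClickHouse read into N parallel task queries.
# Whether this PR uses LIMIT/OFFSET windows is an assumption for illustration.
def plan_read_tasks(base_query: str, total_rows: int, parallelism: int):
    chunk = -(-total_rows // parallelism)  # ceiling division
    tasks = []
    for i in range(parallelism):
        offset = i * chunk
        if offset >= total_rows:
            break  # fewer tasks than requested when rows run out
        tasks.append(f"{base_query} LIMIT {chunk} OFFSET {offset}")
    return tasks

for q in plan_read_tasks("SELECT * FROM default.events", 10, 4):
    print(q)
```

Each resulting query can then back one Ray read task, so blocks materialize in parallel across the cluster.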

[Screenshot: read_clickhouse output, captured Nov 20, 2024]

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 22 to 29
columns: Optional[List[str]] = None,
filters: Optional[Dict[str, Tuple[str, Any]]] = None,
order_by: Optional[Tuple[List[str], bool]] = None,
client_settings: Optional[Dict[str, Any]] = None,
client_kwargs: Optional[Dict[str, Any]] = None,
Contributor:

Let's make all optional args keyword-only (kwargs).

Contributor Author:

I attempted to address this in my latest commit.
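A minimal sketch of what the reviewer is asking for: placing a bare `*` in the signature makes everything after it keyword-only, so callers cannot pass the optional args positionally. The class body here is a stand-in, not the PR's implementation:

```python
from typing import Any, Dict, List, Optional, Tuple

class ClickHouseDatasource:
    def __init__(
        self,
        table: str,
        dsn: str,
        *,  # everything after this marker must be passed by keyword
        columns: Optional[List[str]] = None,
        order_by: Optional[Tuple[List[str], bool]] = None,
        client_settings: Optional[Dict[str, Any]] = None,
        client_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._table = table
        self._dsn = dsn
        self._columns = columns
        self._order_by = order_by
        self._client_settings = client_settings or {}
        self._client_kwargs = client_kwargs or {}
```

With this shape, `ClickHouseDatasource("default.t", dsn, ["a"])` raises a TypeError, while `columns=["a"]` works.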

Comment on lines 20 to 25
entity: str,
dsn: str,
Contributor:

Can you please give an example of the DSN?

Contributor Author:

I added a DSN example and left a link to relevant ClickHouse documentation in my latest commit.
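For readers of this thread, one common DSN shape is `scheme://user:password@host:port/database`; the exact scheme accepted depends on the ClickHouse client library, so treat the string below as illustrative only. Standard-library `urlparse` can pick it apart:

```python
# Illustrative DSN only; check the client library's docs for the exact format.
from urllib.parse import urlparse

dsn = "clickhouse+http://default:secret@localhost:8123/default"
parts = urlparse(dsn)
print(parts.hostname, parts.port, parts.path.lstrip("/"))
```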


def __init__(
self,
entity: str,
Contributor:

nit: I'd suggest we employ a more common term like table (and expand in the py-doc that this could also be a view of one)

Contributor Author:

I made that change in my latest commit.

Comment on lines 37 to 42
filters: Optional fields and values mapping to use to filter the data via
WHERE clause. The value should be a tuple where the first element is
one of ('is', 'not', 'less', 'greater') and the second
element is the value to filter by. The default operator
is 'is'. Only strings, ints, floats, booleans,
and None are allowed as values.
Contributor:

IIUC this requires the predicate in DNF format; let's call that out explicitly and add an example to help with understanding.

Also, let's add a link to the ClickHouse documentation page explaining these parameters in more detail.

Contributor Author:

I attempted to resolve this in my latest commit. I added an example, included a link to ClickHouse documentation, and went much deeper into details.

One item I wanted to call out is that filters currently only supports joining conditions with AND. My thinking was as follows:

  1. I'm assuming for the vast majority of use-cases that feature engineering work will be done in ClickHouse and the end user would simply want to bring in data from a view. I didn't see the need to build an extensive query builder without it being necessary.
  2. The main purpose of the filters was to offer the end user a way to reduce the data being transferred into Ray via a simple filtering mechanism.

I left myself a TODO to add support for filtering by datetime types in a future PR. Could I also add support for OR operators, along with a defined DNF format, in a future PR if necessary?
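The filter format under discussion can be sketched as a mapping of column to an `(operator, value)` tuple, with all conditions joined by AND; the helper name `build_where` is illustrative, not the PR's:

```python
# Sketch of the discussed filter format: {column: (operator, value)},
# combined only with AND. Operator names follow the quoted docstring above.
OPS = {"is": "=", "not": "!=", "less": "<", "greater": ">"}

def build_where(filters):
    conds = []
    for col, (op, val) in filters.items():
        if val is None:
            conds.append(f"{col} IS NULL" if op == "is" else f"{col} IS NOT NULL")
        elif isinstance(val, str):
            conds.append(f"{col} {OPS[op]} '{val}'")
        else:
            conds.append(f"{col} {OPS[op]} {val}")
    return " AND ".join(conds)

print(build_where({"status": ("is", "active"), "age": ("greater", 21)}))
# status = 'active' AND age > 21
```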

Contributor:

@jecsand838 yes, totally we can make this a follow-up.

There are a few things I want to call out though:

  • We're currently adding support for generic expressions (to enable future advanced optimizations powered by it) and therefore want to make sure that we consolidate all expression handling onto a single engine (@richardliaw is working on a PR as we speak)
  • In the meantime, we also need to make sure we're not flip-flopping on APIs back and forth between releases

As such, I'd recommend we extract filtering push-down into a separate PR (stacked on top of this one) so we can do one more iteration to consolidate the expression handling before we put it out for everyone to use.

Does that make sense?

Contributor Author:

@alexeykudinkin Makes complete sense, I'll take care of that.

Contributor Author:

@alexeykudinkin The filtering functionality has been extracted from this PR and placed here: jecsand838#1

f"Unsupported operator '{op}' for filter on '{column}'. "
f"Defaulting to 'is'"
)
op = "is"
Contributor:

Same as below

Contributor Author:

I attempted to resolve this using a ValueError in my latest commit.
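The change being described, failing fast on an unknown operator instead of silently defaulting to `'is'`, can be sketched like this (helper name is illustrative):

```python
# Sketch: reject unsupported operators with a ValueError rather than
# warning and falling back to 'is', per the review comment.
SUPPORTED_OPS = ("is", "not", "less", "greater")

def validate_op(column: str, op: str) -> str:
    if op not in SUPPORTED_OPS:
        raise ValueError(
            f"Unsupported operator {op!r} for filter on {column!r}; "
            f"expected one of {SUPPORTED_OPS}"
        )
    return op
```

Raising makes bad filter specs visible at query-construction time instead of producing a query the user did not intend.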

Comment on lines 98 to 113
if value is None:
    operator = validate_non_numeric_ops(key, operator)
    if operator == "is":
        filter_conditions.append(f"{key} IS NULL")
    elif operator == "not":
        filter_conditions.append(f"{key} IS NOT NULL")
elif isinstance(value, str):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(f"{key} {ops[operator]} '{value}'")
elif isinstance(value, bool):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(
        f"{key} {ops[operator]} {str(value).lower()}"
    )
elif isinstance(value, (int, float)):
    filter_conditions.append(f"{key} {ops[operator]} {value}")
Contributor:

Let's split up value conversion from filter composition to avoid duplication

Contributor Author:

I attempted to resolve this in my latest commit.
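The refactor the reviewer suggests, separating "render the value as a SQL literal" from "compose the condition," can be sketched as follows (function names are illustrative, not the PR's):

```python
# Sketch of the split: one helper renders a Python value as a SQL literal,
# so the condition builder has a single code path instead of per-type branches.
def to_sql_literal(value):
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return str(value).lower()
    if isinstance(value, str):
        return f"'{value}'"
    if isinstance(value, (int, float)):
        return str(value)
    raise TypeError(f"Unsupported filter value type: {type(value).__name__}")

def build_condition(column, sql_op, value):
    if value is None:
        return f"{column} IS NULL" if sql_op == "=" else f"{column} IS NOT NULL"
    return f"{column} {sql_op} {to_sql_literal(value)}"

print(build_condition("active", "=", True))   # active = true
print(build_condition("name", "!=", "bob"))   # name != 'bob'
```

Note the bool check must precede the int check, since `isinstance(True, int)` is true in Python.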

op = "is"
return op

ops = {"is": "=", "not": "!=", "less": "<", "greater": ">"}
Contributor:

Let's use Python operators so that we're not reinventing the wheel here

Contributor Author:

I attempted to resolve this in my latest commit.
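The suggestion is to accept the familiar Python comparison tokens (`==`, `!=`, `<`, `>`) directly instead of inventing names like `'less'` and `'greater'`; a minimal sketch (helper name is illustrative, and passing `==` through to ClickHouse is an assumption to verify against its SQL dialect):

```python
# Sketch: take Python-style comparison tokens as the user-facing operators.
SUPPORTED = {"==", "!=", "<", ">"}

def render(column, op, value):
    if op not in SUPPORTED:
        raise ValueError(f"Unsupported operator {op!r}; expected one of {sorted(SUPPORTED)}")
    literal = f"'{value}'" if isinstance(value, str) else str(value)
    return f"{column} {op} {literal}"

print(render("age", ">", 21))  # age > 21
```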

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Dec 2, 2024
@jecsand838 jecsand838 requested a review from a team as a code owner December 3, 2024 17:33
Comment on lines 15 to 20
ops = {
    "==": {"types": ["*"]},
    "!=": {"types": ["*"]},
    "<": {"types": [int, float]},
    ">": {"types": [int, float]},
}
Contributor:

nit: This could be a module level constant

Contributor Author:

I'll be sure to handle that in the follow-up PR if it's still needed.
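For reference, the nit amounts to hoisting the operator/type table out of the function into a module-level constant so it is built once; names below are illustrative:

```python
# Sketch: operator/type table as a module-level constant, built once.
_FILTER_OPS = {
    "==": {"types": ["*"]},        # equality works for any supported type
    "!=": {"types": ["*"]},
    "<": {"types": [int, float]},  # ordering only for numerics
    ">": {"types": [int, float]},
}

def is_valid(op, value):
    spec = _FILTER_OPS.get(op)
    if spec is None:
        return False
    return spec["types"] == ["*"] or type(value) in spec["types"]

print(is_valid("<", 3), is_valid("<", "a"))  # True False
```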

Comment on lines 139 to 143
self._columns = kwargs.get("columns")
self._filters = kwargs.get("filters")
self._order_by = kwargs.get("order_by")
self._client_settings = kwargs.get("client_settings")
self._client_kwargs = kwargs.get("client_kwargs")
Contributor:

Let's make all of these kwargs explicit and typed (adding to the func signature)

Contributor Author:

Added that in.
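The before/after of this request, replacing opaque `**kwargs` plus `kwargs.get(...)` with explicit typed keyword-only parameters, can be sketched as below; the stand-in body just echoes the resolved configuration rather than reading anything:

```python
# Sketch: explicit, typed keyword-only parameters in place of **kwargs,
# so the signature documents itself and type checkers can help callers.
from typing import Any, Dict, List, Optional, Tuple

def read_clickhouse(
    table: str,
    dsn: str,
    *,
    columns: Optional[List[str]] = None,
    order_by: Optional[Tuple[List[str], bool]] = None,
    client_settings: Optional[Dict[str, Any]] = None,
    client_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Stand-in body: return the resolved configuration for inspection.
    return {
        "table": table,
        "dsn": dsn,
        "columns": columns,
        "order_by": order_by,
        "client_settings": client_settings or {},
        "client_kwargs": client_kwargs or {},
    }
```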

Contributor @alexeykudinkin left a comment:

LGTM! Minor comments around the tests and we should be good-to-go!

@jecsand838 thank you very much for contributing this and patiently working through the review with us!

@@ -3249,6 +3250,77 @@ def read_lance(
)


@PublicAPI
Contributor:

Let's annotate it as @PublicAPI(stability="alpha") to make clear this isn't a stable API yet

Contributor Author:

Done
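For readers unfamiliar with the annotation: Ray's decorator records an API's stability level as metadata. A minimal stand-in (not Ray's actual implementation) showing the idea:

```python
# Stand-in for an API-stability annotation like @PublicAPI(stability="alpha");
# Ray's real decorator lives in ray.util.annotations. This sketch only
# attaches the stability level as an attribute.
def public_api(stability="stable"):
    def wrap(obj):
        obj._api_stability = stability
        return obj
    return wrap

@public_api(stability="alpha")
def read_clickhouse_stub():
    """Placeholder for the new reader."""

print(read_clickhouse_stub._api_stability)  # alpha
```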

(None, "SELECT * FROM default.table_name"),
],
)
def test_generate_query_columns(self, datasource, columns, expected_query_part):
Contributor:

Can we please also add a test generating the full query (so that we're certain the e2e flow works as expected)?

Contributor Author:

Added!
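A full-query test of the kind requested can be sketched as below; `generate_query` is an illustrative stand-in for the datasource's query builder, not the PR's actual helper:

```python
# Sketch of an end-to-end query-generation test: assemble SELECT and
# ORDER BY together, then compare against the complete expected string.
def generate_query(table, columns=None, order_by=None):
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    if order_by:
        keys, desc = order_by
        query += " ORDER BY " + ", ".join(keys) + (" DESC" if desc else " ASC")
    return query

def test_generate_full_query():
    got = generate_query(
        "default.table_name",
        columns=["id", "ts"],
        order_by=(["ts"], True),
    )
    assert got == "SELECT id, ts FROM default.table_name ORDER BY ts DESC"

test_generate_full_query()
print("ok")
```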

matthewdeng and others added 15 commits December 3, 2024 19:40
…al (ray-project#48811)

Current test is failing due to spot instance unavailability.

Converting this test to manual right now.

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
…ject#47896)

Closes: ray-project#47895

---------

Signed-off-by: Superskyyy <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
HPU resource is already supported in Ray, and there are many examples to
guide users to use HPU device in Ray, so this PR adds some instructions
for HPU device to the Ray Serve related documents.

---------

Signed-off-by: KepingYan <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
…ng down (ray-project#48808)

Each compiled graph starts a monitor thread to tear down the DAG upon
detecting an error in one of the workers' task loops. Currently, during
driver shutdown, this thread can live past the lifetime of the C++
CoreWorker. This causes a silent process exit when the thread later
tries to call on the CoreWorker but it has already been destructed. To
prevent this from happening, this fix joins the monitor thread *before*
destructing the CoreWorker.

## Related issue number

Closes ray-project#48288.

---------

Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->
Currently in serve.run the logging_config is not passed to the controller.
This PR adds the argument to the function call so the logging_config
can be correctly specified for system-level logging.

## Related issue number
Closes ray-project#48652
<!-- For example: "Closes ray-project#1234" -->

### Example
```
logging_config = {"log_level": "DEBUG", "logs_dir": "./mimi_debug"}
handle: DeploymentHandle = serve.run(app, logging_config=logging_config)
```

### Before
controller logs aren't saved in the specified logs_dir


### After
controller logs are correctly configured


Signed-off-by: Mimi Liao <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
A small change to use `absl::SimpleAtoi` to avoid integer casting throwing an
exception; also avoids a double map lookup and ignores all invalid values
(i.e. negative values).

Signed-off-by: dentiny <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Two benefits for the util macro:
- Better branch prediction, better performance
- Focus on happy path in code implementation

Signed-off-by: dentiny <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
jecsand838 (Contributor Author):

@alexeykudinkin I had to rebuild my local development environment and it really messed this PR up. I'm going to close this PR and start fresh with the latest state of the changes. My apologies for this!

jecsand838 (Contributor Author):

@alexeykudinkin #49060 is off a clean branch and the current state of the code addresses all of your last requests.

bveeramani pushed a commit that referenced this pull request Dec 12, 2024
Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which
provides a convenient way to read data from ClickHouse into Ray
Datasets. The ClickHouseDatasource is particularly useful for users who
are working with large datasets stored in ClickHouse and want to
leverage Ray's distributed computing capabilities for AI and ML
use-cases. We found this functionality useful while evaluating ML
technologies and wanted to contribute this back.

Key Features and Benefits:
1. **Seamless Integration**: The ClickHouseDatasource allows for
seamless integration of ClickHouse data into Ray workflows, enabling
users to easily access their data and apply Ray's powerful parallel
computation.
2. **Custom Query Support**: Users can specify custom columns, and
orderings, allowing for flexible query generation directly from the Ray
interface, which helps in reading only the necessary data, thereby
improving performance.
3. **User-Friendly API**: The connector abstracts the complexity of
setting up and querying ClickHouse, providing a simple API that allows
users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.


PLEASE NOTE: This PR is a continuation of
#48817, which was closed without
merging.

---------

Signed-off-by: Connor Sanders <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Dec 12, 2024
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
Labels: data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)