Update/kafka implementations [WIP] #3683
Conversation
…dynamic kafka producer
…luent-kafka package
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (1)
src/workflows/airqo_etl_utils/message_broker_utils.py (1)
Line range hint 75-101: Update Producer initialization and message sending for confluent_kafka

In the __send_data method, the Producer initialization and the message-sending logic use kafka-python syntax, not confluent_kafka. This can lead to runtime errors. Specific issues:

- The Producer from confluent_kafka should be initialized with a configuration dictionary.
- The producer.send() method and add_callback/add_errback are not available in confluent_kafka. Instead, use producer.produce() with the callback parameter.
- Call producer.flush() after sending messages to ensure they're delivered.

Apply this diff to correct the Producer usage:

     data.to_csv("message_broker_data.csv", index=False)

-    producer = Producer(
-        bootstrap_servers=self.__bootstrap_servers,
-        api_version_auto_timeout_ms=300000,
-        retries=5,
-        request_timeout_ms=300000,
-    )
+    producer = Producer(self.conf)

     print("Dataframe info : ")
     print(data.info())
     print("Dataframe description : ")
     print(data.describe())

     chunks = int(len(data) / 50)
     chunks = chunks if chunks > 0 else 1
     dataframes = np.array_split(data, chunks)
     current_partition = -1
     for dataframe in dataframes:
         dataframe = pd.DataFrame(dataframe).replace(np.nan, None)
         message = {"data": dataframe.to_dict("records")}
         current_partition = (
             partition
             if partition or partition == 0
             else self.__get_partition(current_partition=current_partition)
         )
         kafka_message = json.dumps(message, allow_nan=True).encode("utf-8")
-        producer.send(
+        producer.produce(
             topic=topic,
             value=kafka_message,
             partition=current_partition,
-        ).add_callback(self.__on_success).add_errback(self.__on_error)
+            callback=self.__on_delivery,
+        )
+
+    producer.flush()

This ensures the Producer is properly initialized and messages are sent correctly using confluent_kafka.
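To make the initialization point concrete, here is a minimal sketch of the configuration-dictionary style that confluent_kafka expects, contrasted with the kafka-python keyword style it replaces. The broker address, topic name, and settings below are illustrative placeholders, not values from this repository:

```python
from confluent_kafka import Producer

# kafka-python style (keyword arguments) -- not valid for confluent_kafka:
#   producer = KafkaProducer(bootstrap_servers="localhost:9092", retries=5)

# confluent_kafka style: a flat dict of librdkafka properties with dotted keys.
conf = {
    "bootstrap.servers": "localhost:9092",  # placeholder
    "retries": 5,
    "request.timeout.ms": 300000,
}
producer = Producer(conf)

# produce() enqueues asynchronously; the callback receives (err, msg).
producer.produce(
    "example-topic",  # placeholder topic
    value=b'{"data": []}',
    callback=lambda err, msg: print(err or f"delivered to {msg.topic()}"),
)
producer.flush()  # block until outstanding messages are delivered or fail
```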
🧰 Tools
🪛 Ruff
8-8: time imported but unused. Remove unused import: time (F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- src/workflows/airqo_etl_utils/message_broker_utils.py (3 hunks)
- src/workflows/airqo_etl_utils/setup.py (1 hunks)
🧰 Additional context used
🪛 Ruff
src/workflows/airqo_etl_utils/message_broker_utils.py
8-8: time imported but unused. Remove unused import: time (F401)
🔇 Additional comments (1)
src/workflows/airqo_etl_utils/message_broker_utils.py (1)
227-244: Confirm proper logging in the __on_delivery callback

The __on_delivery method correctly handles success and error scenarios using the logging framework. This enhances visibility into message delivery outcomes.
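For context, confluent_kafka invokes delivery callbacks with an (err, msg) pair. A minimal sketch of such a callback is shown below; the name on_delivery and the log messages are illustrative, not the actual __on_delivery implementation from this PR:

```python
import logging

logger = logging.getLogger(__name__)


def on_delivery(err, msg) -> None:
    """Delivery report callback: err is a KafkaError or None, msg is the produced Message."""
    if err is not None:
        logger.error(f"Delivery failed for key {msg.key()}: {err}")
    else:
        logger.info(
            f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}"
        )
```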
"confluent-kafka==2.5.0", | ||
"lz4==4.3.3", |
💡 Codebase verification
Incomplete Removal of kafka-python Detected

It appears that kafka-python is still being utilized in several parts of the codebase:

- src/workflows/airqo_etl_utils/message_broker.py: from kafka import KafkaProducer
- src/kafka-connectors/bigquery-connector/message_broker.py: from kafka import KafkaConsumer
- src/meta-data/message-broker.py: from kafka import KafkaConsumer
- src/gp-model/message-broker.py: from kafka import KafkaConsumer

while confluent-kafka has been introduced:

- src/workflows/airqo_etl_utils/message_broker_utils.py: from confluent_kafka import Producer, Consumer, TopicPartition

Recommendations:

- Update Imports: Replace all kafka-python imports with their confluent-kafka equivalents in the listed files (see the migration sketch below).
- Remove Dependency: Ensure that kafka-python is removed from the install_requires section in setup.py to prevent potential conflicts.
- Refactor Code: Verify that the Kafka-related code aligns with the confluent-kafka library's API, making the necessary adjustments for compatibility.
- Add Documentation: Consider updating comments or documentation to reflect the transition to confluent-kafka for future maintainers.
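As a rough illustration of what that migration could look like for one of the listed consumers, here is a minimal sketch. The broker address, group id, and topic name are placeholders, and the before/after mapping is an assumption about the existing code rather than a copy of it:

```python
# Before (kafka-python), roughly:
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("example-topic", bootstrap_servers="localhost:9092",
#                            group_id="example-group", auto_offset_reset="latest")
#   for message in consumer:
#       print(message.value)

# After (confluent-kafka), roughly equivalent behaviour:
from confluent_kafka import Consumer

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",  # placeholder
        "group.id": "example-group",            # placeholder
        "auto.offset.reset": "latest",
    }
)
consumer.subscribe(["example-topic"])  # placeholder topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # poll explicitly instead of iterating
        if msg is None:
            continue
        if msg.error():
            # confluent-kafka surfaces errors on the message object
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()
```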
🔗 Analysis chain
Excellent choice of Kafka client upgrade, but let's ensure compatibility.

The addition of confluent-kafka==2.5.0 and lz4==4.3.3 is a positive step towards improving your Kafka integration. The confluent-kafka library offers better performance and more features compared to kafka-python. The inclusion of lz4 suggests you're planning to implement message compression, which is great for optimizing network bandwidth and storage.
However, I have a few points to consider:

- The removal of kafka-python isn't visible in this diff. Was it removed in a separate commit? If not, you might want to remove it explicitly to avoid potential conflicts.
- The switch from kafka-python to confluent-kafka likely requires changes in your Kafka-related code. Have all relevant parts of the codebase been updated to use the new library?
- Consider adding a comment explaining the purpose of lz4 for future maintainers (a possible compression configuration is sketched below).
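For reference, a minimal sketch of how lz4 compression could be wired up on the confluent-kafka producer once the migration lands. Whether compression is actually intended here is an assumption, and the broker address and topic are placeholders:

```python
from confluent_kafka import Producer

# librdkafka (which backs confluent-kafka) handles lz4 compression itself;
# the lz4 pip package is typically only needed for application-level (de)compression.
producer = Producer(
    {
        "bootstrap.servers": "localhost:9092",  # placeholder
        "compression.type": "lz4",              # compress producer batches with lz4
    }
)
producer.produce("example-topic", value=b"hello")  # placeholder topic
producer.flush()
```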
Let's verify the Kafka client usage in the codebase:
This will help ensure that the transition to the new Kafka client is complete across the codebase.
Need to Correct the File Type in the Verification Script
It appears there was an issue with the previous script regarding the file type specification. Let's adjust the script to use the correct file type identifier for Python. Please execute the following updated script to verify the usage of kafka-python and confluent-kafka in the codebase:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining usage of kafka-python and new usage of confluent-kafka
echo "Checking for kafka-python usage:"
rg --type python 'from\s+kafka\s+import' || echo "No kafka-python usage found."
echo -e "\nChecking for confluent-kafka usage:"
rg --type python 'from\s+confluent_kafka\s+import' || echo "No confluent-kafka usage found."
Length of output: 459
Script:
#!/bin/bash
# Description: Check for any remaining usage of kafka-python and new usage of confluent-kafka
echo "Checking for kafka-python usage:"
rg --type py 'from\s+kafka\s+import' || echo "No kafka-python usage found."
echo -e "\nChecking for confluent-kafka usage:"
rg --type py 'from\s+confluent_kafka\s+import' || echo "No confluent-kafka usage found."
Length of output: 663
import numpy as np
import pandas as pd
from kafka import KafkaProducer
from confluent_kafka import Producer, Consumer, TopicPartition
import time
Remove unused import time
The time module is imported but not used anywhere in the code. Removing unused imports helps keep the code clean and maintainable.
Apply this diff to remove the unused import:
import json
import logging
import sys
import numpy as np
import pandas as pd
-from confluent_kafka import Producer, Consumer, TopicPartition
-import time
+from confluent_kafka import Producer, Consumer, TopicPartition
from .config import configuration
Committable suggestion was skipped due to low confidence.
🧰 Tools
🪛 Ruff
8-8: time imported but unused. Remove unused import: time (F401)
def _send_message(
    self, producer: Producer, topic: str, key: Any, message: str, partition=None
) -> None:
    """
    Sends a message to the specified Kafka topic using the provided producer.

    Args:
        producer: The Kafka Producer instance used to send the message.
        topic: The Kafka topic to send the message to.
        key: The key of the message (can be None if no key is needed).
        message: The message to send, typically serialized to JSON.
        partition: Optionally specify the partition to send the message to. If None, Kafka will decide the partition.

    Raises:
        Exception: If an error occurs while sending the message, the exception is logged.
    """
    try:
        if partition is not None:
            producer.produce(
                topic=topic,
                key=key,
                value=message,
                partition=partition,
                on_delivery=self.__on_delivery,
            )
        else:
            producer.produce(
                topic=topic, key=key, value=message, on_delivery=self.__on_delivery
            )
    except Exception as e:
        logger.exception(f"Error while sending message to topic {topic}: {e}")
Ensure messages are flushed in the _send_message method

After producing messages with producer.produce(), it's important to call producer.poll(0) to serve the delivery callback queue, and producer.flush() after all messages are produced to ensure they're delivered.

Apply this diff to serve the delivery queue after producing:

         except Exception as e:
             logger.exception(f"Error while sending message to topic {topic}: {e}")
+        producer.poll(0)

And after all messages are sent (outside the method), ensure producer.flush() is called.
Committable suggestion was skipped due to low confidence.
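For illustration, the produce/poll/flush split being discussed looks roughly like this from the caller's side. This is a sketch only; the send_one helper, topic, and broker address are hypothetical stand-ins, not code from this PR:

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder


def send_one(value: bytes) -> None:
    # hypothetical stand-in for _send_message
    producer.produce("example-topic", value=value)  # placeholder topic
    producer.poll(0)  # give librdkafka a chance to fire delivery callbacks


for i in range(10):
    send_one(f"record-{i}".encode("utf-8"))

# The caller flushes once after the whole batch; flush() blocks until all
# outstanding messages are delivered or fail.
producer.flush()
```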
This is being done in the caller method.
@NicholasTurner23, thank you for the clarification. I understand that message flushing is handled in the caller method.
✏️ Learnings added
Learnt from: NicholasTurner23
PR: airqo-platform/AirQo-api#3683
File: src/workflows/airqo_etl_utils/message_broker_utils.py:271-302
Timestamp: 2024-10-17T21:49:15.884Z
Learning: In the `MessageBrokerUtils` class, messages are flushed in the caller method after invoking `_send_message`, so it's not necessary to call `producer.flush()` within the `_send_message` method.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
def publish_to_topic(
    self,
    topic: str,
    data: pd.DataFrame,
    column_key: str = None,
    auto_partition: bool = True,
):
    """
    Publishes data to a Kafka topic. If a `column_key` is provided, each row's key will be
    extracted from the specified column, otherwise data is split into chunks and sent without keys.

    Args:
        topic: The Kafka topic to publish data to.
        data: A Pandas DataFrame containing the data to be published.
        column_key: Optional column name to be used as the message key. If None, data is chunked and sent without keys.
        auto_partition: If True, Kafka will automatically select the partition. If False, partitions are selected manually based on load.

    Raises:
        Exception: If an error occurs while sending a message, the exception is logged.
    """
    producer = Producer(self.conf)

    logger.info(f"Preparing to publish data to topic: {topic}")
    data.replace(np.nan, None, inplace=True)
    dataframe_list = data.to_dict("records")

    if column_key:
        logger.info(f"Using '{column_key}' as the key for messages")
        for row in dataframe_list:
            key = row.pop(column_key, None)
            if key is None:
                logger.warning(
                    f"No key found for column '{column_key}' in row: {row}"
                )
                continue
            message = json.dumps(row, allow_nan=True).encode("utf-8")

            selected_partition = (
                None if auto_partition else self.__get_least_loaded_partition()
            )
            self._send_message(producer, topic, key, message, selected_partition)

            if not auto_partition:
                self.partition_loads[selected_partition] += 1

    else:
        logger.info("No key provided, splitting data into chunks and publishing")
        for chunk_data in self._generate_chunks(dataframe_list):
            message = json.dumps({"data": chunk_data}).encode("utf-8")

            selected_partition = (
                None if auto_partition else self.__get_least_loaded_partition()
            )
            self._send_message(producer, topic, None, message, selected_partition)

            if not auto_partition:
                self.partition_loads[selected_partition] += 1

    producer.close()
Call producer.flush() after publishing messages

In the publish_to_topic method, after all messages are sent, it's crucial to call producer.flush() to ensure all messages have been delivered before closing the producer.
Apply this diff to flush and close the producer appropriately:
     producer = Producer(self.conf)

     logger.info(f"Preparing to publish data to topic: {topic}")
     data.replace(np.nan, None, inplace=True)
     dataframe_list = data.to_dict("records")

     if column_key:
         logger.info(f"Using '{column_key}' as the key for messages")
         for row in dataframe_list:
             key = row.pop(column_key, None)
             if key is None:
                 logger.warning(
                     f"No key found for column '{column_key}' in row: {row}"
                 )
                 continue
             message = json.dumps(row, allow_nan=True).encode("utf-8")

             selected_partition = (
                 None if auto_partition else self.__get_least_loaded_partition()
             )
             self._send_message(producer, topic, key, message, selected_partition)

             if not auto_partition:
                 self.partition_loads[selected_partition] += 1

     else:
         logger.info("No key provided, splitting data into chunks and publishing")
         for chunk_data in self._generate_chunks(dataframe_list):
             message = json.dumps({"data": chunk_data}).encode("utf-8")

             selected_partition = (
                 None if auto_partition else self.__get_least_loaded_partition()
             )
             self._send_message(producer, topic, None, message, selected_partition)

             if not auto_partition:
                 self.partition_loads[selected_partition] += 1

-    producer.close()
+    producer.flush()
+    producer.close()
This ensures all messages are successfully sent before the producer is closed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def publish_to_topic(
    self,
    topic: str,
    data: pd.DataFrame,
    column_key: str = None,
    auto_partition: bool = True,
):
    """
    Publishes data to a Kafka topic. If a `column_key` is provided, each row's key will be
    extracted from the specified column, otherwise data is split into chunks and sent without keys.

    Args:
        topic: The Kafka topic to publish data to.
        data: A Pandas DataFrame containing the data to be published.
        column_key: Optional column name to be used as the message key. If None, data is chunked and sent without keys.
        auto_partition: If True, Kafka will automatically select the partition. If False, partitions are selected manually based on load.

    Raises:
        Exception: If an error occurs while sending a message, the exception is logged.
    """
    producer = Producer(self.conf)

    logger.info(f"Preparing to publish data to topic: {topic}")
    data.replace(np.nan, None, inplace=True)
    dataframe_list = data.to_dict("records")

    if column_key:
        logger.info(f"Using '{column_key}' as the key for messages")
        for row in dataframe_list:
            key = row.pop(column_key, None)
            if key is None:
                logger.warning(
                    f"No key found for column '{column_key}' in row: {row}"
                )
                continue
            message = json.dumps(row, allow_nan=True).encode("utf-8")

            selected_partition = (
                None if auto_partition else self.__get_least_loaded_partition()
            )
            self._send_message(producer, topic, key, message, selected_partition)

            if not auto_partition:
                self.partition_loads[selected_partition] += 1

    else:
        logger.info("No key provided, splitting data into chunks and publishing")
        for chunk_data in self._generate_chunks(dataframe_list):
            message = json.dumps({"data": chunk_data}).encode("utf-8")

            selected_partition = (
                None if auto_partition else self.__get_least_loaded_partition()
            )
            self._send_message(producer, topic, None, message, selected_partition)

            if not auto_partition:
                self.partition_loads[selected_partition] += 1

    producer.flush()
    producer.close()
The close() method also performs a flush if I am not mistaken.
@NicholasTurner23, you are correct. The close() method of Producer performs a flush internally. Thank you for the clarification.
✏️ Learnings added
Learnt from: NicholasTurner23
PR: airqo-platform/AirQo-api#3683
File: src/workflows/airqo_etl_utils/message_broker_utils.py:303-362
Timestamp: 2024-10-17T21:44:36.359Z
Learning: In the `confluent_kafka` Producer class, calling `close()` performs a flush internally, so an explicit `flush()` before closing is unnecessary.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
def consume_from_topic(
    self,
    topic: str,
    group_id: str,
    auto_offset_reset: str = "latest",
    max_messages: Optional[int] = None,
    auto_commit: bool = True,
    from_beginning: bool = False,
    offset: Optional[int] = None,
    wait_time_sec: int = 30,
    streaming: bool = False,
) -> Any:
    """
    Consume messages from a Kafka topic and return them.

    Args:
        topic: The Kafka topic to consume from.
        group_id: The consumer group ID.
        auto_offset_reset: Determines where to start reading when there's no valid offset. Default is 'latest'.
        max_messages: Limit on the number of messages to consume. If None, consume all available messages.
        auto_commit: Whether to auto-commit offsets.
        from_beginning: Whether to start consuming from the beginning of the topic.
        offset: Start consuming from a specific offset if provided.
        wait_time_sec: How long to wait for messages (useful for one-time data requests).
        streaming: If True, run as a continuous streaming job.

    Return:
        A generator that yields consumed messages from the topic.
    """

    self.conf.update(
        {
            "group.id": group_id,
            "auto.offset.reset": auto_offset_reset,
            "enable.auto.commit": "true" if auto_commit else "false",
        }
    )

    consumer = Consumer(self.conf)
    consumer.subscribe([topic])

    assigned = False
    while not assigned and wait_time_sec > 0:
        logger.info("Waiting for partition assignment...")
        msg = consumer.poll(timeout=1.0)
        if msg is not None and msg.error() is None:
            assigned = True
        wait_time_sec -= 1

    if from_beginning:
        logger.info("Seeking to the beginning of all partitions...")
        partitions = [
            TopicPartition(topic, p.partition, offset=0)
            for p in consumer.assignment()
        ]
        consumer.assign(partitions)
        for partition in partitions:
            consumer.seek(partition)

    if offset is not None:
        logger.info(f"Seeking to offset {offset} for all partitions...")
        partitions = [
            TopicPartition(topic, p.partition, p.offset)
            for p in consumer.assignment()
        ]
        consumer.assign(partitions)

    message_count = 0
    try:
        while streaming or (message_count < max_messages if max_messages else True):
            msg = consumer.poll(timeout=1.0)

            if msg is None:
                logger.info("No messages in this poll.")
                if not streaming:
                    break
                continue

            if msg.error():
                logger.exception(f"Consumer error: {msg.error()}")
                continue

            msg_value = msg.value().decode("utf-8")
            yield {
                "key": msg.key().decode("utf-8"),
                "value": msg_value,
            } if msg.key() else {"value": msg_value}
            message_count += 1

    except Exception as e:
        logger.exception(f"Error while consuming messages from topic {topic}: {e}")

    finally:
        consumer.close()
        logger.info(
            f"Closed consumer. No more messages to consume from topic: {topic}"
        )
Validate loop termination conditions in consume_from_topic

In the consume_from_topic method, when streaming is False and max_messages is None, the loop might run indefinitely.

Ensure that the loop exits appropriately by modifying the condition:

    try:
        while streaming or (message_count < max_messages if max_messages else False):
            msg = consumer.poll(timeout=1.0)

This change prevents an infinite loop when not streaming and max_messages is not set.
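To see what the suggested else False changes, here is a small standalone sketch that only evaluates the two boolean conditions; it does not touch Kafka, and the helper name is made up for illustration:

```python
def should_continue(streaming: bool, message_count: int, max_messages, unbounded_fallback: bool) -> bool:
    # Mirrors the shape of the while-condition from the review:
    #   streaming or (message_count < max_messages if max_messages else <fallback>)
    return streaming or (
        message_count < max_messages if max_messages else unbounded_fallback
    )


# Original condition (fallback True): a non-streaming call with max_messages=None
# keeps looping and only exits when a poll returns no message.
print(should_continue(False, 10, None, True))   # True

# Suggested condition (fallback False): the same call stops immediately.
print(should_continue(False, 10, None, False))  # False
```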
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def consume_from_topic(
    self,
    topic: str,
    group_id: str,
    auto_offset_reset: str = "latest",
    max_messages: Optional[int] = None,
    auto_commit: bool = True,
    from_beginning: bool = False,
    offset: Optional[int] = None,
    wait_time_sec: int = 30,
    streaming: bool = False,
) -> Any:
    """
    Consume messages from a Kafka topic and return them.

    Args:
        topic: The Kafka topic to consume from.
        group_id: The consumer group ID.
        auto_offset_reset: Determines where to start reading when there's no valid offset. Default is 'latest'.
        max_messages: Limit on the number of messages to consume. If None, consume all available messages.
        auto_commit: Whether to auto-commit offsets.
        from_beginning: Whether to start consuming from the beginning of the topic.
        offset: Start consuming from a specific offset if provided.
        wait_time_sec: How long to wait for messages (useful for one-time data requests).
        streaming: If True, run as a continuous streaming job.

    Return:
        A generator that yields consumed messages from the topic.
    """

    self.conf.update(
        {
            "group.id": group_id,
            "auto.offset.reset": auto_offset_reset,
            "enable.auto.commit": "true" if auto_commit else "false",
        }
    )

    consumer = Consumer(self.conf)
    consumer.subscribe([topic])

    assigned = False
    while not assigned and wait_time_sec > 0:
        logger.info("Waiting for partition assignment...")
        msg = consumer.poll(timeout=1.0)
        if msg is not None and msg.error() is None:
            assigned = True
        wait_time_sec -= 1

    if from_beginning:
        logger.info("Seeking to the beginning of all partitions...")
        partitions = [
            TopicPartition(topic, p.partition, offset=0)
            for p in consumer.assignment()
        ]
        consumer.assign(partitions)
        for partition in partitions:
            consumer.seek(partition)

    if offset is not None:
        logger.info(f"Seeking to offset {offset} for all partitions...")
        partitions = [
            TopicPartition(topic, p.partition, p.offset)
            for p in consumer.assignment()
        ]
        consumer.assign(partitions)

    message_count = 0
    try:
        while streaming or (message_count < max_messages if max_messages else False):
            msg = consumer.poll(timeout=1.0)

            if msg is None:
                logger.info("No messages in this poll.")
                if not streaming:
                    break
                continue

            if msg.error():
                logger.exception(f"Consumer error: {msg.error()}")
                continue

            msg_value = msg.value().decode("utf-8")
            yield {
                "key": msg.key().decode("utf-8"),
                "value": msg_value,
            } if msg.key() else {"value": msg_value}
            message_count += 1

    except Exception as e:
        logger.exception(f"Error while consuming messages from topic {topic}: {e}")

    finally:
        consumer.close()
        logger.info(
            f"Closed consumer. No more messages to consume from topic: {topic}"
        )
The consume method can be used for streaming where it runs indefinitely as well as consuming data once when triggered. At least that is the intention.
@NicholasTurner23, thank you for the clarification. I understand that the consume_from_topic method is intended to run indefinitely when streaming is enabled, for continuous streaming, and to consume data once when triggered otherwise. I've noted this for future reviews.
🐰❓
✏️ Learnings added
Learnt from: NicholasTurner23
PR: airqo-platform/AirQo-api#3683
File: src/workflows/airqo_etl_utils/message_broker_utils.py:363-458
Timestamp: 2024-10-17T21:48:34.724Z
Learning: In the `src/workflows/airqo_etl_utils/message_broker_utils.py` file, the `consume_from_topic` method is intentionally designed to run indefinitely when `streaming` is `True`, for continuous streaming, and to consume data once when triggered otherwise.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Thank you, @NicholasTurner23, for this amazing contribution. I sincerely appreciate your efforts.
Regarding these proposed changes:
- Backward Compatibility: Could you please provide insights on the potential impact of these modifications on existing functionality? Specifically:
- Are there any known issues or conflicts with currently deployed systems?
- Have compatibility tests been conducted with our production environment?
- Critical Security Systems: Our organization has several critical security systems heavily reliant on Kafka. These systems need to remain operational throughout the ongoing work. Therefore, I would like to request:
- Confirmation that the proposed changes will not disrupt the current Kafka-based systems.
- Any mitigation strategies or rollback plans in place to ensure uninterrupted service during the transition period.
Description
This PR adds a dynamic Kafka producer and consumer to be used in the Kafka integration.
Related Issues
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Updated dependencies by replacing kafka-python with confluent-kafka and adding lz4.