You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am planning on working on an archiver consumer this week, that will create an archive file for each run. I think that this scenario is a problem because because we don't want documents from one Run, to end up in the archive of a different Run.
The Producer publishes runs to a partition based on a UID. If the consumer subscribes to a topic it will receive documents from multiple partitions. If it passes documents from all partitions directly to a RunRouter, this is equivalent to merging all of the partitions together. The RunRouter cannot perfectly separate the Runs due to missing run_start in the resource documents.
If we pass the msg.partition to process_document we can then keep the partitions isolated, which can solve this problem.
Is it possible to do two BlueskyRuns at the same time and have them end up on the same partition, so that we would need to separate runs that are part of the same stream? If not then we know that each time we see a new start document on a partition a new run has started.
The text was updated successfully, but these errors were encountered:
RunRouter cannot always tell which Run a datum or resource belongs to because resource documents don't always have a run_start uid. When a resource or datum's run can't be identified then it is distributed to all runs.
https://github.com/bluesky/event-model/blob/master/event_model/__init__.py#L1295
I am planning on working on an archiver consumer this week, that will create an archive file for each run. I think that this scenario is a problem because because we don't want documents from one Run, to end up in the archive of a different Run.
We can solve this problem by passing msg.partition to the BlueskyConsumer.process_document.
https://github.com/bluesky/bluesky-kafka/blob/master/bluesky_kafka/__init__.py#L405
The Producer publishes runs to a partition based on a UID. If the consumer subscribes to a topic it will receive documents from multiple partitions. If it passes documents from all partitions directly to a RunRouter, this is equivalent to merging all of the partitions together. The RunRouter cannot perfectly separate the Runs due to missing run_start in the resource documents.
If we pass the msg.partition to process_document we can then keep the partitions isolated, which can solve this problem.
Is it possible to do two BlueskyRuns at the same time and have them end up on the same partition, so that we would need to separate runs that are part of the same stream? If not then we know that each time we see a new start document on a partition a new run has started.
The text was updated successfully, but these errors were encountered: