Support New Segmenter #101

Merged
merged 42 commits into from
Mar 11, 2020

@bitsofbits bitsofbits commented Oct 13, 2019

General Support

  • Pass the last course and speed and the first course, speed, lat, and lon through
    the pipeline so that they are available on segments imported from the previous day.
  • Collect identity stats for each segment (shipnames, callsigns, etc).
  • Generate two sets of segment output, one compatible with the previous
    pipeline and one for the new stitcher, and write them to different output files.
  • Keep segments alive, even with no messages, until closed
  • Factor out the core of the segmenter implementation so that it's easier to call externally for testing.
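
As a rough illustration of the identity-stats bullet above, here is a minimal sketch in plain Python. It is not the code from this PR: the field names in `IDENTITY_FIELDS`, the `seg_id` key, and the function `collect_identity_stats` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical identity fields; the real pipeline may track a different set.
IDENTITY_FIELDS = ("shipname", "callsign", "imo")


def collect_identity_stats(messages):
    """Count how often each identity value appears, per segment.

    `messages` is an iterable of dicts carrying a "seg_id" key plus
    optional identity fields. Missing or empty values are ignored.
    """
    stats = defaultdict(lambda: {field: Counter() for field in IDENTITY_FIELDS})
    for msg in messages:
        for field in IDENTITY_FIELDS:
            value = msg.get(field)
            if value:
                stats[msg["seg_id"]][field][value] += 1
    return stats


# Made-up messages, for illustration only.
messages = [
    {"seg_id": "a", "shipname": "EXAMPLE ONE", "callsign": "XY12"},
    {"seg_id": "a", "shipname": "EXAMPLE ONE"},
    {"seg_id": "a", "shipname": "EXAMPLE 1"},
    {"seg_id": "b", "callsign": "ZZ99"},
]
stats = collect_identity_stats(messages)
```

Keeping counts rather than a single value per field lets a downstream consumer pick the most common name or callsign for a segment, or notice when a segment carries conflicting identities.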

Software Updates

  • Use pipe-tools 3.0: this fixed a bunch of versioning problems and
    needed to be done eventually.
  • Make tests run under both Python 2 and Python 3. This doesn't necessarily mean
    the whole pipeline will run under Python 3 yet.

Reduce Memory Usage

See Issue #100

Improves memory usage / performance in three ways:

  • Filters out noise segments before grouping. The new segmenter generates more
    noise segments, so this results in a large improvement in memory footprint.
  • Cogroups segments and messages rather than passing segments in as a side argument.
    This means each worker machine only gets the segments related to the messages it is
    processing, further reducing memory footprint and slightly speeding up operations.
  • Uses shards when writing segments as well as when writing messages. Improves the write time
    for segments, and may reduce per-worker memory usage.
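
The first two points can be sketched in plain Python. This is an illustration of the idea only, without the pipeline's own transforms: the `ssvid` grouping key, the `message_count` field, the `MIN_MESSAGES` threshold, and both function names are assumptions rather than code from this PR.

```python
from collections import defaultdict

# Hypothetical threshold: segments with fewer messages are treated as noise.
MIN_MESSAGES = 3


def drop_noise(segments, min_messages=MIN_MESSAGES):
    """Discard short segments before any grouping takes place."""
    return [s for s in segments if s["message_count"] >= min_messages]


def cogroup(segments, messages, key="ssvid"):
    """Group segments and messages under a shared key.

    Each key ends up with only its own segments and messages, in contrast
    to a side input, where every worker receives the full segment
    collection.
    """
    grouped = defaultdict(lambda: {"segments": [], "messages": []})
    for seg in segments:
        grouped[seg[key]]["segments"].append(seg)
    for msg in messages:
        grouped[msg[key]]["messages"].append(msg)
    return dict(grouped)


# Made-up records, for illustration only.
segments = [
    {"ssvid": "111", "seg_id": "111-a", "message_count": 40},
    {"ssvid": "111", "seg_id": "111-b", "message_count": 1},
    {"ssvid": "222", "seg_id": "222-a", "message_count": 12},
]
messages = [
    {"ssvid": "111", "timestamp": 1},
    {"ssvid": "222", "timestamp": 2},
    {"ssvid": "111", "timestamp": 3},
]

kept = drop_noise(segments)
grouped = cogroup(kept, messages)
```

Filtering first means the noise segments never take part in the grouping step, and grouping by key means the work for one vessel only has to hold that vessel's segments in memory.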

Net effect is that I can run over 1 month with 8192 MiB per worker, and perhaps less (that's the minimum I've tried), when previously I was still getting crashes with 64,000 MiB per worker!

@bitsofbits bitsofbits changed the title from "Reduce Memory Usage" to "Support New Segmenter" Oct 14, 2019

@enriquetuya enriquetuya left a comment

@bitsofbits check my questions and comments below.
Also, we need to add an entry to the CHANGES.log stating the changes.

pipe_segment/transform/segment.py (outdated, resolved)
airflow/post_install.sh (outdated, resolved)
airflow/pipe_segment_dag.py (outdated, resolved)
pipe_segment/options/segment.py (resolved)
pipe_segment/transform/segment_implementation.py (outdated, resolved)
@enriquetuya enriquetuya changed the base branch from gfw-tasks-1143-new-segmenter to develop March 11, 2020 20:14
@enriquetuya enriquetuya merged commit 49ee506 into develop Mar 11, 2020
@enriquetuya enriquetuya deleted the performance-improvements branch March 11, 2020 20:15
2 participants