This repository has been archived by the owner on Mar 3, 2023. It is now read-only.

Support running Heron on Amazon Ecs. #1837

Open
wants to merge 62 commits into
base: master
Changes from 16 commits
370873e
Added Ecs Docker Host Support
ananthgs May 3, 2017
8a10b03
Adding the ecs scheduler
ananthgs May 3, 2017
cee04fe
AWS ECS schedulers added
ananthgs May 3, 2017
c5b7f3e
Adding AWS ECS files
ananthgs May 3, 2017
8c02617
adding ecs to build scripts
ananthgs May 3, 2017
725891d
adding ecs to build
ananthgs May 3, 2017
5336c3c
Adding AMI Check dependancy
ananthgs May 3, 2017
e368e9a
Adding ecs conf
ananthgs May 4, 2017
9a80784
Added Compose Commands
ananthgs May 4, 2017
0aae022
Cleaned up non ecs functions
ananthgs May 4, 2017
83bf1a1
Added temp file for compose
ananthgs May 4, 2017
f1379f6
Removed reference to temp dir
ananthgs May 4, 2017
c38322c
Removed Working Dir reference
ananthgs May 4, 2017
7e575b9
Removed commented lines for unused imports
ananthgs May 5, 2017
4c6de0c
emoved commented lines on unused imports
ananthgs May 5, 2017
fefbbc8
Removing un referenced EcsKeys class
ananthgs May 5, 2017
edaa5fa
Removed the host env setting if its an Amazon ECS instance
ananthgs May 5, 2017
8efc62e
remving had coded values
ananthgs May 12, 2017
0b74033
removing hard coded values and ports
ananthgs May 12, 2017
fceacfa
new file: EcsKey.java
ananthgs May 12, 2017
532fffe
modified: EcsLauncher.java
ananthgs May 12, 2017
2d08fe5
fixing java and other paths
ananthgs May 12, 2017
57d9c91
new file: ecs_compose_template.yaml
ananthgs May 12, 2017
3db478f
tunneling set to false
ananthgs May 13, 2017
4a36f40
update gethost to handle docker and ecs AMI
ananthgs May 13, 2017
36dd734
passing esc ami host param
ananthgs May 14, 2017
9582f54
adding kill to schedulers
ananthgs May 14, 2017
3a2af21
modified: EcsKey.java
ananthgs May 14, 2017
406f298
removed hard coding values
ananthgs May 14, 2017
5b16369
added list tasks context
ananthgs May 16, 2017
3f321cf
added List Tasks Keys
ananthgs May 16, 2017
ac7ca93
added getJobLinks with ecs Task ID's
ananthgs May 16, 2017
5204be2
adde jackson JSON jars for ecs Schedulers
ananthgs May 16, 2017
52f40bb
new file: heron/config/src/yaml/conf/ecs/setupEcs.sh
ananthgs May 16, 2017
e74b0ef
new file: heron/config/src/yaml/conf/ecs/set-ecs-cluster-name.sh
ananthgs May 16, 2017
0824fa0
new file: heron/config/src/yaml/conf/ecs/ecs-heron-policy.json
ananthgs May 16, 2017
d3a7df5
new file: heron/config/src/yaml/conf/ecs/ecs-heron-role.json
ananthgs May 16, 2017
f65f865
fixed echo messages
ananthgs May 16, 2017
15a91f7
Fixed Log messages
ananthgs May 16, 2017
96a150d
new file: EcsSchedulerTest.java
ananthgs May 19, 2017
f2e4267
modified: setupEcs.sh
ananthgs May 19, 2017
be514cb
fixed version number
ananthgs May 19, 2017
0ece9cb
fixed version number
ananthgs May 19, 2017
2ef6665
modified: set-ecs-cluster-name.sh
ananthgs May 19, 2017
c224c82
new file: README
ananthgs May 21, 2017
30f4348
Clean up file for AWS
ananthgs May 21, 2017
b362b4d
modified: ecs_compose_template.yaml
ananthgs May 21, 2017
4b05a87
modified: setupEcs.sh
ananthgs May 21, 2017
2d9dbec
modified: scheduler.yaml
ananthgs May 21, 2017
0119231
added cluster binary context
ananthgs May 21, 2017
986e7f7
Added scheduler realated changes for scheduler info to be put in zook…
ananthgs May 21, 2017
f13b137
Added files for on schedule testing
ananthgs May 21, 2017
4cc598b
JAVA home for enable AWS scheduler
ananthgs May 21, 2017
168a474
Updating with correct document link
ananthgs May 22, 2017
f19d354
Adding Ecs AMI check as non mandatory field
ananthgs May 22, 2017
034eb1a
modified: tools/rules/heron_core.bzl
ananthgs May 22, 2017
246b922
modified: scripts/packages/BUILD
ananthgs May 22, 2017
f6f52d9
merging with changes on upstream
ananthgs May 22, 2017
3767043
Merge remote-tracking branch 'upstream/master'
ananthgs May 22, 2017
84b669c
Fixed duplicate lines due to merge with head branch
ananthgs May 22, 2017
88d5ae6
Fixed Cluster name to ecs-heron-cluster
ananthgs May 22, 2017
2f1bde1
modified: heron/config/src/yaml/conf/ecs/setupEcs.sh
ananthgs May 22, 2017
5 changes: 5 additions & 0 deletions heron/config/src/yaml/BUILD
@@ -17,6 +17,11 @@ filegroup(
    srcs = glob(["conf/local/*.yaml"]),
)

filegroup(
    name = "conf-ecs-yaml",
    srcs = glob(["conf/ecs/*.yaml"]),
)

filegroup(
    name = "conf-aurora-yaml",
    srcs = glob(["conf/aurora/*"]),
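
The new `conf-ecs-yaml` target globs `conf/ecs/*.yaml`, while `conf-aurora-yaml` globs `conf/aurora/*`. The sketch below illustrates the difference between the two pattern shapes. It approximates single-directory glob matching with Python's `pathlib` and is not Bazel's own implementation; the file list is a hypothetical sample whose `.sh` and `.json` names are taken from the commit titles in this pull request.

```python
from pathlib import PurePosixPath

# Hypothetical sample of files; not an actual listing of the repository.
CANDIDATES = [
    "conf/ecs/scheduler.yaml",
    "conf/ecs/packing.yaml",
    "conf/ecs/setupEcs.sh",
    "conf/ecs/ecs-heron-policy.json",
    "conf/local/scheduler.yaml",
]


def select(pattern, paths):
    """Return the paths that match a single-directory glob pattern."""
    return [p for p in paths if PurePosixPath(p).match(pattern)]


# Pattern shape used by conf-ecs-yaml: only YAML files are picked up.
ecs_yaml = select("conf/ecs/*.yaml", CANDIDATES)

# Pattern shape used by conf-aurora-yaml, applied to the ecs directory:
# every file in the directory is picked up, including scripts and JSON.
ecs_all = select("conf/ecs/*", CANDIDATES)
```

Under this approximation, shell scripts and JSON policy files placed next to the YAML configs would not be part of a `*.yaml` filegroup and would need their own target or a broader pattern.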
272 changes: 272 additions & 0 deletions heron/config/src/yaml/conf/ecs/heron_internals.yaml
@@ -0,0 +1,272 @@
################################################################################
# Default values for various configs used inside Heron.
################################################################################
# All the config associated with time is in the unit of milli-seconds,
# unless otherwise specified.
################################################################################
# All the config associated with data size is in the unit of bytes, unless
# otherwise specified.
################################################################################

################################################################################
# System level configs
################################################################################

### heron.* configs are general configurations over all components

# The relative path to the logging directory
heron.logging.directory: "log-files"

# The maximum log file size in MB
heron.logging.maximum.size.mb: 100

# The maximum number of log files
heron.logging.maximum.files: 5

# The interval in seconds after which to check if the tmaster location has been fetched or not
heron.check.tmaster.location.interval.sec: 120

# The interval in seconds to prune logging files in C++
heron.logging.prune.interval.sec: 300

# The interval in seconds to flush log files in C++
heron.logging.flush.interval.sec: 10

# The threshold level to log error
heron.logging.err.threshold: 3

# The interval in seconds for different components to export metrics to metrics manager
heron.metrics.export.interval.sec: 60

# The maximum count of exceptions in one MetricPublisherPublishMessage protobuf
heron.metrics.max.exceptions.per.message.count: 1024

################################################################################
# Configs related to Stream Manager, starts with heron.streammgr.*
################################################################################

# Maximum size in bytes of a packet to be sent out from stream manager
heron.streammgr.packet.maximum.size.bytes: 102400

# The tuple cache (used for batching) can be drained in two ways:
# (a) Time based
# (b) size based

# The frequency in ms to drain the tuple cache in stream manager
heron.streammgr.cache.drain.frequency.ms: 10

# The size-based threshold in MB for draining the tuple cache
heron.streammgr.cache.drain.size.mb: 100

# For efficient acknowledgements
heron.streammgr.xormgr.rotatingmap.nbuckets: 3

# The reconnect interval to other stream managers in secs for stream manager client
heron.streammgr.client.reconnect.interval.sec: 1

# The reconnect interval to tmaster in seconds for stream manager client
heron.streammgr.client.reconnect.tmaster.interval.sec: 10

# The maximum packet size in MB of stream manager's network options
heron.streammgr.network.options.maximum.packet.mb: 100

# The interval in seconds to send heartbeat
heron.streammgr.tmaster.heartbeat.interval.sec: 10

# Maximum batch size in MB to read by stream manager from socket
heron.streammgr.connection.read.batch.size.mb: 1

# Maximum batch size in MB to write by stream manager to socket
heron.streammgr.connection.write.batch.size.mb: 1

# Number of times we should wait to see a buffer full while enqueueing data
# before declaring start of back pressure
heron.streammgr.network.backpressure.threshold: 3

# High water mark on the amount of data in MB that can be left outstanding on a connection
heron.streammgr.network.backpressure.highwatermark.mb: 100

# Low water mark on the amount of data in MB that can be left outstanding on a connection
heron.streammgr.network.backpressure.lowwatermark.mb: 50

################################################################################
# Configs related to Topology Master, starts with heron.tmaster.*
################################################################################

# The maximum interval in minutes of metrics to be kept in tmaster
heron.tmaster.metrics.collector.maximum.interval.min: 180

# The maximum number of times to retry establishing the tmaster
heron.tmaster.establish.retry.times: 30

# The interval in seconds between retries to establish the tmaster
heron.tmaster.establish.retry.interval.sec: 1

# Maximum packet size in MB of tmaster's network options to connect to stream managers
heron.tmaster.network.master.options.maximum.packet.mb: 16

# Maximum packet size in MB of tmaster's network options to connect to scheduler
heron.tmaster.network.controller.options.maximum.packet.mb: 1

# Maximum packet size in MB of tmaster's network options for stat queries
heron.tmaster.network.stats.options.maximum.packet.mb: 1

# The interval in seconds for tmaster to purge metrics from socket
heron.tmaster.metrics.collector.purge.interval.sec: 60

# The maximum # of exceptions to be stored in tmaster metrics collector, to prevent potential OOM
heron.tmaster.metrics.collector.maximum.exception: 256

# Should the metrics reporter bind on all interfaces
heron.tmaster.metrics.network.bindallinterfaces: False

# The timeout in seconds for stream mgr, compared with (current time - last heartbeat time)
heron.tmaster.stmgr.state.timeout.sec: 60

################################################################################
# Configs related to Metrics Manager, starts with heron.metricsmgr.*
################################################################################

# The size of packets to read from socket will be determined by the minimum of:
# (a) time based
# (b) size based

# Time based, the maximum batch time in ms for metricsmgr to read from socket
heron.metricsmgr.network.read.batch.time.ms: 16

# Size based, the maximum batch size in bytes to read from socket
heron.metricsmgr.network.read.batch.size.bytes: 32768

# The size of packets to write to socket will be determined by the minimum of
# (a) time based
# (b) size based

# Time based, the maximum batch time in ms for metricsmgr to write to socket
heron.metricsmgr.network.write.batch.time.ms: 16

# Size based, the maximum batch size in bytes to write to socket
heron.metricsmgr.network.write.batch.size.bytes: 32768

# The maximum socket's send buffer size in bytes
heron.metricsmgr.network.options.socket.send.buffer.size.bytes: 6553600

# The maximum socket's received buffer size in bytes of metricsmgr's network options
heron.metricsmgr.network.options.socket.received.buffer.size.bytes: 8738000

################################################################################
# Configs related to Heron Instance, starts with heron.instance.*
################################################################################

# The queue capacity (num of items) in bolt for buffering packets to read from stream manager
heron.instance.internal.bolt.read.queue.capacity: 128

# The queue capacity (num of items) in bolt for buffering packets to write to stream manager
heron.instance.internal.bolt.write.queue.capacity: 128

# The queue capacity (num of items) in spout for buffering packets to read from stream manager
heron.instance.internal.spout.read.queue.capacity: 1024

# The queue capacity (num of items) in spout for buffering packets to write to stream manager
heron.instance.internal.spout.write.queue.capacity: 128

# The queue capacity (num of items) for metrics packets to write to metrics manager
heron.instance.internal.metrics.write.queue.capacity: 128

# The size of packets read from stream manager will be determined by the minimum of
# (a) time based
# (b) size based

# Time based, the maximum batch time in ms for instance to read from stream manager per attempt
heron.instance.network.read.batch.time.ms: 16

# Size based, the maximum batch size in bytes to read from stream manager
heron.instance.network.read.batch.size.bytes: 32768

# The size of packets written to stream manager will be determined by the minimum of
# (a) time based
# (b) size based

# Time based, the maximum batch time in ms for instance to write to stream manager per attempt
heron.instance.network.write.batch.time.ms: 16

# Size based, the maximum batch size in bytes to write to stream manager
heron.instance.network.write.batch.size.bytes: 32768

# The maximum socket's send buffer size in bytes
heron.instance.network.options.socket.send.buffer.size.bytes: 6553600

# The maximum socket's received buffer size in bytes of instance's network options
heron.instance.network.options.socket.received.buffer.size.bytes: 8738000

# The maximum # of data tuples to batch in a HeronDataTupleSet protobuf
heron.instance.set.data.tuple.capacity: 1024

# The maximum size in bytes of data tuples to batch in a HeronDataTupleSet protobuf
heron.instance.set.data.tuple.size.bytes: 8388608

# The maximum # of control tuples to batch in a HeronControlTupleSet protobuf
heron.instance.set.control.tuple.capacity: 1024

# The maximum time in ms for a spout to do acknowledgement per attempt, the ack batch could
# also break if there are no more ack tuples to process
heron.instance.ack.batch.time.ms: 128

# The maximum time in ms for a spout instance to emit tuples per attempt
heron.instance.emit.batch.time.ms: 16

# The maximum batch size in bytes for a spout to emit tuples per attempt
heron.instance.emit.batch.size.bytes: 32768

# The maximum time in ms for a bolt instance to execute tuples per attempt
heron.instance.execute.batch.time.ms: 16

# The maximum batch size in bytes for a bolt instance to execute tuples per attempt
heron.instance.execute.batch.size.bytes: 32768

# The time interval for an instance to check the state change,
# for example, the interval a spout uses to check whether activate/deactivate is invoked
heron.instance.state.check.interval.sec: 5

# The time in ms to wait before the instance exits forcibly when an uncaught exception occurs
heron.instance.force.exit.timeout.ms: 2000

# Interval in seconds to reconnect to the stream manager, including the request timeout in connecting
heron.instance.reconnect.streammgr.interval.sec: 5
heron.instance.reconnect.streammgr.times: 60

# Interval in seconds to reconnect to the metrics manager, including the request timeout in connecting
heron.instance.reconnect.metricsmgr.interval.sec: 5
heron.instance.reconnect.metricsmgr.times: 60

# The interval in seconds for an instance to sample its system metrics, for example, CPU load.
heron.instance.metrics.system.sample.interval.sec: 10

heron.instance.slave.fetch.pplan.interval.sec: 1

# For efficient acknowledgement
heron.instance.acknowledgement.nbuckets: 10

################################################################################
# For dynamically tuning the available sizes in the internal read & write queues
# to provide high performance while avoiding GC issues
################################################################################

# The expected size on read queue in bolt
heron.instance.tuning.expected.bolt.read.queue.size: 8

# The expected size on write queue in bolt
heron.instance.tuning.expected.bolt.write.queue.size: 8

# The expected size on read queue in spout
heron.instance.tuning.expected.spout.read.queue.size: 512

# The expected size on write queue in spout
heron.instance.tuning.expected.spout.write.queue.size: 8

# The expected size on metrics write queue
heron.instance.tuning.expected.metrics.write.queue.size: 8

heron.instance.tuning.current.sample.weight: 0.8

# Interval in ms to tune the size of in & out data queue in instance
heron.instance.tuning.interval.ms: 100
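
Several settings in this file come in pairs, `*.batch.time.ms` and `*.batch.size.bytes`, with the comment that the batch is bounded by the minimum of the time-based and size-based limits. The following is a minimal Python sketch of that rule, not Heron's implementation: the function `read_batch`, its `clock` parameter, and the use of an in-memory deque in place of a socket are all assumptions made for illustration.

```python
import time
from collections import deque


def read_batch(queue, max_bytes, max_ms, clock=time.monotonic):
    """Drain packets until the byte budget or the time budget is used up.

    Whichever limit is reached first ends the batch. The first packet is
    always accepted so that one oversized packet cannot stall the queue.
    """
    deadline = clock() + max_ms / 1000.0
    batch, used = [], 0
    while queue:
        if clock() >= deadline:
            break  # time-based limit reached
        size = len(queue[0])
        if batch and used + size > max_bytes:
            break  # size-based limit reached
        batch.append(queue.popleft())
        used += size
    return batch


# Example with the defaults shown above: 32768 bytes and 16 ms.
pending = deque([b"x" * 10000 for _ in range(5)])
first = read_batch(pending, max_bytes=32768, max_ms=16, clock=lambda: 0.0)
```

With a clock that never advances, only the size limit applies, so `first` holds three 10000-byte packets and two remain queued for the next attempt.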
81 changes: 81 additions & 0 deletions heron/config/src/yaml/conf/ecs/metrics_sinks.yaml
@@ -0,0 +1,81 @@
########### These all have default values as shown

# We would specify the unique sink-id first
sinks:
- file-sink
- tmaster-sink

########### Now we would specify the detailed configuration for every unique sink
########### Syntax: sink-id: - option(s)

########### option class is required as we need to instantiate a new instance by reflection
########### option flush-frequency-ms is required to invoke flush() at interval
########### option sink-restart-attempts, representing # of times to restart a sink when it throws exceptions and dies.
########### If this option is omitted, default value 0 would be supplied; a negative value means to restart it forever.

########### Other options would be constructed as an immutable map passed to IMetricsSink's init(Map conf) as argument,
########### We would be able to fetch value by conf.get(options), for instance:
########### We could get "com.twitter.heron.metricsmgr.sink.FileSink" if conf.get("class") is called inside file-sink's instance

### Config for file-sink
file-sink:
  class: "com.twitter.heron.metricsmgr.sink.FileSink"
  flush-frequency-ms: 60000 # 1 min
  sink-restart-attempts: -1 # Forever
  filename-output: "metrics.json" # File for metrics to write to
  file-maximum: 5 # maximum number of files saved on disk

### Config for tmaster-sink
tmaster-sink:
  class: "com.twitter.heron.metricsmgr.sink.tmaster.TMasterSink"
  flush-frequency-ms: 60000
  sink-restart-attempts: -1 # Forever
  tmaster-location-check-interval-sec: 5
  tmaster-client:
    reconnect-interval-second: 5 # The re-connect interval to TMaster from TMasterClient
    # The size of packets written to TMaster will be determined by the minimum of: (a) time based (b) size based
    network-write-batch-size-bytes: 32768 # Size based, the maximum batch size in bytes to write to TMaster
    network-write-batch-time-ms: 16 # Time based, the maximum batch time in ms for Metrics Manager to write to TMaster per attempt
    network-read-batch-size-bytes: 32768 # Size based, the maximum batch size in bytes to read from TMaster
    network-read-batch-time-ms: 16 # Time based, the maximum batch time in ms for Metrics Manager to read from TMaster per attempt
    socket-send-buffer-size-bytes: 6553600 # The maximum socket's send buffer size in bytes
    socket-received-buffer-size-bytes: 8738000 # The maximum socket's received buffer size in bytes
  tmaster-metrics-type:
    "__emit-count": SUM
    "__execute-count": SUM
    "__fail-count": SUM
    "__ack-count": SUM
    "__complete-latency": AVG
    "__execute-latency": AVG
    "__process-latency": AVG
    "__jvm-uptime-secs": LAST
    "__jvm-process-cpu-load": LAST
    "__jvm-memory-used-mb": LAST
    "__jvm-memory-mb-total": LAST
    "__jvm-gc-collection-time-ms": LAST
    "__server/__time_spent_back_pressure_initiated": SUM
    "__time_spent_back_pressure_by_compid": SUM

### Config for scribe-sink
# scribe-sink:
#   class: "com.twitter.heron.metricsmgr.sink.ScribeSink"
#   flush-frequency-ms: 60000
#   sink-restart-attempts: -1 # Forever
#   scribe-host: "127.0.0.1" # The host of scribe to be exported metrics to
#   scribe-port: 1463 # The port of scribe to be exported metrics to
#   scribe-category: "scribe-category" # The category of the scribe to be exported metrics to
#   service-namespace: "heron" # The service name of the metrics in scribe-category
#   scribe-timeout-ms: 200 # The timeout in milliseconds for metrics manager to write metrics to scribe
#   scribe-connect-server-attempts: 2 # The maximum retry attempts to connect to scribe server
#   scribe-retry-attempts: 5 # The maximum retry attempts to write metrics to scribe
#   scribe-retry-interval-ms: 100 # The interval to retry to write metrics to scribe

### Config for graphite-sink
### Currently the graphite-sink is disabled
# graphite-sink:
#   class: "com.twitter.heron.metricsmgr.sink.GraphiteSink"
#   flush-frequency-ms: 60000
#   graphite_host: "127.0.0.1" # The host of graphite to be exported metrics to
#   graphite_port: 2004 # The port of graphite to be exported metrics to
#   metrics_prefix: "heron" # The prefix of every metrics
#   server_max_reconnect-attempts: 20 # The max reconnect attempts when failing to connect to graphite server
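
The comments in this file describe two rules that are easy to misread: the `sink-restart-attempts` default (0 when omitted, negative meaning forever) and the per-metric aggregation named in `tmaster-metrics-type`. The Python sketch below spells both out under stated assumptions. It is not the Metrics Manager's code: the helper names `restarts_allowed` and `aggregate` are hypothetical, and reading `AVG` as the arithmetic mean and `LAST` as the most recent sample is an interpretation of the names, which the config itself does not define.

```python
def restarts_allowed(sink_conf, restarts_so_far):
    """Apply the sink-restart-attempts rule from the config comments."""
    attempts = sink_conf.get("sink-restart-attempts", 0)  # omitted -> 0
    if attempts < 0:
        return True  # negative value: restart forever
    return restarts_so_far < attempts


def aggregate(kind, samples):
    """Combine samples according to an assumed tmaster-metrics-type value."""
    if not samples:
        raise ValueError("no samples to aggregate")
    if kind == "SUM":
        return sum(samples)
    if kind == "AVG":
        return sum(samples) / len(samples)  # assumed: arithmetic mean
    if kind == "LAST":
        return samples[-1]  # assumed: most recent sample wins
    raise ValueError("unknown aggregation type: " + kind)


# file-sink and tmaster-sink both set -1, so they would always be restarted.
file_sink = {"sink-restart-attempts": -1}
always = restarts_allowed(file_sink, restarts_so_far=1000)
```

A sink that omits the option gets no restarts at all, which is worth keeping in mind when adding a new sink entry to this file.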
2 changes: 2 additions & 0 deletions heron/config/src/yaml/conf/ecs/packing.yaml
@@ -0,0 +1,2 @@
# packing algorithm for packing instances into containers
heron.class.packing.algorithm: com.twitter.heron.packing.roundrobin.RoundRobinPacking
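
For readers unfamiliar with the packing step, the following Python sketch shows the general idea of round-robin assignment of instances to containers. It is only an illustration of the technique named by the class above, not the logic of `com.twitter.heron.packing.roundrobin.RoundRobinPacking`, which may also account for resources; the function name and the instance labels are hypothetical.

```python
def round_robin_pack(instances, num_containers):
    """Assign instances to containers in turn: 0, 1, ..., n-1, 0, 1, ..."""
    if num_containers < 1:
        raise ValueError("num_containers must be at least 1")
    containers = [[] for _ in range(num_containers)]
    for index, instance in enumerate(instances):
        containers[index % num_containers].append(instance)
    return containers


# Hypothetical topology with two spout instances and three bolt instances.
plan = round_robin_pack(
    ["spout_1", "spout_2", "bolt_1", "bolt_2", "bolt_3"], num_containers=2
)
```

Here `plan` places `spout_1`, `bolt_1` and `bolt_3` in the first container and `spout_2` and `bolt_2` in the second.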
8 changes: 8 additions & 0 deletions heron/config/src/yaml/conf/ecs/scheduler.yaml
@@ -0,0 +1,8 @@
# scheduler class for distributing the topology for execution
heron.class.scheduler: com.twitter.heron.scheduler.ecs.EcsScheduler

# launcher class for submitting and launching the topology
heron.class.launcher: com.twitter.heron.scheduler.ecs.EcsLauncher

# location of java - pick it up from shell environment
heron.directory.sandbox.java.home: ${JAVA_HOME}
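
The `${JAVA_HOME}` value relies on the variable being resolved from the environment when the config is loaded. The Python sketch below shows one way such a `${NAME}` placeholder could be expanded. It is an assumption for illustration and does not reproduce Heron's own config substitution; the `expand` helper and the example path are hypothetical.

```python
import re

# Matches ${NAME} where NAME is a typical environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(value, env):
    """Replace every ${NAME} in value with env[NAME]; fail if it is unset."""

    def lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError("environment variable not set: " + name)
        return env[name]

    return _VAR.sub(lookup, value)


# Hypothetical environment; in practice this would be os.environ.
env = {"JAVA_HOME": "/usr/lib/jvm/java-8"}
java_home = expand("${JAVA_HOME}", env)
java_bin = expand("${JAVA_HOME}/bin/java", env)
```

Failing loudly on an unset variable is a design choice in this sketch: an empty `JAVA_HOME` inside the container would otherwise surface later as a less obvious launch error.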