All notable changes to this project will be documented in this file. This project adheres to Semantic Versioning.
- #595 - Unable to import the compressionFormat enum in another repository that uses Hyperion
- #578 - Correctly serialize the default value of datetime params.
- #590 - Hyperion-level support to merge and split files for the Bz2 compression format
- Script `run-python.sh` now expects a `requirements.txt` file instead of `requirements.pip`.
- #583 - Housekeeping updates: sbt 1.3.0, Scala 2.12.9, and stubborn 1.4.1. Also update READMEs.
- #579 - Update build to Scala 2.12.8 and sbt 1.2.8
- #576 - Support configuration of EBS volumes for EMR clusters
- #571 - HyperionCLI and AwsClient now use the default configuration for region instead of falling back to us-east-1
- #568 - Generalize Schedule and AdpSchedule to support expressions.
- #566 - Add withPreStepCommands and withPostStepCommands in S3DistCpActivity
- #565 - Add script to deploy hyperion scripts to S3.
- #557 - Fix Delayed scheduler.
- #560 - Keep consistency of `PipelineActivity[_ <: ResourceObject]`. Change WorkerGroup to a trait.
- #546 - Change Ec2Resource to set security group ids only when security groups are not set
- #558 - Change Ec2Resource to set security groups and ids only when subnet id is not set
- #554 - Handle Parquet and ORC formats in Redshift Copy
- #550 - Update sbt to 1.1.6
- #449 - Update AWS SDK dependency to use 1.11.x instead of a specific version of 1.11
- #547 - Added support for AwsS3CpActivity to fail silently if the copy script fails
- #545 - Add profile support for the overwrite script in AwsS3CpActivity
- (doc) Updated test case message and updated Scala 2.12.4 to 2.12.6
- #460 - Add scheduler delay feature to DataPipelinePipelineDefGroup
- #521 - Update AWS SDK dependency to 1.11.238 and deprecate support for AWS_SECURITY_TOKEN
- #503 - Updated dependencies including scopt, json4s, etc.
- #515 - Remove support for Scala 2.10 and Java 7. Add Scala 2.12 and start build with Java 8
- #532 - Update commons-io dependency to 2.6
- #465 - Update default EC2 AMI to HVM-IS 2017.09.1
- Introduce a BaseEmrCluster which is the base trait of EmrCluster and LegacyEmrCluster
- LegacyEmrCluster is for pre emr release label 4.x.x
- EmrCluster is for post release label 4.x.x
- MapReduceCluster has been removed and replaced by the above two different clusters
- Introduce BaseEmrStep
- EmrStep is a generic step that can construct any script runner or command runner based activities
- HadoopStep is a step that runs Hadoop-based jobs, where one can optionally specify a main class in addition to arguments
- SparkStep is reworked to better support command-runner (use `SparkStep.legacyScriptRunner` to run Spark steps on pre emr-4.0.0 EMR clusters)
- EmrActivity is no longer a trait but a case class. It should be used for all EMR-based activities, including Spark; the former SparkActivity is now simply EmrActivity with Spark steps (see the sketch after this block).
- SparkActivity is removed
- SparkTaskActivity has been reworked to closer follow the new SparkStep approach. Use
SparkTaskActivity.legacyScriptRunner
to run spark activities on pre emr-4.0.0 EMR clusters. SparkCommandRunner
trait is removedEmrConfiguration
now always require aclassification
, emptyclassification
is not marked as deprecated- Relaxes spark related activities to be able to run on any EMR cluster, the compiler will not check the validity, and leave this to the developer
- Added default `hyperion.emr.release_label` of emr-5.12.0
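
A minimal usage sketch of the reworked EMR API described in the block above. The package paths and builder methods (`withMainClass`, `withArguments`, `withSteps`) are assumptions based on the library's `with*` naming conventions rather than verified 5.x signatures; `jar`, `mainClass`, and `inputPath` are placeholder values, and an implicit HyperionContext is assumed to be in scope.

```scala
// Hypothetical sketch only; imports and builder names are assumed, not verified.
import com.krux.hyperion.activity.{ EmrActivity, SparkStep }
import com.krux.hyperion.resource.EmrCluster

// EmrCluster picks up the default hyperion.emr.release_label (emr-5.12.0).
val cluster = EmrCluster()

// A Spark step for release-label >= 4.x clusters; per the entry above, use
// SparkStep.legacyScriptRunner for pre emr-4.0.0 clusters instead.
val step = SparkStep(jar)
  .withMainClass(mainClass)
  .withArguments("--input", inputPath)

// EmrActivity, now a case class, is used for all EMR-based activities including Spark.
val sparkActivity = EmrActivity(cluster).withSteps(step)
```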
- #540 - Fixed a bug where the script handling Spark options did not work well with spaces
- #527 - Updated all serialize calls on pipeline objects to lazy vals to fix an issue where very deep dependencies caused slow serialization
- #524 - Updated sbt major version and Scala patch version
- #517 - Fixed header option when 1 input file is used in SplitMergeFilesActivity
- #507 - Support STS token when the environment variable is set
- #512 - Fix invalid syntax bug for AVRO format in RedshiftCopyOption
- #510 - Add AVRO to RedshiftCopyOption
- #502 - Throw exception when pipeline defines no workflows
- #497 - Replace the Retry Implementation with Stubborn
- #495 - Fixed the program name for SendSlackMessage
- #492 - Downgrade to AWS SDK to 1.10.75
- #489 - Download Spark jar to `/mnt/hyperion` instead of `~/hyperion`
- #487 - Update GoogleStorageUploadActivity to allow recursive copy
- #484 - Update to use AWS SDK to 1.11.93
- #479 - Make cli.Reads public
- #480 - Add ability to select distinct in SelectTableQuery
- #353 - Enable .reduceLeft(_ ~> _) on a list of pipeline activities
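
For example, assuming `activities` is a non-empty sequence of pipeline activities and the workflow DSL's `~>` operator is in scope:

```scala
// Fold a dynamic list of activities into one sequential workflow expression.
val workflow = activities.reduceLeft(_ ~> _)
```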
- #477 - Make `initTimeout` an optional global configuration
- #475 - Duplicate arguments for Hadoop Activity
- #459 - Add support of setting maximumRetries for resources
- #471 - Add support for `aws s3 cp` CLI arguments in AwsS3CpActivity
- #467 - Handle InvalidRequestException properly during pipeline creation
- #468 - Update to Scala 2.11.8
- #463 - Add back logging of the number of objects
- #461 - Fixed GPG encrypt and decrypt on first run
- #457 - Add support for "APPEND" insert mode to redshift copy activity
- #455 - Use emrManaged*SecurityGroupId instead of *SecurityGroupId
- #452 - When an EMR cluster uses configuration to specify a release label, the AMI field is not overridden
- #448 - Fix a bug that SparkActivity does not work with EMR release label
- #450 - Update the default AMI to use instance store AMIs instead of EBS backed
- #289 - Allow JarActivity to have environment variables and additional classpath JARs
- #443 - Allow HiveActivity to accept multiple input and output parameters
- #439 - Schedule.ondemand results in pipeline creation failure
- #436 - PgpActivity should expose withInput, withOutput and markSuccessful
- #434 - Allow S3DistCpActivity to receive Parameters.
Please refer to the wiki page for details of migrating from v3 to v4.
- #344 - Add support for defining multiple pipelines with shared schedules within one definition with DataPipelineDefGroup
- `com.krux.hyperion.HyperionAwsClient` is rewritten and replaced by `com.krux.hyperion.client.AwsClient`
- `com.krux.hyperion.WorkflowExpression` is moved to `com.krux.hyperion.workflow.WorkflowExpression`
- #403 - Updated the default ec2 instance AMI to Amazon Linux AMI 2016.03.2 released on 2016-06-09
- #356 - Escape `,` in arguments of Emr and Spark steps
- #430 - Add a `--no-check` flag to not check for the existence of a pipeline before creating
- #333 - Allow explicitly converting an Activity to a WorkflowExpression
- #380 - Support EMR release label 4.x
- #374 - PgpEncryptActivity to encrypt files using GNU implementation of OpenPGP
- #375 - PgpDecryptActivity to decrypt OpenPGP-encrypted files
- #420 - Fix the incorrect retry message
- #418 - Max retry should be configurable and use exponential backoff and jitter instead of fixed interval
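
For illustration, a generic exponential-backoff-with-full-jitter calculation in plain Scala (not Hyperion's actual implementation):

```scala
import scala.util.Random

// Random delay in [0, min(cap, base * 2^attempt)) milliseconds.
def delayMillis(attempt: Int, baseMillis: Long = 1000L, capMillis: Long = 60000L): Long =
  (Random.nextDouble() * math.min(capMillis, baseMillis * (1L << math.min(attempt, 20)))).toLong
```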
- #416 - SparkStep and SparkTaskActivity need to be able to pass a HdfsUri to withArguments()
- #404 - PythonActivity script now uses the correct virtualenv path
- #410 - Support all Redshift unload options for RedshiftUnloadActivity
- #401 - Handle `.compare(_)` on parameters without default values
- #398 - Add support for securityGroupIds in Ec2Resource
- #388 - Make hyperion.log.uri optional
- #397 - Ability to set maxActiveInstances optional field for activities
- #394 - Allow using both s3 and hdfs URIs in S3DistCpActivity
- #390 - Show more detail on validation errors and warnings
- #373 - CLI and the aws client should retry with some delay on throttling exception
- #386 - Use JavaConverters instead of JavaConversions
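
For example, the explicit conversions (valid on the Scala 2.11/2.12 versions used by this project):

```scala
import scala.collection.JavaConverters._

val javaList = new java.util.ArrayList[String]()
javaList.add("my-pipeline")

val asScala: List[String] = javaList.asScala.toList  // explicit .asScala instead of implicit conversion
val asJava: java.util.List[String] = asScala.asJava
```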
- #381 - S3DistCpActivity fails when using emr-release label 4.X
- #376 - Adds Multiple EmrConfiguration Support
- #370 - The standard bootstrap script now fails on older versions of AMI (< 3.x)
- #365 - Change standard bootstrap action to use aws cli instead of hadoop
- #362 - Do not emit empty arrays for EmrConfiguration properties
- #360 - Unable to create MapReduceCluster with release label
- #357 - Add recursive option to GoogleStorageDownloadActivity
- #321 - Add overloaded methods accepting HS3Uri for Activities
- #351 - DeleteS3PathActivity - add option to check for existence of S3 path
- #349 - SplitMergeFilesActivity needs to pass `temporary-directory`
- #345 - The default alarm message now contains a link to the pipeline
- #337 - Implement SendSlackMessage
- #335 - WorkflowGraphRenderer use name instead of id
- #323 - Extend `DateTimeExp` to include `format`
- #327 - Add a --param option to handle override parameters with a comma in the value
- #328 - SftpActivity was broken in 3.0 - hard-coded to 'download'
- #324 - Workflow should be evaluated at the last minute possible
- #318 - SendFlowdockMessageActivity should use the corresponding HType in apply
- #320 - A few shell-command-based activities are missing input / output
- #315 - Fixed a bug where the input and output references in CopyActivity were not included
- #313 - Added a startThisHourAt option to Schedule
- #310 - Fix a bug where preconditions were missing the referenced objects
- #213 - Start using the `name` field instead of forcing `id` and `name` to be the same
- #304 - Add the missing options to preconditions
- #300 - Add a value option to the encrypted and unencrypted methods to create new parameters through the Parameter object
- #299 - Fixes ConstantExpression implicits to avoid unnecessary import
- #298 - Make sequence of native type to sequence of HType implicitly available
- #295 - Refactor parameter with ad hoc polymorphism using a type class instead of reflection TypeTags
- #248 - Refactor parameter to have EncryptedParameter and UnencryptedParameter
- #281 - Support for not failing on un-defined pipeline parameters
- #291 - Clean up the implicits
- #285 - SnsAlarm requires a topic ARN; added default subject and message
- #286 - Fix a bug in 3.0 where the main class in the jar activity is incorrect
- #282 - Add support for getting hyperion aws client by pipeline name
- #280 - Upgrade to scala 2.10.6
- #243 - Revisit and refactor expression and parameter
- The actionOnTaskFailure and actionOnResourceFailure fields are removed from EMR activities; they do not belong there.
- Database objects are changed to be consistent with other objects; this means one needs to initialize a database object instead of extending a trait
- Removed hadoopQueue from `HiveCopyActivity` and `PigActivity` as it is not documented by AWS
- SparkJobActivity is renamed to SparkTaskActivity to be consistent with the `preActivityTaskConfig` field for similar activity naming from AWS
- #271 - Separate CLI with DataPipelineDef
- #214 - Extend CLI to be able to read parameters to be passed from pipeline
- #291 - Upgrade AWS SDK to 1.10.43
- #277 - InsertTableQuery actually needs the values placeholders
- #275 - Schedule is not honouring settings in non-application.conf config
- #273 - Add `ACCEPTINVCHARS` and the rest of the Data Conversion Parameters to redshift copy options
- #269 - Sftp download auth cancels when using username and password
- #267 - Passing 0 to stopAfter should reset end to None
- #264 - CLI schedule override only the explicitly specified part
- #262 - Add slf4j-simple to examples
- #240 - Support EmrConfiguration and Property
- #241 - Support HttpProxy
- #255 - Provide explanations for CLI options
- #256 - Use a logging framework instead of println
- #209 - Override start activation time on command line
- #249 - Implement a simpleName value on MainClass to get just the class name itself
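
A plain-Scala illustration of the idea (not the library's actual code), which also drops the trailing `$` that Scala objects carry (see #78 below):

```scala
// Hypothetical helper: "com.krux.example.SparkJob$" -> "SparkJob"
def simpleName(fullyQualified: String): String =
  fullyQualified.stripSuffix("$").split('.').last
```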
- #252 - Add option to Graph to exclude data nodes (or make it the default)
- #251 - Graph still emits resources (just not resource dependencies) when not using --include-resources
- #239 - Capability to generate graph of workflow
- #237 - Allow Spark*Activity to override driver-memory
- #234 - SplitMergeFiles should allow ignoring cases where there are no input files
- #224 - Spark*Activity should allow setting parameters for spark jobs
- #229 - Convert S3DistCpActivity to a HadoopActivity instead of EmrActivity
- #228 - Allow specifying options to S3DistCpActivity
- #226 - Improves SetS3AclActivity with canned acl enum and more flexible apply
- #223 - Contrib activity that sets S3 ACL
- #220 - Make SparkActivity download jar to different directory to avoid race condition of jobs running in parallel.
- #217 - DateTimeExpression methods return the wrong expression.
- #211 - RedshiftUnloadActivity fails when containing expressions with `'`
- #207 - Make workflow expression DSL available to pipeline def by default.
- #204 - HadoopActivity and SparkJobActivity should support input and output data nodes
- #202 - WorkflowGraph fails with assertion if not using named
- #200 - SendEmailActivity must allow setting of debug and starttls
- #191 - Create a SparkActivity-type step that runs a single step using HadoopActivity instead of MapReduceActivity
- #160 - Better SNS alarm format support
- #197 - Update the default EMR AMI version to 3.7 and Spark version to 1.4.0
- #195 - RepartitionFile emitting empty files
- #192 - StringParameter should have implicit conversion to String
- #186 - Change collection constructors to use `.empty`
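
For example:

```scala
// Preferred empty constructors instead of Seq[String]() or Map[String, String]()
val arguments: Seq[String]    = Seq.empty
val tags: Map[String, String] = Map.empty
```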
- #188 - SftpDownloadActivity should obey skip-empty as well and it needs to properly handle empty compressed files
- #189 - SftpUploadActivity, SftpDownloadActivity and SplitMergeFilesActivity should be able to write a _SUCCESS file
- #184 - Properties for new notification activities are not properly exposed in the Activity definition
- #181 - Remove `spark.yarn.user.classpath.first` conf for running Spark
- #172 - Create activity to send generic SNS message
- #173 - Create activity to send generic SQS message
- #174 - Create activity to send Flowdock notifications
- #179 - Single quotes in SFTP Activity's date format break DataPipeline
- #177 - The SFTP activity should support a --since option to download files since a date
- #175 - Need to be able to pass options to java in addition to arguments to the main class
- #166 - If the input is empty, split-merge should not create an empty file with headers
- #167 - SftpActivity needs an option to not upload empty files
- #157 - Use a separate workflow/dependency graph to manage dependency building
- #162 - Need way to specify no activity, to allow omitting steps in a workflow expression
- #155 - Workflow breaks when having ArrowDependency on the right hand side.
- #153 - The create --force action doesn't detect existing pipelines if there are more than 25 active pipelines
- #150 - The whenMet method returns DataNode instead of S3DataNode
- #149 - Preconditions are not returned in objects for DataNodes
- #146 - RepartitionFile doesn't properly add header if creating a single merged file
- #144 - SplitMergeFileActivity isn't properly compressing final merged output
- #142 - Arguments to SFTP activity are incorrect
- #140 - SendEmailActivity runner isn't being published
- #138 - Make parameter key work for keys starting with a lowercase letter
- #136 - Fix a bug where the database object is not included
- #133 - SftpActivity needs to support S3 URLs for identity file and download as appropriate
- #131 - SplitMergeFiles should take strings for bufferSize and bytesPerFile
- #2 - Implement SftpUploadActivity
- #3 - Implement SftpDownloadActivity
- #98 - Add an activity to use SES to send emails rather than mailx
- #103 - Provide an activity to split files
- #107 - Support Worker Groups
- #108 - Add attemptTimeout
- #109 - Add lateAfterTimeout
- #110 - Add maximumRetries
- #111 - Add retryDelay
- #112 - Add failureAndRerunMode
- #115 - Add ShellScriptConfig
- #116 - Add HadoopActivity
- #125 - Support collections on WorkflowExpression
- #127 - Better type safety for MainClass
- #106 - Upgrade to Scala 2.11.7
- #113 - Reorder parameters for consistency
- #114 - Move non-core activities to a contrib project
- #117 - Better type safety for PipelineObjectId
- #118 - Better type safety for DpPeriod
- #119 - Better type safety for S3 URIs
- #120 - Better type safety for scripts/scriptUris
- #121 - RedshiftUnloadActivity's Access Key Id/Secret should be encrypted StringParameters
- #122 - AdpS3DataNode should be a 1:1 match to AWS objects
- #123 - Rename S3DataNode.fromPath to apply
- #128 - Schedule to be constructed via cron/timeSeries/onceAtActivation
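
A hedged sketch of the resulting construction style inside a pipeline definition; the chained builder names (`startAtActivation`, `every`, `stopAfter`) and the duration syntax are assumptions modeled on the project's examples, not verified signatures:

```scala
// Hypothetical sketch; assumes the hyperion implicits are imported for 1.day.
override def schedule = Schedule.cron
  .startAtActivation
  .every(1.day)
  .stopAfter(3)
```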
- #129 - Merge ExpressionDSL into Expression classes and expand functions available
- #130 - Rename DateTimeRef to RuntimeSlot to denote real uses
- #99 - Hyperion CLI driver should exit with appropriate error codes
- #91 - Workflow DSL broken when the right-hand side of andThen has dependencies. Note that `act1 + act2` is no longer the same as `Seq(act1, act2)`.
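
A hedged illustration of the distinction, using the DSL operators referenced in this changelog (`+` to group activities and `~>` to order them); `prepare`, `act1`, and `act2` are placeholder activities:

```scala
// act1 + act2 is a single workflow expression (the two run together);
// Seq(act1, act2) is just a collection with no workflow semantics attached.
val together = act1 + act2
val ordered  = prepare ~> together   // both act1 and act2 depend on prepare
```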
- #101 - Allow workflow DSL to have duplicated activities.
- #25 - Added a run-python runner script and PythonActivity
- #89 - Added an activity to email input staging folders
- #90 - Added an activity to merge input staging folders and upload to output staging folders
- #80 - Change jar-based activities/steps to require a jar
- #83 - Remove dependency assertion in WorkflowDSL
- #84 - Drop dependsOn and require WorkflowDSL
- #81 - Regression: --region parameter is now effectively required on non-EC2 instances due to a call to `getCurrentRegion`.
- #78 - Strip trailing $ from MainClass
- #65 - Ability to use roles via STS assume-role
- #68 - No longer specify AWS keys in configuration for RedshiftUnloadActivity - now must specify as arguments to activity
- #74 - DataNode should return path using toString
- #64 - Supports non-default region
- #69 - Role and ResourceRole were not getting properly defaulted on resources
- #4 - Added S3DistCpActivity
- #63 - ActionOn* and SchedulerType case objects properly inherit from trait
- #62 - role and resourceRole to EmrCluster types as well as additional missing properties
- #59 - workflow DSL
- #54 - with* methods that take a sequence are now additive; withColumns(Seq[String]) is replaced with withColumns(String...)
- #56 - reorganize objects into packages by type
- #50 - In ShellCommandActivity, make command and scriptUri Either
- #51 - When taskInstanceCount == 0 need to make sure other taskInstance parameters are set to None
- #48 - Pipeline blows up if sns.topic is not set
- #46 - Support remaining properties on resources
- #45 - Support VPC by adding subnetId
- Use Option to construct options instead of Some
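
The distinction in plain Scala: `Option(...)` turns `null` into `None`, while `Some(...)` happily wraps it:

```scala
// None when the variable is unset; Some(null) with the second form.
val safe: Option[String]  = Option(System.getenv("HYPERION_HOME"))
val risky: Option[String] = Some(System.getenv("HYPERION_HOME"))
```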
- #40 - Hyperion CLI continues to retry deleting the pipeline when --force is used
- #41 - Refactor Option to Option[Seq] functions
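
A hypothetical example of the kind of helper such a refactor suggests: collapse an empty sequence to `None` so an optional list field can be omitted when serialized:

```scala
// Hypothetical helper, not the library's actual code.
def seqToOption[A](xs: Seq[A]): Option[Seq[A]] =
  if (xs.isEmpty) None else Option(xs)
```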
- #33 - Added support for tags
- #6 - Support remaining schedule aspects
- #14 - Make datapipelineDef be able to have a CLI and remove the Hyperion executable
- #5 - Support parameters
- #26 - ShellCommandActivity input and output should actually be a sequence of DataNodes.
- #18 - Renamed runCopyActivity on EC2Resource to runCopy
- #13 - Support SQL related databases and the relevant data nodes
- #20 - Support Actions
- #9 - Additional activity types (PigActivity, HiveActivity, HiveCopyActivity, CopyActivity)
- #15 - downgrade json4s to 3.2.10
- #11 - Spark and MapReduce should dependOn PipelineActivity
- First public release