rel: Prep for release of v1.20 (#635)

Final doc tweaks and prep for release of 1.20.
honeycombio · Mar 10, 2023 · a032b4d
1 parent 29ef0f0
commit a032b4d

Showing 5 changed files with 168 additions and 9 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,58 @@
 # Refinery Changelog
 
+## 1.20.0 2023-03-10
+
+### Summary
+This is a significant new release of Refinery, with several features designed to help when operating Refinery at scale.
+
+For details on all of the new features, please see the [new Release Notes document](./RELEASE_NOTES.md).
+New features must be enabled by adjusting configuration.
+
+### Enhancements
+- feat: Add configuration for trace and parent ID field names (#630) | [Davin Taddeo](https://github.com/tdarwin)
+- feat: allow ability to add new attributes to refinery data (#621) | [Faith Chikwekwe](https://github.com/fchikwekwe)
+- feat: Add ability to set Redis database and prefix in config (#614) | [Kent Quirk](https://github.com/kentquirk)
+- perf: Improve performance of stress relief (#604) | [Kent Quirk](https://github.com/kentquirk)
+- feat: Stress Relief system (#594) | [Kent Quirk](https://github.com/kentquirk)
+- feat: extend and unify metrics system (#593) | [Kent Quirk](https://github.com/kentquirk)
+- feat: allow user to convert datatype if valid (#585) | [Faith Chikwekwe](https://github.com/fchikwekwe)
+- feat: Implement alternative sharding using rendezvous hash to improve dynamic scalability (#570) | [Kent Quirk](https://github.com/kentquirk)
+- feat: On shutdown, remove ourself from the peers list (#569) | [Kent Quirk](https://github.com/kentquirk)
+- feat: Add cuckoo-based drop cache (#567) | [Kent Quirk](https://github.com/kentquirk)
+- feat: Extract Sent Cache to an interface for future expansion (#561) | [Kent Quirk](https://github.com/kentquirk)
+
+### Bug fixes
+- fix: do not send sample rate in dry run (#611) | [Faith Chikwekwe](https://github.com/fchikwekwe)
+- fix: Remove API key logging (#606) | [Tyler Helmuth](https://github.com/TylerHelmuth)
+- fix: Fix flaky tests, clean up logic on rules (#596) | [Kent Quirk](https://github.com/kentquirk)
+- fix: Add missing done channel to fix build (#573) | [Kent Quirk](https://github.com/kentquirk)
+
+### Maintenance
+- chore: publish should only happen on main (#627) | [Kent Quirk](https://github.com/kentquirk)
+- chore: Publish every build to honeycomb's ecr (#613) | [Kent Quirk](https://github.com/kentquirk)
+- docs: update FieldList (#591) | [Tyler Helmuth](https://github.com/TylerHelmuth)
+- docs: add environment variables (#589) | [Tyler Helmuth](https://github.com/TylerHelmuth)
+- chore: Update CODEOWNERS (#588) | [Tyler Helmuth](https://github.com/TylerHelmuth)
+- chore: Change workflow to use Collections board (#587) | [Kent Quirk](https://github.com/kentquirk)
+- chore: update dependabot (#583) | [Kent Quirk](https://github.com/kentquirk)
+- chore: update validate PR title workflow (#572) | [Purvi Kanal](https://github.com/pkanal)
+- chore: validate PR title (#571) | [Purvi Kanal](https://github.com/pkanal)
+- refactor: Change Router to use TraceServer (#607) | [Tyler Helmuth](https://github.com/TylerHelmuth)
+- maint(deps): bump golang.org/x/net from 0.4.0 to 0.7.0 (#628) | dependabot[bot]
+- maint(deps): bump github.com/pelletier/go-toml/v2 from 2.0.6 to 2.0.7 (#620) | dependabot[bot]
+- maint(deps): bump github.com/honeycombio/husky from 0.19.0 to 0.21.0 (#619) | dependabot[bot]
+- maint(deps): bump github.com/klauspost/compress from 1.15.15 to 1.16.0 (#618) | dependabot[bot]
+- maint(deps): bump github.com/stretchr/testify from 1.8.1 to 1.8.2 (#616) | dependabot[bot]
+- maint(deps): bump github.com/honeycombio/husky from 0.17.0 to 0.19.0 (#603) | dependabot[bot]
+- maint(deps): bump github.com/hashicorp/golang-lru from 0.5.4 to 1.0.1 (#602) | dependabot[bot]
+- maint(deps): bump github.com/klauspost/compress from 1.15.12 to 1.15.15 (#601) | dependabot[bot]
+- maint(deps): bump github.com/honeycombio/dynsampler-go from 0.2.1 to 0.3.0 (#600) | dependabot[bot]
+- maint(deps): bump grpc to 1.52.3 (#599) | [Kent Quirk](https://github.com/kentquirk)
+- maint(deps): bump github.com/spf13/viper from 1.13.0 to 1.15.0 (#597) | dependabot[bot]
+- maint(deps): Bump github.com/prometheus/client_golang from 1.13.0 to 1.14.0 (#576) | dependabot[bot]
+- maint(deps): Bump github.com/tidwall/gjson from 1.14.3 to 1.14.4 (#575) | dependabot[bot]
+- maint(deps): Bump github.com/hashicorp/golang-lru from 0.5.4 to 1.0.1 (#574) | dependabot[bot]
+
 ## 1.19.0 2022-11-09
 
 Adds new query command to retrieve configuration metadata, and also allows for a new (optional) cache management strategy that should be more effective at preventing OOM crashes in situations where memory is under pressure.

diff --git a/README.md b/README.md
@@ -5,6 +5,11 @@
 [![OSS Lifecycle](https://img.shields.io/osslifecycle/honeycombio/refinery?color=success)](https://github.com/honeycombio/home/blob/main/honeycomb-oss-lifecycle-and-practices.md)
 [![Build Status](https://circleci.com/gh/honeycombio/refinery.svg?style=shield)](https://circleci.com/gh/honeycombio/refinery)
 
+## Release Information
+
+For a detailed list of linked pull requests merged in each release, see [CHANGELOG.md](./CHANGELOG.md).
+For more readable information about recent changes, please see [RELEASE_NOTES.md](./RELEASE_NOTES.md).
+
 ## Purpose
 
 Refinery is a trace-aware sampling proxy. It collects spans emitted by your application, gathers them into traces, and examines them as a whole. This enables Refinery to make an intelligent sampling decision (whether to keep or discard) based on the entire trace. Buffering the spans allows you to use fields that might be present in different spans within the trace to influence the sampling decision. For example, the root span might have HTTP status code, whereas another span might have information on whether the request was served from a cache. Using Refinery, you can choose to keep only traces that had a 500 status code and were also served from a cache.
@@ -90,7 +95,7 @@ Note, `REFINERY_HONEYCOMB_METRICS_API_KEY` takes precedence over `REFINERY_HONEY
 
 ### Mixing Classic and Environment & Services Rule Definitions
 
-With the change to support environemt and services in Honeycomb, some users will want to support both sending telemetry to a classic dataset and a new environment called the same thing (eg `production`).
+With the change to support Environments in Honeycomb, some users will want to send telemetry both to a classic dataset and to a new environment with the same name (e.g. `production`).
 
 This can be accomplished by leveraging the new `DatasetPrefix` configuration property and then using that prefix in the rules definitions for the classic datasets.
 
@@ -127,7 +132,7 @@ For more detail on how this algorithm works, please refer to the `dynsampler` pa
 
 ## Dry Run Mode
 
-When getting started with Refinery or when updating sampling rules, it may be helpful to verify that the rules are working as expected before you start dropping traffic. By enabling dry run mode, all spans in each trace will be marked with the sampling decision in a field called `refinery_kept`. All traces will be sent to Honeycomb regardless of the sampling decision. You can then run queries in Honeycomb on this field to check your results and verify that the rules are working as intended. Enable dry run mode by adding `DryRun = true` in your configuration, as noted in `rules_complete.toml`.
+When getting started with Refinery or when updating sampling rules, it may be helpful to verify that the rules are working as expected before you start dropping traffic. By enabling dry run mode, all spans in each trace will be marked with the sampling decision in a field called `refinery_kept`. All traces will be sent to Honeycomb regardless of the sampling decision. The SampleRate will not be changed, but the calculated SampleRate will be stored in a field called `meta.dryrun.sample_rate`. You can then run queries in Honeycomb to check your results and verify that the rules are working as intended. Enable dry run mode by adding `DryRun = true` in your configuration, as noted in `rules_complete.toml`.
 
 When dry run mode is enabled, the metric `trace_send_kept` will increment for each trace, and the metric for `trace_send_dropped` will remain 0, reflecting that we are sending all traces to Honeycomb.
 

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -0,0 +1,100 @@
+# Release Notes
+
+While [CHANGELOG.md](./CHANGELOG.md) contains detailed documentation and links to all of the source code changes in a given release, this document is intended to give a more readable summary of each release from the point of view of Refinery users.
+
+## Version 1.20.0
+
+This is a significant new release of Refinery, with several features designed to help when operating Refinery at scale:
+
+### Stress Relief
+
+It has been hard to operate Refinery efficiently at scale. Because of the way it works, it can quickly transition into instability during a spike in traffic, and it has been hard to decide what to change to keep it stable.
+
+In v1.20, a "Stress Relief" system has been added. When properly configured, it tracks Refinery's load and, if Refinery is in danger of becoming unstable, switches into a high-performance load-shedding mode designed to relieve stress on the system.
+
+When Stress Relief is activated, Refinery stops collecting and distributing traces for evaluation after the trace is complete. Instead, it samples spans deterministically based on the TraceID and immediately forwards (or drops) them without further evaluation. It will continue doing so until the load subsides.
+
+It also indicates in the logs which of its configuration values is most under stress, which should help tune it.
+
+Stress Relief is controlled by the [StressRelief](https://github.com/honeycombio/refinery/blob/main/config_complete.toml#L512) section of the configuration.
+
+Stress Relief generally operates by comparing specific metrics for memory and queue sizes to their configured maximum values. Each metric is treated differently, but in general the heuristic is an attempt to detect problems as they are about to happen rather than waiting for them to be in crisis. The new `stress_level` metric will show the results of this calculation on a scale from 0 to 100, and Stress Relief determines its activation by this metric.
+
+The Stress Relief `Mode` can be set to:
+- `never` -- It will never activate -- this is the default.
+- `always` -- It is always active -- useful for testing or in an emergency
+- `monitor` -- Refinery monitors its own status and adjusts its activity according to the `ActivationLevel` and `DeactivationLevel`.
+
+The `ActivationLevel` and `DeactivationLevel` values control when the stress relief system turns on or off.
+
+When Stress Relief activates, its logs will indicate the activation along with the name of the particular configuration value that could be adjusted to reduce future stress.
+
+Stress Relief currently monitors these metrics:
+- `collector_peer_queue_length`
+- `collector_incoming_queue_length`
+- `libhoney_peer_queue_length`
+- `libhoney_upstream_queue_length`
+- `memory_heap_allocation`
+
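+A minimal configuration sketch for the `monitor` mode described above. The threshold numbers are illustrative assumptions (taken to be on the same 0 to 100 scale as `stress_level`), not recommended values; the `StressRelief` section of the config file remains the authoritative reference for the available options.
+
+```toml
+[StressRelief]
+	# One of "never" (the default), "always", or "monitor".
+	Mode = "monitor"
+	# Assumed example thresholds, not tuned recommendations.
+	# Stress Relief turns on when stress_level reaches ActivationLevel
+	# and turns off again once it falls to DeactivationLevel.
+	ActivationLevel = 75
+	DeactivationLevel = 25
+```
+
+Keeping `DeactivationLevel` well below `ActivationLevel` should help avoid rapid toggling when the load hovers near a single threshold.
+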
+### New Metrics
+
+Some new metrics have been added and the internal metrics systems have been unified and made easier to update. This is to support the stress relief system (see above) and future plans.
+
+New metrics include:
+- `stress_level` -- a gauge from 0 to 100
+- `stress_relief_activated` -- a gauge at 0 or 1
+
+### New "Datatype" parameter in rule conditions
+
+- An additional field may be specified in Refinery's rule conditions -- if `Datatype` is specified (it must be one of "bool", "int", "float", or "string"), both the field and the comparison value are converted to that datatype before the comparison. This allows a single rule to handle multiple datatypes. Probably the best example is `http.status`, which is sometimes a string and sometimes an integer, depending on the programming environment.
+
+Example:
+```toml
+		[[myworld.rule.condition]]
+			field = "status_code"
+			operator = ">"
+			value = "400"
+			datatype = "int"
+```
+
+### Configurable Trace and Parent IDs
+
+The names that Refinery uses for traceID and parentID are now configurable.
+
+Set `TraceIdFieldNames` and `ParentIdFieldNames` in the [configuration](https://github.com/honeycombio/refinery/blob/main/config_complete.toml#L160) to the list of field names you prefer. The default values are those that Refinery has used to date: `trace.trace_id`, `trace.parent_id`, `traceId`, and `parentId`.
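+
+As a sketch, spelling out the defaults listed above might look like the following. The grouping of the four names into the two lists is inferred from the names themselves, and top-level placement in the config file is an assumption; check the linked configuration for the exact form.
+
+```toml
+# Assumed layout: each setting takes a list of candidate field names.
+TraceIdFieldNames = ["trace.trace_id", "traceId"]
+ParentIdFieldNames = ["trace.parent_id", "parentId"]
+```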
+
+### Inject specific constant values to telemetry
+
+`AdditionalAttributes` in config is a map that can be used to inject specific user-defined attributes, such as a cluster ID. These attributes will be added to all spans that are sent to Honeycomb. Both keys and values must be strings.
+
+Example:
+```toml
+[[AdditionalAttributes]]
+	ClusterName="MyCluster"
+```
+
+### Trace Decision Caching lasts longer
+
+Refinery keeps a record of its trace decisions -- whether it kept or dropped a given trace. Past versions of Refinery had a trace decision cache that was fixed to 5x the size of the trace cache. In v1.20, Refinery has a new cache strategy (called "cuckoo") that separates drop decisions (where all it needs to remember is that the trace was dropped) from kept decisions (where it also tracks some metadata about the trace, such as the number of spans in the trace). It can now cache millions of drop decisions, and many thousands of kept decisions, which should help ensure trace integrity for users with long-lived traces.
+
+It is controlled by the [SampleCacheConfig](https://github.com/honeycombio/refinery/blob/main/config_complete.toml#L466) section of the config file.
+
+To turn it on, set `Type = "cuckoo"`. For compatibility, it is disabled by default.
+
+The defaults are reasonable for most configurations, but it has several options. To control it, set `KeptSize`, `DroppedSize`, and `SizeCheckInterval`. See the config for details.
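+
+A hedged sketch of enabling the cuckoo cache. Only `Type = "cuckoo"` and the option names come from the text above; the sizes, the interval, and the interval's duration-string format are illustrative assumptions, so consult the config file for the real defaults and accepted formats.
+
+```toml
+[SampleCacheConfig]
+	Type = "cuckoo"
+	# Assumed example sizes: thousands of kept decisions,
+	# on the order of a million drop decisions.
+	KeptSize = 10000
+	DroppedSize = 1000000
+	# Assumed duration format.
+	SizeCheckInterval = "10s"
+```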
+
+### Late Span Metadata
+
+If `AddRuleReasonToTrace` is specified, Refinery already adds metadata to spans indicating which Refinery rule caused the keep decision. In v1.20, spans arriving after the trace's sampling decision has already been made will have their `meta.refinery.reason` set to `late` before sending to Honeycomb. This should help in diagnosing trace timeout issues.
+
+### Improved cluster operations
+
+- When Refinery shuts down, it will try to remove itself from the peers list, which should shorten the time of instability in the cluster.
+- The algorithm controlling how traces are distributed to peers in the cluster has been revamped so that traces are much more likely to stay on the same peer during reconfiguration. In previous releases, a change in the peer count would affect roughly half of the traces in flight. With this new algorithm, only 1/N (where N is the number of peers) will be affected. Set the [Peer Management](https://github.com/honeycombio/refinery/blob/main/config_complete.toml#L183) `Strategy` to `hash` to enable it.
+- More Redis configuration is available to make it possible for multiple deployments to share a single Redis instance. Adjust the `RedisPrefix` and `RedisDatabase` parameters in the `PeerManagement` section of the config.
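+
+For illustration, a `PeerManagement` fragment combining the settings above might look like this. The prefix and database values are hypothetical, the value types (a string and an integer) are assumptions, and other peer settings a real deployment needs are omitted.
+
+```toml
+[PeerManagement]
+	# Enables the new rendezvous-hash trace distribution.
+	Strategy = "hash"
+	# Hypothetical values that let this deployment share a Redis
+	# instance with others without colliding on keys.
+	RedisPrefix = "refinery-staging"
+	RedisDatabase = 1
+```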
+
+### Dry Run works better
+
+- Dry Run mode no longer sets the Sample Rate, which means that Honeycomb queries will still be accurate in this mode. Instead, it sets `meta.dryrun.sample_rate` to the calculated sample rate.
+
+
diff --git a/config_complete.toml b/config_complete.toml
@@ -585,10 +585,10 @@ MetricsReportingInterval = 3
 # MinimumStartupDuration = 3s
 
 
-# AdditionalAttributes is a map that can be used for defining user defined 
-# attributes. This could be used for naming refinery clusters or other uses. 
-# The map is currently limted to both string keys and string values. 
+# AdditionalAttributes is a map that can be used for injecting user-defined
+# attributes. For example, it could be used for naming a refinery cluster.
+# Both keys and values must be strings.
 
-[[AdditionalAttributes]] 
-	ClusterName="MyCluster"
-	environment="production"
+# [[AdditionalAttributes]]
+# 	ClusterName="MyCluster"
+# 	environment="production"
diff --git a/rules_complete.toml b/rules_complete.toml
@@ -248,13 +248,14 @@ SampleRate = 1
 
 	# Note that Refinery comparisons are type-dependent. If you are operating in an environment where different
 	# telemetry may send the same field with different types (for example, some systems send status codes as "200"
-	# instead of 200), you may need to create additional rules to cover these cases.
+	# instead of 200), you may wish to use the "datatype" setting to force them all to the same type.
 	[[dataset4.rule]]
 		name = "dynamically sample 200 string responses"
 		[[dataset4.rule.condition]]
 			field = "status_code"
 			operator = "="
 			value = "200"
+			datatype = "int"
 		[dataset4.rule.sampler.EMADynamicSampler]
 			Sampler = "EMADynamicSampler"
 			GoalSampleRate = 15