Add State & Event Versioning #43

Open
chelma opened this issue May 1, 2023 · 17 comments
Labels
Capture Resilience: Work to make traffic capture more resilient to changes in load, configuration, and sources

Comments

@chelma (Collaborator) commented May 1, 2023

Description

This task is to decide on a versioning strategy and implement it. Per convo in PR (#42), @awick said:

Whenever I'm using an event bus I like to at least discuss up front how I'm going to do event versioning for when I discover I need more/less/different parameters in the messages. The two most common solutions are either a version field in every message or changing the name of the event (such as appending _v2, _v3, etc). Then the discussion is whether the initial implementation should have this version marker or not, such as version: 1 or _v1. The general issue is that eventually you'll have either a newer version of the lambda or of the cli, depending on upgrade order.

While this was originally focused on event shapes, it is also applicable to the format of the state currently stored in AWS (both Parameter Store and CloudFormation).

Related Tasks

Acceptance Criteria

  • Our repo is able to gracefully handle changing state versions and event shapes
@chelma added the Capture Resilience label on May 1, 2023
@chelma (Collaborator, Author) commented May 1, 2023

Thinking about this briefly - I'm initially inclined to manage the complexity of multiple versions in-code rather than in-infrastructure. While CDK/CloudFormation makes infrastructure easier, it's still not precisely easy. Having multiple, similar copies of our AWS Resources (EventBridge Rules, Lambda Functions, etc.) brings up issues like resource naming collisions, longer deployment times, more opportunity to hit account limits, and more opportunity for transient AWS issues to break the deployment of a given resource. I feel more confident about our ability to have the Lambda code handle this gracefully than doing it at the AWS-Resource level.

@awick (Contributor) commented May 1, 2023

So does this mean having a version in the message instead of different event names? Sorry if I'm misunderstanding.

@chelma (Collaborator, Author) commented May 1, 2023

My thinking lines up more with having the version in the message rather than different event names, as embedding it in the message means that (I suspect) we'll have fewer versioned AWS Resources to deal with.
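
For illustration, a minimal sketch of what a version field in the event detail could look like, with a single event name and version-aware dispatch in the Lambda handler. The event shape, field names, and handler functions here are hypothetical, not the repo's actual schema:

```python
# Sketch only: the detail fields and handlers below are hypothetical.
def handle_create_eni_mirror_v1(detail: dict) -> None:
    print(f"v1 handler: mirroring ENI {detail['eni_id']}")

def handle_create_eni_mirror_v2(detail: dict) -> None:
    # Hypothetical v2 shape that adds a 'traffic_filter_id' field.
    print(f"v2 handler: mirroring ENI {detail['eni_id']} with filter {detail['traffic_filter_id']}")

# One event name; the 'version' field inside the detail selects the handler.
HANDLERS = {
    1: handle_create_eni_mirror_v1,
    2: handle_create_eni_mirror_v2,
}

def lambda_handler(event, context):
    detail = event["detail"]
    version = detail.get("version", 1)  # treat unversioned events as v1
    try:
        handler = HANDLERS[version]
    except KeyError:
        raise ValueError(f"Unsupported event version: {version}")
    handler(detail)
```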

@chelma changed the title from "Add Event Versioning" to "Add State & Event Versioning" on Jun 27, 2023
@chelma (Collaborator, Author) commented Jun 27, 2023

Thinking about this a bit more - this is about more than just event shapes changing. In fact, I think that's probably the easier part of the problem - if we have automated scans to bring the mirror infrastructure up to date (see: #36), then it's fine if we lose events during a transition because they'll be backfilled in a minute or two. If we have the automated scans we might even say it's fine to lose some events during a transition and not bother versioning the events/event handlers themselves.

In my mind, the bigger issue is the state we're storing in Parameter Store and its link to the CloudFormation stack templates. For that, it seems like we'll probably use some combination of versioning in our code and "transitional commits" that users can "pass through" by running a (hopefully) idempotent update of their existing resources (add-vpc, create-cluster) to stage the stuff for the next version without making breaking changes.

Updated the issue description to encapsulate the larger problem.
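
As a hedged sketch of what in-code state versioning could look like: a version field on the stored state blob plus forward migrations applied on read. The field names and the v1-to-v2 change below are made up for illustration:

```python
# Sketch only: the state fields and the v1 -> v2 change are made up for illustration.
import json

CURRENT_STATE_VERSION = 2

def migrate_v1_to_v2(state: dict) -> dict:
    # Hypothetical change: v2 renames 'os_domain' to 'opensearch_domain_name'.
    state["opensearch_domain_name"] = state.pop("os_domain")
    state["version"] = 2
    return state

# Map from a version to the migration that upgrades it to the next version.
MIGRATIONS = {1: migrate_v1_to_v2}

def load_state(raw_json: str) -> dict:
    """Parse a stored state blob and upgrade it, one step at a time, to the current version."""
    state = json.loads(raw_json)
    version = state.get("version", 1)  # treat unversioned state as v1
    while version < CURRENT_STATE_VERSION:
        state = MIGRATIONS[version](state)
        version = state["version"]
    return state
```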

@chelma (Collaborator, Author) commented Jun 27, 2023

Actually, it seems like if we're willing to do at least one "transitional commit" at some point in the future, we can solve this problem when it actually becomes a problem, rather than tackling it preemptively. I'm not currently against a preemptive approach, just pointing out an additional option.

@chelma (Collaborator, Author) commented Jun 27, 2023

Thinking ahead a bit - I think this task (#65) to encapsulate our CDK context in compound objects is effectively a pre-req for this, as that encapsulation will allow us to more easily version individual bundles of state/context.

@chelma (Collaborator, Author) commented Jun 27, 2023

Thinking ahead even more - we know we have a scaling bottleneck with how we store our state in AWS Systems Manager Parameter Store. If we move to a more "serious" storage solution, how does that affect our approach to state versioning?

The free (standard) tier of Parameter Store, which we're currently using, can store 10k items in a given region; after that, you need to upgrade to the paid advanced tier, which can store 100k items [1]. Currently, we store ~10 items per cluster plus ~1 item per ENI, which means we could capture traffic for ~10k instances in a given region on the standard tier. Flipping the advanced bit gives us ~100k instances. However, for large numbers of items, the advanced tier can get quite pricey - $0.05/item-month in us-east-2 means $5k/month if you max it out [2].

Given our use-case of just having these items sit around most of the time without being used, and the small amount of data involved, there are much cheaper options. If we want to keep the same data format of loosely-structured JSON, then we could do something like AWS DocumentDB for an order of magnitude less, but we'd be paying for instances just sitting around most of the time [3].

Another option would be DynamoDB, just using it as a simple K/V store (e.g. dumping our JSON as a string into a single column). It seems extremely unlikely we'd ever exceed the 400 kB size limit for a single item [4], and we're already using a K/V store for our state so the transition seems easy. While complex, it *seems* the on-demand pricing [5] will give us the flexibility we need for our use-case (occasional bursts of large numbers of writes/reads, nothing most of the time, relatively small amount of data overall).

[1] https://docs.aws.amazon.com/general/latest/gr/ssm.html
[2] https://aws.amazon.com/systems-manager/pricing/
[3] https://aws.amazon.com/documentdb/pricing/
[4] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html
[5] https://aws.amazon.com/dynamodb/pricing/on-demand/
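
A minimal sketch of that K/V usage with boto3; the table name and attribute names (ArkimeState, StateKey, Json) are placeholders, not anything that exists today:

```python
# Sketch only: the table name and attribute names are placeholders.
import json

import boto3

table = boto3.resource("dynamodb").Table("ArkimeState")

def put_state(key: str, state: dict) -> None:
    # Dump the loosely-structured JSON as a string into a single attribute,
    # mirroring how we use Parameter Store today.
    table.put_item(Item={"StateKey": key, "Json": json.dumps(state)})

def get_state(key: str) -> dict:
    item = table.get_item(Key={"StateKey": key}).get("Item")
    return json.loads(item["Json"]) if item else {}
```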

@chelma (Collaborator, Author) commented Jun 27, 2023

Actually, DDB seems like a clear winner here. We can keep the JSON format, keep the K/V paradigm, and the pricing seems VERY reasonable for our use-case [1]. In us-east-2:

  • $1.25 per million write request units of 1 kB
  • $0.25 per million read request units of 4 kB
  • The first 25 GB stored per month is free (we'll never exceed that)
  • $0.20 per GB-month for continuous backup

Most of our cost would come from serving the continuous scans of the User VPCs for changes in infrastructure, but there are ways to optimize that; a rough estimate is sketched below.

[1] https://aws.amazon.com/dynamodb/pricing/on-demand/
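
A back-of-the-envelope sketch using the on-demand prices above; the ENI count, scan cadence, and write churn are pure assumptions for illustration:

```python
# Back-of-the-envelope only: the ENI count, scan cadence, and write churn are assumptions.
WRITE_PRICE = 1.25 / 1_000_000  # USD per 1 kB write request unit (us-east-2, on-demand)
READ_PRICE = 0.25 / 1_000_000   # USD per 4 kB read request unit

enis = 10_000                   # assume ~10k monitored ENIs
scans_per_month = 30 * 24 * 60  # assume one scan of the User VPCs per minute
reads = enis * scans_per_month  # worst case: one read per ENI per scan, no caching
writes = enis * 10              # assume ~10 writes per ENI per month from churn

print(f"reads:  ${reads * READ_PRICE:,.2f}/month")   # ~$108/month
print(f"writes: ${writes * WRITE_PRICE:,.2f}/month") # ~$0.13/month
```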

@awick (Contributor) commented Jun 27, 2023

Would it make sense to separate the configuration vs state items we have in parameter store, and maybe either leave the configuration items in there or look at something like AWS App Config?

@chelma (Collaborator, Author) commented Jun 27, 2023

Would it make sense to separate the configuration vs state items we have in parameter store, and maybe either leave the configuration items in there or look at something like AWS App Config?

Good question. When I talk about state, I'm referring to bits of data that are required for orchestration/enabling the parts of the solution to communicate with each other across time. Some parts of that state are also what I'd consider configuration, which I guess would be things used specifically at runtime of the capture/viewer processes, etc. An example would be the DNS Name of the OpenSearch Domain. We need to store it in a location that the orchestration bits of the code can access during different control plane operations, but eventually it's turned into a bit of configuration embedded in the Capture/Viewer Docker containers to enable them to do their thing.

We'll probably want to "master" all our state in a real storage solution like DDB, and then create projections of it consumed by something like AWS App Config or stuck in config files placed in S3.

I'm not aware of an argument to retain Parameter Store as a part of our solution other than that it requires some work to move off of. If we're going to move off of it, and I think we definitely will want to do so, then it seems better to move sooner rather than later in order to create less user pain in a migration.

@awick (Contributor) commented Jun 27, 2023

Good question. When I talk about state, I'm referring to bits of data that are required for orchestration/enabling the parts of the solution to communicate with each other across time. Some parts of that state are also what I'd consider configuration, which I guess would be things used specifically at runtime of the capture/viewer processes, etc. An example would be the DNS Name of the OpenSearch Domain. We need to store it in a location that the orchestration bits of the code can access during different control plane operations, but eventually it's turned into a bit of configuration embedded in the Capture/Viewer Docker containers to enable them to do their thing.

So I think we have similar definitions then. Configuration = stuff required by capture/viewer/OS setup/etc that the user can change or needs to directly influence, state = everything else

We'll probably want to "master" all our state in a real storage solution like DDB, and then create projections of it consumed by something like AWS App Config or stuck in config files placed in S3.

Ah App Config can't be the source of truth?

I'm not aware of an argument to retain Parameter Store as a part of our solution other than that it requires some work to move off of. If we're going to move off of it, and I think we definitely will want to do so beyond just the scaling issues, then it seems better to move sooner rather than later in order to create less user pain in a migration.

agree

@chelma (Collaborator, Author) commented Jun 27, 2023

Ah App Config can't be the source of truth?

I think there's a difference between "can" and "should" in this instance. I'd say that all state should be mastered in a real storage solution. If there is configuration that is NOT state, then it's fine for it to be mastered in AWS AppConfig. An example would be: we have items A and B in our state and use them to compute item C, which is configuration. It's fine to me if the only place C lives is in AWS AppConfig. In other words - keep a single source of base truth, but projections of it can live elsewhere as needed.

Given the nature of our application, I'm not sure it's possible for some configuration item D to exist that isn't either also state in DDB or derived from some state in DDB. If we find such a case, I'm OK with having a discussion at that point.

@awick (Contributor) commented Jun 27, 2023

Given the nature of our application, I'm not sure it's possible for some configuration item D to exist that isn't either also state in DDB or derived from some state in DDB. If we find such a case, I'm OK with having a discussion at that point.

I guess it depends on where you want to keep things that are Arkime-only config, like for example Arkime Rules or the OIDC configuration.

App Config seemed like it had already done the work of publishing changes, but maybe I misunderstand what it does.

I don't think we should have 2 sources of truth for items, or have to keep them in sync. I'm just worried we are reimplementing parts of App Config, but maybe that is easy to do with ddb. My main concern is Arkime configuration and keeping the viewer/capture processes updated; if that's easy to do with ddb, having everything there is good.

@chelma (Collaborator, Author) commented Jun 27, 2023

I think we're on the same page, just focusing on different parts of the overall problem. I'm not proposing we create our own publication solution just so we can master things solely in DDB.

Maybe a heuristic we can use is: "if something other than the capture/viewer container would ever need to pull the data, then it's state that should live in DDB". For the specific scenario of Arkime configuration, quite a bit of that is already state (such as the OpenSearch Domain, the ARN of the Secrets Manager Secret storing its password, etc.) that will be in DDB. I'm guessing we'll need to pull the previous OIDC configuration during CLI operations.

Do we want the CLI to read from both DDB and AppConfig in order to compute the next iteration of the configuration we store in AppConfig, then write that to AppConfig so it's available for the containers? To my mind, that seems less preferable than just storing everything in DDB, pulling everything from there, computing the new AppConfig version, then writing to AppConfig. The containers will just be pulling from AppConfig either way, but I think it makes the component responsibilities clearer (AppConfig is always downstream of DDB), and I'm not too worried about syncing since the data flow would always be in one direction. Another benefit would be that a single process would never need to read from multiple places to do its job. I think it also makes it easier for an operator/maintainer to understand where to look for stuff.

I guess another way to phrase it is, I don't necessarily see data stored in AppConfig being a separate "copy" to be "synchronized" so much as a re-projection from DDB.
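
As a hedged sketch of that one-directional flow (DDB as the source of truth, AppConfig as a projection); the table name, attribute names, and config fields are placeholders:

```python
# Sketch only: the table, attribute names, and config fields are placeholders.
import json

import boto3

ddb_table = boto3.resource("dynamodb").Table("ArkimeState")
appconfig = boto3.client("appconfig")

def project_config(app_id: str, profile_id: str) -> None:
    """Read the authoritative state from DDB and publish a derived config to AppConfig."""
    items = ddb_table.scan()["Items"]
    state = {item["StateKey"]: json.loads(item["Json"]) for item in items}

    # Compute the runtime configuration the capture/viewer containers need.
    # Data flows one way only: DDB (source of truth) -> AppConfig (projection).
    config = {
        "opensearchDomain": state["cluster"]["opensearch_domain_name"],
        "opensearchSecretArn": state["cluster"]["opensearch_secret_arn"],
    }

    appconfig.create_hosted_configuration_version(
        ApplicationId=app_id,
        ConfigurationProfileId=profile_id,
        Content=json.dumps(config).encode("utf-8"),
        ContentType="application/json",
    )
```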

@awick (Contributor) commented Jun 27, 2023

The question I ask myself is: if I hit control-c at the wrong time (or something else bad happens), will ddb and appconfig have different values?

I'm fine with everything in ddb, as long as it's easy to fetch the values in the capture/viewer instances also.

@chelma (Collaborator, Author) commented Jun 27, 2023

I think things will become clearer once I start looking more closely at AppConfig in the context of solving the runtime/dynamic configuration problem as part of the OIDC work.

@chelma (Collaborator, Author) commented Jun 27, 2023

Two obvious ways to handle different versions of the same entity in a data store are:

  • (1) having separate versioned entries
  • (2) having the entry store all versions of the entity as sub-items.

Parameter Store imposes a 4 kB limit on standard tier items and an 8 kB limit on advanced tier items. Our current largest entry is ~700 characters, which, at 1-4 bytes per character (depending on encoding), puts us perilously close to the max for standard tier items. This means if we want (2), then we'll want to move to DynamoDB first (it has a 400 kB item-size limit).

The benefit of storing all versions in the same entry is that you get them all without needing to specifically know to look for them. However, I think that's a benefit primarily in the case of Parameter Store. With DDB, you can include the version as a sort key [1] so that it's easy to get all versioned copies of the same entity even though they're separate entries. This operation can be efficiently performed with the Query API call [2].

Otherwise, (1) seems like the better option. With separate entries, you don't have to worry about things like two differently-versioned processes trying to write to the same data entry.

[1] https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.PrimaryKey
[2] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb/client/query.html
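
A minimal sketch of option (1) with boto3, assuming a partition key for the entity and a numeric sort key for the version; the table name and key names are placeholders:

```python
# Sketch only: the table name and key schema ('EntityId', 'Version') are assumptions.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ArkimeState")

def put_versioned(entity_id: str, version: int, body: str) -> None:
    # Partition key = entity id, sort key = version of that entity's shape.
    table.put_item(Item={"EntityId": entity_id, "Version": version, "Json": body})

def get_all_versions(entity_id: str) -> list:
    # A single Query call returns every versioned copy of the entity.
    resp = table.query(KeyConditionExpression=Key("EntityId").eq(entity_id))
    return resp["Items"]
```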
