Intermittent errors creating aws-native:ec2:SubnetRouteTableAssociation #1186

Closed
mjeffryes opened this issue Nov 23, 2023 · 8 comments
Labels
impact/flaky-test A test that is unreliable impact/reliability Something that feels unreliable or flaky kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed


mjeffryes commented Nov 23, 2023

What happened?

Pulumi-cdk has been experiencing flaky test runs (pulumi/pulumi-cdk#96) due to the following error:

 aws-native:ec2:SubnetRouteTableAssociation (VPCPublicSubnet1RouteTableAssociation0B0896DC):
      error: reading resource state: reading resource state: operation error CloudControl: GetResource, https response error StatusCode: 400, RequestID: fca75fd6-63d7-4569-8853-49f8558a8242, ResourceNotFoundException: AWS::EC2::SubnetRouteTableAssociation Handler returned status FAILED: No route tables Found with association rtbassoc-03189a84b8caa818f (HandlerErrorCode: NotFound, RequestToken: 81dddeca-f9da-44a5-9b93-9ff127dfb1c4)

This error is raised here: https://github.com/pulumi/pulumi-aws-native/blob/master/provider/pkg/provider/provider.go#L803 when Cloud Control fails to find the resource it has just created.

We'll need to narrow this down to a repro case to send to the CC API maintainers; we can also look into a workaround in aws-native (e.g., adding our own retries for this particular case).

Example

https://github.com/pulumi/pulumi-cdk/tree/main/examples/alb is our best repro case at the moment.

Output of pulumi about

N/A

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@mjeffryes mjeffryes added needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec labels Nov 23, 2023
@mikhailshilkov mikhailshilkov added impact/reliability Something that feels unreliable or flaky impact/flaky-test A test that is unreliable and removed needs-triage Needs attention from the triage team labels Nov 23, 2023
mjeffryes added a commit to pulumi/pulumi-cdk that referenced this issue Nov 28, 2023
As per pulumi/pulumi-aws-native#1186, we're
seeing intermittent errors in the creation of
`aws-native:ec2:SubnetRouteTableAssociation`. This change enables retries
for the affected tests to reduce the noise until we have an upstream fix.

Fixes #96
@t0yv0 t0yv0 self-assigned this Oct 14, 2024
@mjeffryes mjeffryes added this to the 0.112 milestone Oct 30, 2024

t0yv0 commented Nov 5, 2024

I have been looking at this a little bit; it appears notoriously difficult to reproduce locally. I have boiled down the relevant resource set from the CDK examples to something minimal, as follows:

import * as aws from "@pulumi/aws-native";

const vpc = new aws.ec2.Vpc("my-vpc", {
    cidrBlock: "10.0.0.0/16",
    enableDnsHostnames: true,
    enableDnsSupport: true,
    instanceTenancy: "default",
    tags: [
        {
            key: "Name",
            value: "fargatestack/MyVpc"
        }
    ]
});


const mySubnet = new aws.ec2.Subnet("my-subnet", {
    availabilityZone: "us-west-2a",
    cidrBlock: "10.0.0.0/18",
    mapPublicIpOnLaunch: true,
    tags: [
        {
            key: "aws-cdk:subnet-name",
            value: "Public"
        },
        {
            key: "aws-cdk:subnet-type",
            value: "Public"
        },
        {
            key: "Name",
            value: "fargatestack/MyVpc/PublicSubnet1"
        }
    ],
    vpcId: vpc.id,
});

const myRT = new aws.ec2.RouteTable("my-rt", {
    vpcId: vpc.id,
    tags: [
        {
            key: "Name",
            value: "fargatestack/MyVpc/MyRouteTable"
        }
    ]
})

const myRTA = new aws.ec2.SubnetRouteTableAssociation("my-rta", {
    routeTableId: myRT.id,
    subnetId: mySubnet.id,
})

export const routeTableID = myRTA.id;

Unfortunately, standing this stack up and tearing it down does not reproduce the issue.


t0yv0 commented Nov 5, 2024

Linking some more context in.

Control reaches this code:

return nil, nil, fmt.Errorf("reading resource state: %w", err)

We have been staring at this with @flostadler. The code seems to be written well. What is happening is that the awaiter c.awaiter.WaitForResourceOpCompletion succeeded, and therefore waitErr == nil. The subsequent read of the resource state then fails, so we hit this code:

		// Creation succeeded but reading failed - return the read error.
		return nil, nil, fmt.Errorf("reading resource state: %w", err)

The message is

No route tables Found with association rtbassoc-03189a84b8caa818f.


t0yv0 commented Nov 5, 2024

Something that feels suspect is that the CDK workflow is scheduled with

  schedule:
    - cron: '0 7 * * *'

https://github.com/pulumi/pulumi-cdk/blob/main/.github/workflows/main.yml#L89

It seems possible that this races with the aws-account-cleanup lambda, which is scheduled to run every 12 hours:

https://github.com/pulumi/aws-account-cleanup/blob/master/pkg/cleanvpc/cleanvpc.go#L208

Specifically, there is no code to clean up RTAs, but there is code to clean up RTs.


t0yv0 commented Nov 5, 2024

We could try tagging resources with "ExpiresBy": "2025-01-01" and teach the cleaner to respect this tag and avoid deleting resources until the expiry date is reached. In this case, we could ensure the VPC is tagged, which will bypass premature cleanup by the lambda.


t0yv0 commented Nov 7, 2024

Based on standup: Anton to search CloudTrail by resource name and see what that turns up.


t0yv0 commented Nov 7, 2024

We looked at the CloudControl events, and they seem consistent with an eventual consistency issue.

Adding aws-cloudformation/cloudformation-coverage-roadmap#2178

t0yv0 added a commit that referenced this issue Nov 7, 2024
Introduces a retry for NotFound errors from GetResource executed right
after Create.

Based on the logs from our CI runs we suspect eventual consistency in
AWS may cause some resources to fail with NotFound in Get even after
WaitForResourceOpCompletion succeeded.

Relates: #1186

t0yv0 commented Nov 11, 2024

I expect this to be fixed by #1809, which introduces a retry, but I have not been able to reproduce the issue outside of pulumi-cdk CI to confirm.

I will close this for now; we can reopen as needed.

@t0yv0 t0yv0 added the resolution/fixed This issue was fixed label Nov 11, 2024
@t0yv0 t0yv0 closed this as completed Nov 11, 2024

t0yv0 commented Nov 23, 2024

Looks like what finally solved it is moving tests to another region. @corymhall suspected interference with pulumi-eks tests.
