Intermittent errors creating aws-native:ec2:SubnetRouteTableAssociation #1186

mjeffryes · 2023-11-23T01:05:50Z

What happened?

Pulumi-cdk has been experiencing flaky test runs (pulumi/pulumi-cdk#96) due to the following error:

 aws-native:ec2:SubnetRouteTableAssociation (VPCPublicSubnet1RouteTableAssociation0B0896DC):
      error: reading resource state: reading resource state: operation error CloudControl: GetResource, https response error StatusCode: 400, RequestID: fca75fd6-63d7-4569-8853-49f8558a8242, ResourceNotFoundException: AWS::EC2::SubnetRouteTableAssociation Handler returned status FAILED: No route tables Found with association rtbassoc-03189a84b8caa818f (HandlerErrorCode: NotFound, RequestToken: 81dddeca-f9da-44a5-9b93-9ff127dfb1c4)

This error is encountered here: https://github.com/pulumi/pulumi-aws-native/blob/master/provider/pkg/provider/provider.go#L803 when Cloud Control fails to find the resource it has just created.

We'll need to work this down to a repro case to send to the CC API maintainers; we can also look into a workaround in aws-native (e.g., adding our own retries for this particular case).

Example

https://github.com/pulumi/pulumi-cdk/tree/main/examples/alb is our best repro case at the moment.

Output of `pulumi about`

N/A

Additional context

No response


As per pulumi/pulumi-aws-native#1186, we're seeing intermittent errors in the creation of `aws-native:ec2:SubnetRouteTableAssociation`. This change enables retries for the affected tests to reduce the noise until we have an upstream fix. Fixes #96

t0yv0 · 2024-11-05T21:32:02Z

I have been looking at this a little bit, and it appears notoriously difficult to reproduce locally. I have boiled down the relevant resource set from the CDK examples to something minimal, as follows:

import * as aws from "@pulumi/aws-native";

const vpc = new aws.ec2.Vpc("my-vpc", {
    cidrBlock: "10.0.0.0/16",
    enableDnsHostnames: true,
    enableDnsSupport: true,
    instanceTenancy: "default",
    tags: [
        {
            key: "Name",
            value: "fargatestack/MyVpc"
        }
    ]
});


const mySubnet = new aws.ec2.Subnet("my-subnet", {
    availabilityZone: "us-west-2a",
    cidrBlock: "10.0.0.0/18",
    mapPublicIpOnLaunch: true,
    tags: [
        {
            key: "aws-cdk:subnet-name",
            value: "Public"
        },
        {
            key: "aws-cdk:subnet-type",
            value: "Public"
        },
        {
            key: "Name",
            value: "fargatestack/MyVpc/PublicSubnet1"
        }
    ],
    vpcId: vpc.id,
});

const myRT = new aws.ec2.RouteTable("my-rt", {
    vpcId: vpc.id,
    tags: [
        {
            key: "Name",
            value: "fargatestack/MyVpc/MyRouteTable"
        }
    ]
});

const myRTA = new aws.ec2.SubnetRouteTableAssociation("my-rta", {
    routeTableId: myRT.id,
    subnetId: mySubnet.id,
});

export const routeTableID = myRTA.id;

Unfortunately, standing this stack up and tearing it down does not reproduce the issue.

t0yv0 · 2024-11-05T21:38:18Z

Linking some more context in.

Control reaches this code:

pulumi-aws-native/provider/pkg/client/client.go

Line 92 in c2f11ba

return nil, nil, fmt.Errorf("reading resource state: %w", err)

We have been staring at this with @flostadler. The code looks correct. What happens is that the awaiter, c.awaiter.WaitForResourceOpCompletion, succeeds, so waitErr == nil, and the subsequent read of the resource state then fails. We hit this code:

		// Creation succeeded but reading failed - return the read error.
		return nil, nil, fmt.Errorf("reading resource state: %w", err)

The message is

No route tables Found with association rtbassoc-03189a84b8caa818f.

t0yv0 · 2024-11-05T21:42:40Z

Something that feels suspect is that the CDK workflow is scheduled with

  schedule:
    - cron: '0 7 * * *'

https://github.com/pulumi/pulumi-cdk/blob/main/.github/workflows/main.yml#L89

It seems not entirely impossible that this would race with the aws-account-cleanup lambda, which is scheduled to run every 12 hours:

https://github.com/pulumi/aws-account-cleanup/blob/master/pkg/cleanvpc/cleanvpc.go#L208

Specifically, there is no code to clean up RTAs (route table associations), but there is code to clean up RTs (route tables).

t0yv0 · 2024-11-05T21:44:59Z

We could try tagging resources with "ExpiresBy": "2025-01-01" and teach the cleaner to respect this tag and avoid deleting resources until the expiry date is reached. In this case, we could ensure the VPC is tagged, which will bypass premature cleanup by the lambda.
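The expiry check proposed above could look roughly like the sketch below. It is only an illustration in TypeScript: the real cleaner is a separate Go program, and the helper name `isProtectedByExpiry`, the `Tag` shape, and the handling of unparseable dates are all assumptions rather than anything taken from aws-account-cleanup.

```typescript
// Tag shape matching the { key, value } tags used in the repro program (assumed here).
interface Tag {
    key: string;
    value: string;
}

// Hypothetical helper: returns true when the resource carries an "ExpiresBy" tag
// whose date has not yet been reached, meaning a cleaner should leave it alone.
function isProtectedByExpiry(tags: Tag[], now: Date): boolean {
    for (const tag of tags) {
        if (tag.key !== "ExpiresBy") {
            continue;
        }
        // A date-only ISO string such as "2025-01-01" is parsed as UTC midnight.
        const expiry = Date.parse(tag.value);
        if (isNaN(expiry)) {
            // Assumption: an unparseable value offers no protection.
            return false;
        }
        return now.getTime() < expiry;
    }
    // No ExpiresBy tag: the resource is eligible for cleanup as before.
    return false;
}
```

With a check like this, tagging only the VPC would be enough if the cleaner decides per VPC before touching the route tables inside it; whether it does so is not confirmed here.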

t0yv0 · 2024-11-07T14:41:50Z

Based on standup: Anton to search CloudTrail by resource name and see what that turns up.

t0yv0 · 2024-11-07T16:35:25Z

We looked at the CloudControl events, and they seem consistent with an eventual consistency issue.

Adding aws-cloudformation/cloudformation-coverage-roadmap#2178

Introduces a retry for NotFound errors from GetResource executed right after Create. Based on the logs from our CI runs, we suspect that eventual consistency in AWS may cause some resources to fail with NotFound in Get even after WaitForResourceOpCompletion has succeeded. Relates: #1186
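The idea of retrying a read that fails with NotFound right after a successful create can be sketched as follows. This is a minimal TypeScript illustration, not the provider's Go implementation from #1809: the names `readWithNotFoundRetry`, `isNotFound`, and `RetryOptions`, the `code` field on the error, and the exponential backoff policy are all assumptions made for the example.

```typescript
// Hypothetical retry settings; the values actually used by the provider are not shown in this thread.
interface RetryOptions {
    maxAttempts: number; // total attempts, including the first one
    initialDelayMs: number; // delay before the second attempt, doubled after each retry
    sleep?: (ms: number) => Promise<void>; // injectable so tests can skip real waiting
}

// Assumption: a NotFound failure is modeled as an error object with code "NotFound",
// mirroring the HandlerErrorCode in the log message above.
function isNotFound(err: unknown): boolean {
    return typeof err === "object" && err !== null && (err as { code?: unknown }).code === "NotFound";
}

function defaultSleep(ms: number): Promise<void> {
    return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Calls `read` until it succeeds. Only NotFound errors are retried; any other
// error, or a NotFound on the last allowed attempt, is rethrown unchanged.
async function readWithNotFoundRetry<T>(read: () => Promise<T>, opts: RetryOptions): Promise<T> {
    const sleep = opts.sleep ?? defaultSleep;
    let delay = opts.initialDelayMs;
    for (let attempt = 1; ; attempt++) {
        try {
            return await read();
        } catch (err) {
            if (!isNotFound(err) || attempt >= opts.maxAttempts) {
                throw err;
            }
            await sleep(delay);
            delay *= 2;
        }
    }
}
```

Restricting the retry to NotFound keeps genuine failures (permissions, throttling, malformed requests) visible immediately, and bounding the attempts means a resource that really was deleted still surfaces as an error.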

t0yv0 · 2024-11-11T16:06:48Z

I expect this to be fixed by #1809, which introduces a retry, but I have not been able to reproduce the issue outside of pulumi-cdk CI.

I will close this for now, and we can reopen as needed.

t0yv0 · 2024-11-23T00:15:59Z

Looks like what finally solved it is moving tests to another region. @corymhall suspected interference with pulumi-eks tests.

mjeffryes added the needs-triage and kind/bug labels Nov 23, 2023

mikhailshilkov added the impact/reliability and impact/flaky-test labels and removed the needs-triage label Nov 23, 2023

mjeffryes mentioned this issue Nov 27, 2023

Retry flakey integration tests pulumi/pulumi-cdk#97

Merged

corymhall mentioned this issue Sep 17, 2024

AWS::EC2::SubnetRouteTableAssociation creation is flaky #1714

Closed

corymhall mentioned this issue Oct 11, 2024

Workflow failure: main pulumi/pulumi-cdk#171

Closed

t0yv0 self-assigned this Oct 14, 2024

mjeffryes added this to the 0.112 milestone Oct 30, 2024

t0yv0 mentioned this issue Nov 7, 2024

Retry GetResource NotFound when creating #1809

Merged

t0yv0 added the resolution/fixed This issue was fixed label Nov 11, 2024

t0yv0 closed this as completed Nov 11, 2024
