[portmgrd] prevent runtime exception (crash) in setting MTU on portchannel member #3432

Open
wants to merge 3 commits into
base: master

Conversation

bradh352
Contributor

@bradh352 bradh352 commented Dec 19, 2024

What I did
Do not attempt to set the MTU directly on PortChannel members as it will likely fail. The MTU gets inherited as part of the PortChannel.

Why I did it

portmgrd exits, which takes the swss container down with it.

2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)
  • NOTE: 9100 is the default MTU; it is not specified in my PORT config. Specifying it explicitly still causes the crash.
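
The log lines above suggest a chain of events: a shell command fails, the failure surfaces as a runtime error, and the daemon exits with status 255. Below is a minimal sketch of that chain, not the actual portmgrd code. `CommandRunner`, `setMtuOrThrow` and `daemonMainSketch` are hypothetical names, and the runner is injected so that nothing touches a real interface. Only the command string shape is taken from the log.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a shell runner: returns the command's exit code.
using CommandRunner = std::function<int(const std::string &)>;

// Build the command string in the same shape as the one shown in the log.
std::string buildMtuCommand(const std::string &alias, const std::string &mtu)
{
    return "/sbin/ip link set dev \"" + alias + "\" mtu \"" + mtu + "\"";
}

// Run the command and turn a non-zero exit code into an exception.
void setMtuOrThrow(const std::string &alias, const std::string &mtu,
                   const CommandRunner &run)
{
    const std::string cmd = buildMtuCommand(alias, mtu);
    if (run(cmd) != 0)
    {
        throw std::runtime_error(cmd + " : ");
    }
}

// Model of a daemon entry point (assumption): any exception ends the process.
// Returning -1 from main() is reported as exit status 255 on Linux.
int daemonMainSketch(const CommandRunner &run)
{
    try
    {
        setMtuOrThrow("Ethernet0", "9100", run);
    }
    catch (const std::exception &)
    {
        // A real daemon would presumably log the error before exiting.
        return -1;
    }
    return 0;
}
```

Passing a runner that returns a non-zero code makes `daemonMainSketch` return -1, which is the same shape as the supervisord "exit status 255" line.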

How I verified it

Apply patch and verify this config no longer causes crash on Dell S5248F (Broadcom Trident3).

Tested on 202411 and master.

{
    "VLAN": {
        "Vlan2": {
            "mtu": "9100",
            "vlanid": "2"
        }
    },
    "PORT": {
        "Ethernet0": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/1",
            "autoneg": "off",
            "description": "PortChannel1 mgmt",
            "fec": "rs",
            "index": "1",
            "lanes": "49",
            "speed": "25000"
        },
        "Ethernet1": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/2",
            "autoneg": "off",
            "description": "PortChannel1 mgmt",
            "fec": "rs",
            "index": "2",
            "lanes": "50",
            "speed": "25000"
        },
        "Ethernet2": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/3",
            "autoneg": "off",
            "description": "PortChannel2 loop",
            "fec": "rs",
            "index": "3",
            "lanes": "51",
            "speed": "25000"
        },
        "Ethernet3": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/4",
            "autoneg": "off",
            "description": "PortChannel2 loop",
            "fec": "rs",
            "index": "4",
            "lanes": "52",
            "speed": "25000"
        }
    },
    "PORTCHANNEL": {
        "PortChannel0001": {
            "admin_status": "up",
            "description": "management interface",
            "lacp_key": "auto",
            "min_links": "1",
            "mode": "routed",
            "mtu": "1500"
        },
        "PortChannel0002": {
            "admin_status": "up",
            "description": "redundant access port for management interface loop",
            "lacp_key": "auto",
            "min_links": "1",
            "mode": "access",
            "mtu": "9100"
        }
    },
    "PORTCHANNEL_INTERFACE": {
        "PortChannel0001": {
            "ipv6_use_link_local_only": "enable",
            "mac_addr": "02:d3:ab:fe:fd:c4"
        },
        "PortChannel0001|10.0.0.11/24": {}
    },
    "PORTCHANNEL_MEMBER": {
        "PortChannel0001|Ethernet0": {},
        "PortChannel0001|Ethernet1": {},
        "PortChannel0002|Ethernet2": {},
        "PortChannel0002|Ethernet3": {}
    }
}
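
For reference, the PORTCHANNEL_MEMBER keys in this config encode membership as `<portchannel>|<port>`. Here is a rough sketch of how membership could be derived from those keys. `splitMemberKey` and `isPortChannelMember` are hypothetical helpers written for illustration; they are not functions from sonic-swss.

```cpp
#include <string>
#include <vector>

// Split a key such as "PortChannel0001|Ethernet0" into its two halves.
// Returns false if the key is not of the form "<portchannel>|<port>".
bool splitMemberKey(const std::string &key,
                    std::string &portchannel, std::string &port)
{
    const std::string::size_type sep = key.find('|');
    if (sep == std::string::npos || sep == 0 || sep + 1 >= key.size())
    {
        return false;
    }
    portchannel = key.substr(0, sep);
    port = key.substr(sep + 1);
    return true;
}

// True if `port` appears as the member half of any key.
bool isPortChannelMember(const std::vector<std::string> &memberKeys,
                         const std::string &port)
{
    for (const std::string &key : memberKeys)
    {
        std::string lag;
        std::string member;
        if (splitMemberKey(key, lag, member) && member == port)
        {
            return true;
        }
    }
    return false;
}
```

With the four keys from the config above, Ethernet0 through Ethernet3 would be reported as members and any other port would not.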

Details if related
Signed-off-by: Brad House (@bradh352)

@bradh352 bradh352 requested a review from prsunny as a code owner December 19, 2024 00:37
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@bradh352 bradh352 changed the title portmgrd: prevent runtime failure in setting MTU on portchannel member [portmgrd] prevent runtime exception (crash) in setting MTU on portchannel member Dec 19, 2024
@bradh352
Contributor Author

@prsunny please review

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 2, 2025
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
@prsunny prsunny requested review from dgsudharsan and prgeor January 6, 2025 18:36
github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 7, 2025
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
Collaborator

@dgsudharsan dgsudharsan left a comment

Please add UT to cover this scenario.

cfgmgr/portmgr.cpp Outdated Show resolved Hide resolved
@dgsudharsan
Collaborator

@bradh352 We do have this check at CLI level https://github.com/sonic-net/sonic-utilities/blob/80d469886f120bfe9bc60024f608c039dce06646/config/main.py#L4948

Why do we need such checks at multiple places? @prsunny what are your thoughts on this?

@bradh352
Contributor Author

bradh352 commented Jan 7, 2025

@bradh352 We do have this check at CLI level https://github.com/sonic-net/sonic-utilities/blob/80d469886f120bfe9bc60024f608c039dce06646/config/main.py#L4948

Why do we need such checks at multiple places? @prsunny what are your thoughts on this?

People using tools like Ansible don't use the CLI to set configuration. They modify /etc/sonic/config_db.json directly, and nothing there prevents this. Also, as you can see from the /etc/sonic/config_db.json example I provided, no MTU is specified at all in the PORT configuration. It's being auto-populated somewhere as a default. I didn't try to track that down.

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from 277e0ae to 201d5b1 Compare January 7, 2025 02:32
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@bradh352 bradh352 force-pushed the bradh352/portchannel-crash branch from 84669f9 to d4b4b98 Compare January 7, 2025 02:55
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@bradh352
Contributor Author

bradh352 commented Jan 7, 2025

Please add UT to cover this scenario.

I committed one; no idea if it's right.

@bradh352 bradh352 requested a review from dgsudharsan January 7, 2025 09:45
@bradh352
Contributor Author

bradh352 commented Jan 7, 2025

Coverage looks good. Any other comments?

@prsunny
Collaborator

prsunny commented Jan 7, 2025

Agree with @dgsudharsan, this is checked at the CLI and YANG level. I don't think it's really needed in swss.

@bradh352
Contributor Author

bradh352 commented Jan 7, 2025

@prsunny

Agree with @dgsudharsan, this is checked at the CLI and YANG level. I don't think it's really needed in swss.

Your evaluation is not correct.

First, it's not occurring at the YANG level, as per:

https://github.com/sonic-net/sonic-buildimage/blob/d28b734ce545762cd7b49d110575eec8549458cf/src/sonic-yang-models/yang-models/sonic-port.yang#L162-L166

There are no restrictions on MTU being set for a port.

But that's beside the point. My configuration that triggered this behavior, as can be seen in the PR summary, did not have an MTU set at all on the port:

{
    "PORT": {
        "Ethernet0": {
            "admin_status": "up",
            "alias": "twentyfiveGigE1/1/1",
            "autoneg": "off",
            "description": "PortChannel1 mgmt",
            "fec": "rs",
            "index": "1",
            "lanes": "49",
            "speed": "25000"
        },
       ...
}

That means it's getting inherited somewhere (I didn't look where), triggering a call to /sbin/ip link set dev "Ethernet0" mtu "9100"

@bradh352
Contributor Author

bradh352 commented Jan 7, 2025

Ok, found where it's being set by default to 9100, a hardcoded code path:

/* If this is the first time we set port settings
 * assign default admin status and mtu
 */
if (!configured)
{
    admin_status = DEFAULT_ADMIN_STATUS_STR;
    mtu = DEFAULT_MTU_STR;
    m_portList.insert(alias);
}

Which is right above:

if (!mtu.empty())
{
    setPortMtu(alias, mtu);
    SWSS_LOG_NOTICE("Configure %s MTU to %s", alias.c_str(), mtu.c_str());
}

Which is where it crashes due to the exception caused by setPortMtu().

The macro for the default value is here:

#define DEFAULT_MTU_STR "9100"

There's no way that any CLI or Yang verification can prevent a hardcoded default in the code path.
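
To make the flow concrete, here is a simplified model of the default-then-set sequence with a membership guard in front of the MTU call. This is only a sketch under assumptions: `PortMgrSketch`, `lagMembers`, `mtuCalls` and `configurePort` are hypothetical names, the MTU call is recorded rather than executed, and the guard shown is one possible shape of the fix, not necessarily what the patch does. `DEFAULT_MTU_STR` and the role of `m_portList` come from the snippets above.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

#define DEFAULT_MTU_STR "9100"  // value quoted in the thread

struct PortMgrSketch
{
    std::set<std::string> portList;    // plays the role of m_portList
    std::set<std::string> lagMembers;  // hypothetical: known PortChannel members
    // Records (alias, mtu) pairs instead of running "ip link set".
    std::vector<std::pair<std::string, std::string> > mtuCalls;

    // configuredMtu is what the PORT entry specified; empty means unspecified.
    void configurePort(const std::string &alias, const std::string &configuredMtu)
    {
        std::string mtu;
        const bool configured = portList.count(alias) != 0;
        if (!configured)
        {
            // First time: the hardcoded default applies even with no MTU in config.
            mtu = DEFAULT_MTU_STR;
            portList.insert(alias);
        }
        if (!configuredMtu.empty())
        {
            mtu = configuredMtu;
        }
        if (mtu.empty())
        {
            return;
        }
        if (lagMembers.count(alias) != 0)
        {
            // Guard: skip members, which inherit the MTU from the PortChannel.
            return;
        }
        mtuCalls.emplace_back(alias, mtu);
    }
};
```

In this model a member port with no MTU in its PORT entry never reaches the MTU call, while a non-member port still receives the 9100 default on first configuration.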

@prsunny
Collaborator

prsunny commented Jan 8, 2025

Ok, found where it's being set by default to 9100, a hardcoded code path:

/* If this is the first time we set port settings
 * assign default admin status and mtu
 */
if (!configured)
{
    admin_status = DEFAULT_ADMIN_STATUS_STR;
    mtu = DEFAULT_MTU_STR;
    m_portList.insert(alias);
}

Which is right above:

if (!mtu.empty())
{
    setPortMtu(alias, mtu);
    SWSS_LOG_NOTICE("Configure %s MTU to %s", alias.c_str(), mtu.c_str());
}

Which is where it crashes due to the exception caused by setPortMtu().

The macro for the default value is here:

#define DEFAULT_MTU_STR "9100"

There's no way that any CLI or Yang verification can prevent a hardcoded default in the code path.

Not sure I understand. This code path is legacy and we don't see an issue so far. So what exactly is triggering this?

@bradh352
Contributor Author

bradh352 commented Jan 8, 2025

Ok, found where it's being set by default to 9100, a hardcoded code path:

/* If this is the first time we set port settings
 * assign default admin status and mtu
 */
if (!configured)
{
    admin_status = DEFAULT_ADMIN_STATUS_STR;
    mtu = DEFAULT_MTU_STR;
    m_portList.insert(alias);
}

Which is right above:

if (!mtu.empty())
{
    setPortMtu(alias, mtu);
    SWSS_LOG_NOTICE("Configure %s MTU to %s", alias.c_str(), mtu.c_str());
}

Which is where it crashes due to the exception caused by setPortMtu().
The macro for the default value is here:

#define DEFAULT_MTU_STR "9100"

There's no way that any CLI or Yang verification can prevent a hardcoded default in the code path.

Not sure I understand. This code path is legacy and we don't see an issue so far. So what exactly is triggering this?

Well, I can 100% confirm without this patch, trying to bring up a portchannel interface with the config in the summary on a Dell S5248F (Broadcom Trident 3) fails due to the aforementioned error of:

2024 Dec 17 19:26:20.964259 sw1 INFO swss#supervisord: portmgrd RTNETLINK answers: Operation not permitted
2024 Dec 17 19:26:20.965353 sw1 ERR swss#portmgrd: :- main: Runtime error: /sbin/ip link set dev "Ethernet0" mtu "9100" : 
2024 Dec 17 19:26:20.967020 sw1 INFO swss#supervisord 2024-12-17 19:26:20,966 WARN exited: portmgrd (exit status 255; not expected)

Once portmgrd exits, it's all over: the switch is down, on both 202411 and master. This is observed on a fresh boot with the configuration. I did observe that when adding the ports to a portchannel on an already-running switch, the error does not occur, presumably because of some order-of-operations difference between bringing up a portchannel at boot and configuring it at runtime.

I haven't tried portchannel on any other releases. Since you aren't aware of any issue, I'm assuming this is a recent issue where the 'ip link' command now returns a failure, whereas previously it may have been a silent no-op, or maybe there was a change in the order in which portchannels and ports are brought up during boot. I can't say.

Either way this PR corrects the issue.

github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 9, 2025
…annel member (PR sonic-net#3432)

Do not attempt to set the MTU directly on PortChannel members as it will
likely fail.  The MTU gets inherited as part of the PortChannel.

Signed-off-by: Brad House (@bradh352)
@bradh352
Contributor Author

bradh352 commented Jan 9, 2025

@prsunny Is my explanation sufficient to get this merged, or do you want any changes?

4 participants