Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orchagent exiting due to unsupported SAI_*_ATTR_SELECTIVE_COUNTER_LIST attr #20725

Closed
bofish-arista opened this issue Nov 7, 2024 · 20 comments · Fixed by #21068
Closed

Orchagent exiting due to unsupported SAI_*_ATTR_SELECTIVE_COUNTER_LIST attr #20725

bofish-arista opened this issue Nov 7, 2024 · 20 comments · Fixed by #21068
Assignees
Labels
MSFT Triaged this issue has been triaged

Comments

@bofish-arista
Copy link
Contributor

Description

Sonic-buildimage PR#20540 has incorporated SAI changes which includes latest sairedis, this in turn features changes to SAI, including SAI_PORT_ATTR_SELECTIVE_COUNTER_LIST, which does not appear to be supported yet by Broadcom.

As a result of this change, orchagent is exiting early in startup process.

Relevant links:
#20540
sonic-net/sonic-sairedis#1431
opencomputeproject/SAI#1941

Steps to reproduce the issue:

Issue is seen during inialization of a standalone device.

Describe the results you received:

Orchagent exited with runtime error logged as shown below:

2024 Nov 4 17:44:07.612509 up500 ERR syncd#syncd: :- run: Runtime error: :- discover: when query SAI_PORT_ATTR_SELECTIVE_COUNTER_LIST (on SAI_OBJECT_TYPE_PORT RID oid:0x100000001) got value oid:0x7ffc4a7b45f0 objectTypeQuery returned NULL object type

Describe the results you expected:

Output of show version:

root@up322:~# show version

SONiC Software Version: SONiC.branch.master-ars.7cd2518e-buildimage.origin.master-nightly-slim-2024.10.31.20.12
SONiC OS Version: 12
Distribution: Debian 12.7
Kernel: 6.1.0-22-2-amd64
Build commit: 7f44814d7
Build date: Fri Nov 1 00:59:18 UTC 2024
Built by: jenkins@jenkins-arsonic-k8s-1-vfqtx

Platform: x86_64-arista_7060_cx32s
HwSKU: Arista-7060CX-32S-C32
ASIC: broadcom
ASIC Count: 1
Serial Number: SGD20254417
Model Number: DCS-7060CX-32S
Hardware Revision: 03.00
Uptime: 06:20:23 up 9 min, 2 users, load average: 0.26, 0.98, 0.75
Date: Sat 02 Nov 2024 06:20:23

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@kcudnik
Copy link
Contributor

kcudnik commented Nov 8, 2024

Hi:

f23185d5 (Rajkumar-Marvell         2024-10-07 11:55:21 +0530 2627)     SAI_PORT_ATTR_SELECTIVE_COUNTER_LIST,

that attribute was recently added in 2024/10/7

SAI_PORT_ATTR_SELECTIVE_COUNTER_LIST (on SAI_OBJECT_TYPE_PORT RID oid:0x20100000000) got value oid:0x559ac8beda80 objectTypeQuery returned NULL object type

this attribute is LIST of counters, and it returned some OID value 0x559ac8beda80 on which OT returned NULL, and it should be SAI_OBJECT_TYPE_COUNTER as specified in SAI headers saiport.h, if it's returning NULL, that's a vendor bug

and currently syncd on such error crashes, since we got OID but we don't know what type it is, this is invalid

so what should happened, when we have newer headers in SAI sairedis/syncd and we are using older vendor SAI, all unsupported attributes should return not implemented or not supported, instead of success, but it seams that vendor is returning some invalid value and success on this attribute

error comes from here: https://github.com/sonic-net/sonic-sairedis/blob/master/syncd/SaiDiscovery.cpp#L237

error could be also caused by DASH extensions (if vendor support DASH) since extensions range changed, and that change is not backward compatible: opencomputeproject/SAI#2028

so what i expect is happening, vendor have some custom/private attribute after SAI_PORT_ATTR_END, on older version of SAI headers which have the same enum value as SAI_PORT_ATTR_SELECTIVE_COUNTER_LIST, which causes syncd think that SAI_PORT_ATTR_SELECTIVE_COUNTER_LIST is implemented when actually this is private internal attribute

@bofish-arista can you confirm ?

@tjchadaga
Copy link
Contributor

@JaiOCP - could you please confirm the analysis above?

@tjchadaga
Copy link
Contributor

@rajkumar38 - fyi

@bofish-arista
Copy link
Contributor Author

Hi @kcudnik - your assessment is correct. There are some extensions added by the vendor which are causing the conflict. The vendor implementation currently does not allow SAI to grow without introducting conflicts. We have a ticket opened with them to get this addressed.

@kcudnik
Copy link
Contributor

kcudnik commented Nov 8, 2024

Nice, I have proposal to update syncd and SAI metadata to add version on each attribute so discovery process will not exceed SAI_PORT_ATTR_END and any attr end if vendor is lover version than supported syncd headers, this will solve this issue for the future, I will work on this to address this issue

@lguohan for visibility

@kcudnik
Copy link
Contributor

kcudnik commented Nov 8, 2024

Hi @kcudnik - your assessment is correct. There are some extensions added by the vendor which are causing the conflict. The vendor implementation currently does not allow SAI to grow without introducting conflicts. We have a ticket opened with them to get this addressed.

We introduced custom attribute range for each attribute for vendors to prevent this behavior https://github.com/opencomputeproject/SAI/blob/master/inc/saiport.h#L2635 SAI_PORT_ATTR_CUSTOM_RANGE_START

@kcudnik
Copy link
Contributor

kcudnik commented Nov 19, 2024

I created partial one step to partial solution opencomputeproject/SAI#2100, with this we will have attributes and SAI version per attribute, you can read explanation here: opencomputeproject/SAI#2099

i will work on this to support that on syncd with a cmd enable/flag

@prgeor
Copy link
Contributor

prgeor commented Nov 20, 2024

@ying can you check with @kcudnik how to unblock master with SAI 11.x as 12.x may introduce regression?

@prgeor prgeor added Triaged this issue has been triaged MSFT labels Nov 20, 2024
kcudnik added a commit to kcudnik/sonic-sairedis that referenced this issue Nov 21, 2024
By default disabled. If enabled, it will use SAI metadata and libsai
version returned by sai_api_version_query to verify if given attribute
during SAI discovery process was introduced in the current libsai
version or below that version.

This feature is partial solution to this problem:
sonic-net/sonic-buildimage#20725,
more here: opencomputeproject/SAI#2099
kcudnik added a commit to sonic-net/sonic-sairedis that referenced this issue Nov 22, 2024
By default disabled. If enabled, it will use SAI metadata and libsai
version returned by sai_api_version_query to verify if given attribute
during SAI discovery process was introduced in the current libsai
version or below that version.

This feature is partial solution to this problem:
sonic-net/sonic-buildimage#20725,
more here: opencomputeproject/SAI#2099
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 27, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 28, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 28, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 28, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 28, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 28, 2024
@bradh352
Copy link
Contributor

Anyone tracking this issue, a simple workaround until a proper fix is created is here: bradh352/sonic-sairedis@9f2a0b9

@kcudnik
Copy link
Contributor

kcudnik commented Nov 28, 2024

on sairedis master this is already fixed partially: sonic-net/sonic-sairedis#1470, this is checked in, but needs to be enabled by "-a" parameter in syncd init script

@kcudnik
Copy link
Contributor

kcudnik commented Nov 28, 2024

also entire problem is not that simple, since custom attributes on brcm will be moved from v1.13 and v1.14 to different range

@bradh352
Copy link
Contributor

on sairedis master this is already fixed partially: sonic-net/sonic-sairedis#1470, this is checked in, but needs to be enabled by "-a" parameter in syncd init script

I saw that but I didn't see the commit to SAI that actually tagged each attribute with a version. That said, now that I'm looking harder at it, I see opencomputeproject/SAI#2100 which is actually auto-generating a version header based on git tags.

Out of curiosity why make it opt-in? Are there expected incompatibilities?

@kcudnik
Copy link
Contributor

kcudnik commented Nov 28, 2024

for backward compatibility

bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 29, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 29, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 30, 2024
bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Nov 30, 2024
@bradh352
Copy link
Contributor

bradh352 commented Dec 1, 2024

I can confirm that adding the "-a" does indeed work on my TD3 switches.

bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Dec 1, 2024
As per sonic-net/sonic-buildimage#20725
attribute validation must be enabled otherwise failures will
be observed.

Not sure why this isn't default.
@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2024

I can confirm that adding the "-a" does indeed work on my TD3 switches.

nice !

@kcudnik
Copy link
Contributor

kcudnik commented Dec 1, 2024

can we enable tahat ?

@bradh352
Copy link
Contributor

bradh352 commented Dec 1, 2024

I think it would make sense for it to be enabled by default, that said I don't fully understand the backwards compatibility implications, perhaps there may be specific issues with various closed vendor blobs. In my own branch (which I only use on Broadcom Trident 3) I always enable -a: bradh352/sonic-sairedis@4fe35ae9

However, I don't see why it couldn't be moved into each ASIC-specific section that is validated to be ok:
https://github.com/bradh352/sonic-sairedis/blob/4fe35ae9ecb96f73568e3df318fabfd7bb2c124f/syncd/scripts/syncd_init_common.sh#L476

@kcudnik
Copy link
Contributor

kcudnik commented Dec 2, 2024

can you make PR for sonic-sairedis to enable this feature by default ? i think there are no implications for not enabling that in production, from other hand, if there are new attributes that are not in official release then this needs to be disabled, but currently this feature is only implemented in discovery, user can still make invalid create/set/get of attrbitue that is vendor extention after ATTR_END, but in normal operation this will not happen, since release versions are tested before production

but maybe it would make sense to also implement check on those operations too, in case some old libsai would be used in some weird case

@bofish-arista
Copy link
Contributor Author

I've tried enabling -a flag in syncd init script and it is working for me.

bradh352 added a commit to bradh352/sonic-sairedis that referenced this issue Dec 2, 2024
As per sonic-net/sonic-buildimage#20725
attribute validation must be enabled otherwise failures will
be observed.

Not sure why this isn't default.
@bradh352
Copy link
Contributor

bradh352 commented Dec 2, 2024

can you make PR for sonic-sairedis to enable this feature by default ? i think there are no implications for not enabling that in production, from other hand, if there are new attributes that are not in official release then this needs to be disabled, but currently this feature is only implemented in discovery, user can still make invalid create/set/get of attrbitue that is vendor extention after ATTR_END, but in normal operation this will not happen, since release versions are tested before production

but maybe it would make sense to also implement check on those operations too, in case some old libsai would be used in some weird case

done: sonic-net/sonic-sairedis#1477

github-actions bot pushed a commit to bradh352/sonic-sairedis that referenced this issue Dec 4, 2024
As per sonic-net/sonic-buildimage#20725
attribute validation must be enabled otherwise failures will
be observed.

Not sure why this isn't default.
@yxieca yxieca closed this as completed in 8feb45f Dec 6, 2024
github-actions bot pushed a commit to bradh352/sonic-sairedis that referenced this issue Dec 7, 2024
As per sonic-net/sonic-buildimage#20725
attribute validation must be enabled otherwise failures will
be observed.

Not sure why this isn't default.
github-actions bot pushed a commit to bradh352/sonic-sairedis that referenced this issue Dec 11, 2024
As per sonic-net/sonic-buildimage#20725
attribute validation must be enabled otherwise failures will
be observed.

Not sure why this isn't default.
shiraez pushed a commit to Marvell-switching/sonic-sairedis that referenced this issue Dec 12, 2024
By default disabled. If enabled, it will use SAI metadata and libsai
version returned by sai_api_version_query to verify if given attribute
during SAI discovery process was introduced in the current libsai
version or below that version.

This feature is partial solution to this problem:
sonic-net/sonic-buildimage#20725,
more here: opencomputeproject/SAI#2099
shiraez pushed a commit to Marvell-switching/sonic-sairedis that referenced this issue Dec 12, 2024
By default disabled. If enabled, it will use SAI metadata and libsai
version returned by sai_api_version_query to verify if given attribute
during SAI discovery process was introduced in the current libsai
version or below that version.

This feature is partial solution to this problem:
sonic-net/sonic-buildimage#20725,
more here: opencomputeproject/SAI#2099
github-actions bot pushed a commit to bradh352/sonic-sairedis that referenced this issue Dec 17, 2024
As per sonic-net/sonic-buildimage#20725
attribute validation must be enabled otherwise failures will
be observed.

Not sure why this isn't default.
VladimirKuk pushed a commit to Marvell-switching/sonic-buildimage that referenced this issue Jan 21, 2025
Why I did it
Upgrade SAI xgs version to 11.2.16.2 to fix sonic-net#20725

Work item tracking
Microsoft ADO (number only): 30445459

How I did it
Update xgs SAI version to 11.2.16.2 in the sai.mk file.

How to verify it
load the built image to the DUT, swss starts fine.

Signed-off-by: zitingguo-ms <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MSFT Triaged this issue has been triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants