Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(send_queue): Persist failed to send errors #4137

Merged
merged 10 commits into from
Oct 23, 2024

Conversation

BillCarsonFr
Copy link
Member

Modify the SendQueue in order to persist the error that cause the event to fail to send as a QueueWedgeError. The QueueWedgeError is not a 1:1 mapping for all kinds of errors, but holds variant and information that the client can react to in order to propose "quick fixes"/solution before retrying to send.

Fixes #3973
Also fixes element-hq/element-x-ios#3287 because when a timeline reset occurs the fail to send reason is also lost.

This PR starts with a refactoring commit e769600 to introduce the new QueueWedgedError and move the logic that was in the ffi layer to convert api errors to SendState error variant. This QueueWedgedError can be directly use in the SendingFailed variant and expose to ffi.

Second commit 109c133 adds the persistence, QueuedEvent now have an optional error field instead of a is_weged boolean. Same for LocalEchoContent::Event.
Adds also Migration for sqlite and indexeddb

  • Public API changes documented in changelogs (optional)

Signed-off-by:

@BillCarsonFr BillCarsonFr force-pushed the valere/send_queue_persist_wedge_reason branch 4 times, most recently from 390f170 to 4e43dc2 Compare October 16, 2024 13:01
Changelog:  We now persist the error that caused an event to fail to send. The error `QueueWedgeError` contains info that client can use to try to resolve the problem when the error is not automatically retriable.

Some breaking changes in the ffi layer for `timeline::EventSendState`, `SendingFailed` now directly contains the wedge reason enum. Use it in place of the removed variant of EventSendState
@BillCarsonFr BillCarsonFr force-pushed the valere/send_queue_persist_wedge_reason branch from 4e43dc2 to a5e5c02 Compare October 16, 2024 13:13
Copy link

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 8 lines in your changes missing coverage. Please review.

Project coverage is 84.56%. Comparing base (2eca727) to head (5b911da).
Report is 74 commits behind head on main.

Files with missing lines Patch % Lines
crates/matrix-sdk/src/send_queue.rs 63.15% 7 Missing ⚠️
crates/matrix-sdk-sqlite/src/state_store.rs 95.45% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4137      +/-   ##
==========================================
- Coverage   84.68%   84.56%   -0.12%     
==========================================
  Files         269      269              
  Lines       28818    28845      +27     
==========================================
- Hits        24405    24394      -11     
- Misses       4413     4451      +38     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@BillCarsonFr BillCarsonFr marked this pull request as ready for review October 16, 2024 13:42
@BillCarsonFr BillCarsonFr requested a review from a team as a code owner October 16, 2024 13:42
@BillCarsonFr BillCarsonFr requested review from jmartinesp and removed request for a team October 16, 2024 13:42
Copy link
Contributor

@jmartinesp jmartinesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, but some clarification about the serialization/deserialization of the BLOB in the DB when there are new/removed enum cases would be good.

Also, since this is related to the send queue, maybe a 2nd pair of eyes from @bnjbvr wouldn't hurt either.

crates/matrix-sdk-ui/src/timeline/controller/mod.rs Outdated Show resolved Hide resolved
Comment on lines 1157 to 1160
/// If the event couldn't be sent because of an API error, it's marked as
/// wedged, and won't ever be peeked for sending. The only option is to
/// remove it.
pub is_wedged: bool,
pub error: Option<QueueWedgeError>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update this doc comment, since an identical one was added below to is_wedged()?

-- New send queue events, now persists the type of error causing it to be wedged
ALTER TABLE "send_queue_events"
-- Used as a value, thus encrypted/decrypted
ADD COLUMN "wedge_reason" BLOB;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how this works in Rust, would this still work if QueueWedgeError adds/removes some case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the value is serialized (I suspect to JSON bytes) before being put into the table, and it has to be a blob because of the encryption case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no need to annotate this to mark it nullable? and that the default value is null when not set? Maybe it's the default with sqlite, and I don't know, but might be worth indicating in the comment then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments. And yes the default in NULLABLE and NULL.

@bnjbvr bnjbvr self-requested a review October 16, 2024 14:21
Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job overall! I have issues with the first commit which loses us a bit of static typing and checkability (because of the use of stringly typed values and errors), so I'd like to take another look, otherwise the code looks great and most of my comments should be minor. Thanks!

crates/matrix-sdk-ui/src/timeline/controller/mod.rs Outdated Show resolved Hide resolved
MissingLocalEchoFailError,
))),
error: QueueWedgeError::GenericApiError {
msg: MISSING_LOCAL_ECHO_FAIL_ERROR.into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for &str to String conversions, we prefer .to_owned().

/// the current encryption setting prohibit sharing with them.
InsecureDevices {
/// The insecure devices as a Map of userID to deviceID.
user_device_map: HashMap<String, Vec<String>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside the FFI layer, we're not using stringly-typing, so this would become a HashMap<OwnedUserId, Vec<OwnedDeviceId>>, and similarly below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see 3e65753

/// can be taken before retry sending.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[cfg_attr(feature = "uniffi", derive(uniffi::Enum))]
pub enum QueueWedgeError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is related to the send queue, can we put it in the send queue file? Or is there a reason I'm missing to put it into deserialized_responses?

Also, should this use our common pattern of deriving thiserror::Error, to make it clear it's an error type, and would avoid the manual display impl? (uniffi also has an uniffi::Error derive which could be convenient here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Send Queue is in matrix-sdk and this error is persisted in the state storage matrix-sdk-base.
So it cannot be in the send_queue.rs file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, can we put those in matrix-sdk-base next to the QueueEvent type, then? (or in a new file send_queue.rs somewhere in that crate)

/// the current encryption setting prohibit sharing when it happens.
IdentityViolations {
/// The users that are expected to be verified but are not.
users: Vec<String>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec<OwnedUserId>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see here #4137 (comment)

-- New send queue events, now persists the type of error causing it to be wedged
ALTER TABLE "send_queue_events"
-- Used as a value, thus encrypted/decrypted
ADD COLUMN "wedge_reason" BLOB;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no need to annotate this to mark it nullable? and that the default value is null when not set? Maybe it's the default with sqlite, and I don't know, but might be worth indicating in the comment then?

crates/matrix-sdk-sqlite/src/state_store.rs Show resolved Hide resolved
crates/matrix-sdk-sqlite/src/state_store.rs Outdated Show resolved Hide resolved
crates/matrix-sdk/src/send_queue.rs Outdated Show resolved Hide resolved
@@ -1171,8 +1175,8 @@ pub enum LocalEchoContent {
/// A handle to manipulate the sending of the associated event.
send_handle: SendHandle,
/// Whether trying to send this local echo failed in the past with an
/// unrecoverable error (see [`SendQueueRoomError::is_recoverable`]).
is_wedged: bool,
/// error (see [`SendQueueRoomError::is_recoverable`]).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep unrecoverable here. It has a precise meaning in the context of this file, i.e. retrying would lead to the same error again and again (vs network errors that may be transient).

@BillCarsonFr
Copy link
Member Author

BillCarsonFr commented Oct 17, 2024

Good job overall! I have issues with the first commit which loses us a bit of static typing and checkability (because of the use of stringly typed values and errors), so I'd like to take another look, otherwise the code looks great and most of my comments should be minor. Thanks!

Yeah, I started by doing this refactoring because persisting all the different kind of errors required to make a lot of things serializable, and at the end this is not used at all by the application layer (that was doing a mapping of the error to usable SendingFailed types).
What the app was using was more the thing that we are really interesting into persisting.
It's not really the error that the app layer really need, I think of it as a suggestion a compiler could give you when compilation fails. Lik "Re-verify this user or whitelist the problematic device" then retry.

I could have done without the stringly typed values and just put a GenericUnrecoverableError, the only suggestion is to delete the message. The reason why doesn't matter much as it cannot be presented to the user, it will be an english only technical string.

@bnjbvr
Copy link
Member

bnjbvr commented Oct 17, 2024

Good point about needing to ser/de all the error types. In that case, can we keep the String for the generic unrecoverable error case, and use proper types for the other error variants?

@BillCarsonFr
Copy link
Member Author

Good point about needing to ser/de all the error types. In that case, can we keep the String for the generic unrecoverable error case, and use proper types for the other error variants?

Hello @bnjbvr, What do you think about the following changes:
So we are looking at all the errors that could happen, and we are doing 2 things:

  • Signalling the errors via RoomSendQueueUpdate and SendQueueRoomError and telling if is recoverable or not.
  • Persisting the errors that are unrecoverable in the queue.

So my idea is that we keep the global sdk Error type for signalling. But for unrecoverable errors that we want to persist we use the new QueueWedgeError (that will stringify some types not really useful).

See this commit for this change: e824fd2

As requested I also put back the ruma OwnedUserId/DeviveId inside the QueueWedgeError variant instead of raw string. I did that initially so that I can directly create ffi bindings. But I see the point, and now I separated 2 versions of the error with a string one for bindings.

3e65753

@bnjbvr bnjbvr self-requested a review October 21, 2024 08:11
@bnjbvr
Copy link
Member

bnjbvr commented Oct 21, 2024

What do you think about the following changes:

Sounds great, thanks, I'll take another look!

Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, this is on the right track. I might have misunderstood the intent to store the most generic error type when signalling the error to callers, and it seems we can achieve this; can you try it out please? :)

/// can be taken before retry sending.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[cfg_attr(feature = "uniffi", derive(uniffi::Enum))]
pub enum QueueWedgeError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, can we put those in matrix-sdk-base next to the QueueEvent type, then? (or in a new file send_queue.rs somewhere in that crate)

crates/matrix-sdk-common/src/deserialized_responses.rs Outdated Show resolved Hide resolved
crates/matrix-sdk-common/src/deserialized_responses.rs Outdated Show resolved Hide resolved
pub enum QueueWedgeError {
/// This error occurs when there are some insecure devices in the room, and
/// the current encryption setting prohibit sharing with them.
#[error("There are insecure devices in the room")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this error message, you won't see the user_device_map in the display impl. Saying here in case it's not intended; to display them, you'd need to add something like : {user_device_map:?} to the string above, and it will include the debug display of the btreemap.

crates/matrix-sdk-common/src/deserialized_responses.rs Outdated Show resolved Hide resolved
crates/matrix-sdk-common/src/deserialized_responses.rs Outdated Show resolved Hide resolved
@@ -577,6 +589,28 @@ impl RoomSendQueue {
}
}

impl From<&crate::Error> for QueueWedgeError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: using a From impl makes it part of the public API of QueueWedgeError, which we probably don't want to do, since it's an internal conversion. Can you use a plain function instead, please?

Also, take the crate::Error by ownership (i.e. no ref &), since this could avoid some cloning if I'm reading the code correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, take the crate::Error by ownership (i.e. no ref &), since this could avoid some cloning if I'm reading the code correctly.

Retracting this part: the error is used by ownership thereafter (by Arcing it). I'd still like to not use From to avoid adding to the public API of QueueWedgeError please.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in the timeline controller to update the in memory local event EventSendState. Why is it a issue that it is a public API?

crates/matrix-sdk/src/send_queue.rs Show resolved Hide resolved
Comment on lines 281 to 287
let msg = assert_matches!(
error,
QueueWedgeError::GenericApiError { msg } => msg
);
assert_eq!(
msg,
"the server returned an error: [413 / M_TOO_LARGE] Sounds like you have a lot to say!"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(with respect to the previous comment, this would mean undoing this, since we could still pattern-match the right client API error here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this one see #4137 (comment)
EventSendState is using QueueWedgeError and not the full crate::Error. Notice that for this particular error we propbably will want to add a proper variant for QueueWedgeError to have a proper media local echo feedback

crates/matrix-sdk-ui/src/timeline/tests/echo.rs Outdated Show resolved Hide resolved
@BillCarsonFr BillCarsonFr requested a review from bnjbvr October 22, 2024 06:53
@bnjbvr
Copy link
Member

bnjbvr commented Oct 22, 2024

So as to help landing this PR, I've pushed a small commit that keeps the whole matrix_sdk::Error for the timeline, so that we don't lose any information that we shouldn't have to lose. At the FFI layer it's always converted into a QueueWedgeError, which makes it easy to pattern-match on the other side of the fence, so it's functionally equivalent. With this patch, I don't have more objections to landing this, but before giving my final approval, I'd like to make sure there's no changes required on our side to address the complement crypto bindings failure.

Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Approving assuming no dramatic changes for complement crypto)

Thanks for the PR, notably good job getting through all the layers!

@bnjbvr bnjbvr merged commit 31e9600 into main Oct 23, 2024
40 checks passed
@bnjbvr bnjbvr deleted the valere/send_queue_persist_wedge_reason branch October 23, 2024 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Messages bluntly refuse to send in a room with an unverified device present. send_queue: persist send errors
3 participants