Cluster split implementation expects cached data to be fully replicated. #61
Comments
If we do choose to change things around, then we might find help in using a different Hazelcast distributed data structure than the Map that we're now using everywhere. Maybe some caches should instead be using a ReplicatedMap? As opposed to Maps, ReplicatedMaps will make sure that each cluster node has the data. There are some obvious trade-offs of course (see the description in the link). We might want to consider using a ReplicatedMap instead of a Map to replace the …
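For illustration, a minimal sketch (assuming Hazelcast 4.x, with made-up cache and key names - not the plugin's actual code) of how the two structures are obtained and how they differ:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.replicatedmap.ReplicatedMap;

public class CacheBackingSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // What the plugin effectively uses today: a partitioned IMap, where each
        // entry lives on one owner member plus backup-count backup members.
        IMap<String, String> partitioned = hz.getMap("example-cache");
        partitioned.put("user@example.org", "some-route");

        // A ReplicatedMap instead keeps a full copy of every entry on every member,
        // at the cost of memory and weaker (asynchronous) consistency.
        ReplicatedMap<String, String> replicated = hz.getReplicatedMap("example-cache-replicated");
        replicated.put("user@example.org", "some-route");

        hz.shutdown();
    }
}
```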
I wonder if the unexpected behavior that I described in the issue description is a result of there being duplicated code in … If that's the case, then this issue should probably be moved to Openfire.
Yes, it does seem that with a three-node cluster, not every node will know about all data, which means that if it drops out of the cluster, problems will arise. A ReplicatedMap does look like the right structure to use (I assume (hah!) it should be a simple swap-out). Re. your comment about duplicated code in …
@GregDThomas any thoughts on using a ReplicatedMap?
I'm generally of the approach "If it ain't broke, don't fix it". If the …
Not using `EntryListener`s probably reduces the amount of code that we have to maintain ourselves (the implementation in the Hazelcast plugin is quite hard to understand), but the listener-based approach does lend itself to maintaining only the state that we're interested in: the Hazelcast plugin only keeps track of JIDs. The caches often contain more data, which would require additional resources to keep things in sync on all nodes. Then again, Hazelcast probably does a more optimized job of that than what we can do ourselves? A concern that I have is that timing might be different. I think that with …

That said: …
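As a rough illustration of that listener-based approach, a sketch only (assuming Hazelcast 4.x; the plugin's actual `ClusterListener`/`S2SCacheListener` code is more involved, and the class and cache names here are hypothetical):

```java
import com.hazelcast.core.EntryEvent;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.map.listener.EntryAddedListener;
import com.hazelcast.map.listener.EntryRemovedListener;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Keeps a purely local set of keys (e.g. JIDs) for a distributed cache, so that this
 * node still knows which entries existed even after it is cut off from the cluster
 * and can no longer fetch remote data through the IMap.
 */
public class LocalKeyTracker implements EntryAddedListener<String, Object>, EntryRemovedListener<String, Object> {

    private final Set<String> localKeys = ConcurrentHashMap.newKeySet();

    @Override
    public void entryAdded(EntryEvent<String, Object> event) {
        localKeys.add(event.getKey());
    }

    @Override
    public void entryRemoved(EntryEvent<String, Object> event) {
        localKeys.remove(event.getKey());
    }

    public Set<String> knownKeys() {
        return localKeys;
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Object> cache = hz.getMap("routing-table-sketch");

        // 'false' = don't ship values with each event; we only track keys.
        LocalKeyTracker tracker = new LocalKeyTracker();
        cache.addEntryListener(tracker, false);

        cache.put("user@example.org/resource", new Object());

        // Note: entry events are delivered asynchronously, so immediately after the
        // put this set may still be empty - which is exactly the timing concern above.
        System.out.println(tracker.knownKeys());

        hz.shutdown();
    }
}
```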
I think we need to recognise that if a node leaves a cluster (split brain or not) there is a risk that some data will be lost. There is always a delay between (say) a user logging in, the local node updating its local data, and that data being replicated across the other nodes. The amount of data lost depends on exactly when the problem occurs. Split-brain is an interesting one. The false-senior node has, and will always have, the wrong view of data from the rest of the cluster. I don't think there's an easy way to recover from that (even with ReplicatedMap - what happens if the data held by the false-senior is inconsistent with the rest of the cluster? Someone has to be wrong). Hence #51, which basically prevents the situation from happening (the 'false-senior' node essentially disconnects all clients, throws away all its data, and waits until it rejoins the cluster).
Split-brain: one headache at a time, please :) Moving 'cleanup' from the Hazelcast plugin to Openfire makes a lot of sense to me. Another concern that I have is that if we explicitly ask for a cache entry (maybe under lock), would that guarantee that the ReplicatedMap is updated with the latest data for that entry, or would the local node get old state?
https://docs.hazelcast.com/imdg/4.2/data-structures/replicated-map.html says that updates are replicated to the other members asynchronously (the map is only weakly consistent), which sounds to me like: if node 1 writes and node 2 reads after node 1 writes, there will be a period of time before node 2 sees what node 1 has written. I suspect IMap is the same, tbh. Sounds like you're pushing clustering harder than it has been before!
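A contrived sketch of that window (two members in one JVM for brevity, default config; the map name is made up). The read on the second member may legitimately return null for a short while after the write on the first:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.replicatedmap.ReplicatedMap;

public class ReplicationLagSketch {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance member1 = Hazelcast.newHazelcastInstance();
        HazelcastInstance member2 = Hazelcast.newHazelcastInstance();

        ReplicatedMap<String, String> mapOnMember1 = member1.getReplicatedMap("presence-sketch");
        ReplicatedMap<String, String> mapOnMember2 = member2.getReplicatedMap("presence-sketch");

        mapOnMember1.put("user@example.org", "available");

        // ReplicatedMap replication is asynchronous: this read may still return null
        // right after the put, and only eventually reflects the write.
        System.out.println("immediately: " + mapOnMember2.get("user@example.org"));

        Thread.sleep(1000);
        System.out.println("after a second: " + mapOnMember2.get("user@example.org"));

        Hazelcast.shutdownAll();
    }
}
```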
Let me try to explain my concern differently: a ReplicatedMap (which does not inherit from `IMap`) does not offer the per-key locking that we currently depend on. …
Ah, ISWYM. Yes, with no lock on a ReplicatedMap, the current lock/do stuff/unlock mechanism would require thinking about. First thought: …
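For reference, roughly what that lock/do stuff/unlock pattern looks like when expressed directly against a Hazelcast IMap (a sketch with made-up names; in practice Openfire goes through its own Cache/lock abstraction rather than calling Hazelcast directly):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class LockPatternSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> routes = hz.getMap("routes-sketch");
        String key = "user@example.org";

        // IMap offers per-key, cluster-wide locks. ReplicatedMap has no equivalent,
        // so a read-modify-write like this would need some other form of coordination.
        routes.lock(key);
        try {
            String current = routes.get(key);
            routes.put(key, current == null ? "node-1" : current + ",node-1");
        } finally {
            routes.unlock(key);
        }

        hz.shutdown();
    }
}
```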
I don't think that there's any guarantee that replication in the cache would have actually happened before the lock has been released (and/or before another thread/node acquires a lock for the same key again).
Oh, good point. Perhaps we're stuck with an IMap that is "manually" fully populated using a listener. Or, perhaps more simply, increase the IMap backup-count from 1 to its maximum value of 6 (assuming a cluster has fewer than 7 nodes).
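For what it's worth, that knob lives in the map configuration. A sketch of the programmatic form (the plugin itself sets this via the backup-count element in hazelcast-cache-config.xml; the cache name here is hypothetical):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class BackupCountSketch {
    public static void main(String[] args) {
        Config config = new Config();

        // Equivalent to <backup-count>6</backup-count> in the XML configuration:
        // every entry is kept on its owner plus up to 6 backup members.
        MapConfig mapConfig = new MapConfig("example-cache");
        mapConfig.setBackupCount(6);
        config.addMapConfig(mapConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        hz.getMap("example-cache").put("key", "value");
        hz.shutdown();
    }
}
```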
I'm not sure that the backup-count does what we expect.
Assuming that a cluster doesn't have more than 6 nodes isn't a hard requirement that I'd like to introduce. Replacing the entire Cache with a solution that fully depends on the EntryListener mechanism feels like applying too much duct tape. We would effectively be building a distributed data structure from Hazelcast events, even though Hazelcast almost explicitly doesn't offer that implementation. I fear that they're not offering that implementation for a good reason, one that we'll find out about the hard way. Also (and more concretely) I fear that very similar synchronization issues could exist. Although I've not verified this, I'd not be surprised if the event listeners are fired outside of any lock that's held.

@Fishbowler is probably completely right with regards to backups. We thought we'd seen improvements by moving to ReplicatedMap in a test setup, but it turns out that we didn't actually change to ReplicatedMap at all (changing the type of the data structure in the configuration doesn't mean that the data structure that is used changes type - the code still instantiates the cache as an `IMap`).

Changing to ReplicatedMap would require implementation changes, which would either require all caches to switch type, or would require additional API for Map-based and ReplicatedMap-based caches to coexist.

In all, most pragmatically, moving forward with what we have now (Hazelcast Maps, combined with selected bits of the cache being stored in a local data structure that's populated with an EntryListener) seems like the best solution to me for now - moving that from this plugin to Openfire, to reduce duplication (#50).
This plugin introduces an implementation of `org.jivesoftware.util.cache.Cache<K,V>` that is backed by a Hazelcast-provided distributed data structure. Much of the clustering-related functionality depends on data in caches of this implementation being shared amongst cluster nodes. For example: one node puts something in the cache, and another node can access that data from a locally instantiated cache with the same name. The data is "synchronized under water" by Hazelcast.

Pretty much all caches, as well as the default, as defined in https://github.com/igniterealtime/openfire-hazelcast-plugin/blob/master/classes/hazelcast-cache-config.xml, use a Hazelcast Map as the data structure that backs our Cache implementation. All of them seem to use a `backup-count` of 1. This means that data added by a node will be replicated to one other node. This is largely hidden from all usages of the Cache, as the data will be available on all nodes, even if the cluster is larger than 2 nodes. Nodes on which the data is accessed, but which don't have it, will obtain it through Hazelcast-provided magic.

Although during normal run-time the data is accessible to all cluster nodes (as described above), there does not seem to be a guarantee that all data is, at all times, readily available on all cluster nodes when the cluster is larger than two nodes. It is probably reasonable to assume that this is almost guaranteed not to be the case in larger clusters, as by default the data will live on one node and be backed up to another. I'm not sure if there are at times more copies, but it's probably safe to assume that Hazelcast will eventually retain data on only two nodes.
Much of the code in Openfire that is executed when a cluster node drops out of the cluster (notably, implementations of `org.jivesoftware.openfire.cluster.ClusterEventListener`) is written with the expectation that all data is available on the local cluster node, for at least a number of (and possibly all) caches. This seems to be an error.

To illustrate, the following was observed (repeatedly) during testing. On a three-node cluster, where a different client/user that is subscribed to the presence of all other users was connected to each node, the senior node was disconnected from the cluster (I'm unsure if seniority is important). It is important to realize that at that point, Hazelcast won't be able to look up cache entries 'online'. Whatever it has on the local node will be what it can deal with - all other data is lost. The expected behavior would be that the client connected to the local node would receive a presence unavailable for its two contacts, by virtue of the routing table being cleaned up, having recognized that those two routes are no longer available. In practice, we did not always see this happen. Often, we'd get presence unavailable for only one contact, instead of both.
We believe that what's going on here is that the disconnected node iterates over (or is otherwise dependent on - see Remark A below) the routing table to send the presence unavailables for the clients on the now unreachable cluster nodes. As there is no guarantee that all data exists on all cluster nodes, this might go wrong.

(I've actually provided a simplified description of the scenario that was used during testing: the test scenario that was actually used involved MUC. The 'offline' presence that is expected would be picked up by the conference service, to be broadcast to all local occupants. I believe that this nuance is not important.)
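To make the assumption concrete, a much-simplified sketch (with stand-in types and hypothetical cache and helper names - not the actual `RoutingTableImpl` code) of the kind of cleanup that implicitly expects the local node to still see every route in the cache:

```java
import java.util.Map;

// Simplified stand-in for Openfire's listener interface, to keep the sketch self-contained.
interface ClusterEventListener {
    void leftCluster(); // invoked on this node when it drops out of the cluster
}

public class RoutingCleanupSketch implements ClusterEventListener {

    // Stand-in for the clustered routing-table cache (full JID -> owning node ID).
    private final Map<String, String> routingCache;
    private final String localNodeId;

    public RoutingCleanupSketch(Map<String, String> routingCache, String localNodeId) {
        this.routingCache = routingCache;
        this.localNodeId = localNodeId;
    }

    @Override
    public void leftCluster() {
        // The flaw being described: after a sudden disconnect, this iteration only sees
        // whatever happens to be stored (or backed up) on the local node, so routes
        // owned by other nodes may silently be missed.
        for (Map.Entry<String, String> route : routingCache.entrySet()) {
            if (!localNodeId.equals(route.getValue())) {
                sendPresenceUnavailable(route.getKey()); // hypothetical helper
            }
        }
    }

    private void sendPresenceUnavailable(String fullJid) {
        System.out.println("presence unavailable for " + fullJid);
    }
}
```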
A confusing characteristic of the implementation is that there seems to be overlap in the implementation of `org.jivesoftware.openfire.plugin.util.cache.ClusterListener` in the Hazelcast plugin (notably its cleanup routines) and the implementation of the `org.jivesoftware.openfire.cluster.ClusterEventListener` interface in various parts of the Openfire code base (such as `RoutingTableImpl`).

The issue described here will probably not be very apparent in a test environment, when the test environment does not consist of at least three cluster nodes. The default backup count of 1 will effectively cause replication "to the entire cluster" if the cluster consists of no more than two nodes.
Remark A: While penning this text, I have started to wonder if the problem described above (not being guaranteed to have cluster data after a sudden disconnect) is what the `lookupJIDList` method in Hazelcast's `ClusterListener` class is trying to work around. That implementation is using a Hazelcast `EntryListener` (as implemented in the `S2SCacheListener` in that same `ClusterListener` definition) to keep track of all JIDs in many caches. If these EntryListeners get fired whenever data is modified on any cluster node, then this could ensure that at least some data modification (it only seems to store JIDs) is guaranteed to be registered on all nodes as soon as it occurs on any node. If that's done synchronously, even better, but I have not checked yet. With this, and having identified the apparent duplication of code in Hazelcast's `ClusterListener` and some Openfire implementations of `ClusterEventListener`, combined with the fact that for some Caches it'd make sense to me to update them 'last' (or at the very least in a predictable order), I wonder if a sensible course of action would be to remove code from Openfire, and have (only) the Hazelcast plugin be responsible for things like maintaining the RoutingTable state.