Update: We have been able to reproduce the issue on a set of wiresharked boxes.
After staring at wireshark captures for a good while, here's our current most probable explanation:
- The caj Java client (BEAUTY archive engine) is happily connected to the IOC.
- The connection is very busy, at more than 40K updates (double+status+timestamp) per second. We see congestion mode (events_off / events_on message pairs), i.e. the client is the bottleneck, while the IOC is still sort of relaxed, sending out beacons like clockwork.
- At some point, the client gets so busy that it stops decoding the IOC's beacons. The reason for this moment of saturation is not clear. Garbage collection?
- When the next beacon gets through (after a few minutes!), the client decides the IOC is unresponsive, and issues a ping (echo request) on the TCP circuit.
- The TCP circuit is so busy that the echo doesn't return within the 5 second timeout period. The client declares the IOC dead (while continuing to receive lots of updates from it).
- We see the client doing name resolution broadcasts, the IOC answers.
- The client - still receiving lots of updates - issues event_add subscription requests for all channels.
- Trying to send the initial value update responses for the new subscriptions, within a very short time the IOC fills its send buffers and blocks.
- At the same time, the client fills its send buffers with event_add messages and blocks.
- Both ends continuously fail to send (send buffers full) and never get around to receiving (so the receive buffers are never emptied).
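
To make the last three steps concrete, here is a small deterministic sketch of that kind of mutual blocking. It is only a toy model under stated assumptions, not caj or IOC code: the class SendFirstPeer, its fields, and the capacity of 4 are all made up for illustration, and each peer is assumed to read only once it has nothing left to send.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical toy model: a peer that insists on finishing its sends before it reads. */
final class SendFirstPeer {
    // Stands in for this peer's send buffer plus the other side's receive buffer.
    private final Deque<Integer> inFlight = new ArrayDeque<>();
    private final int capacity; // assumed total buffer space, in messages
    private int pending;        // messages this peer still wants to send

    SendFirstPeer(int capacity, int pending) {
        this.capacity = capacity;
        this.pending = pending;
    }

    /** One scheduling step; returns true if this peer made progress. */
    boolean step(SendFirstPeer other) {
        if (pending > 0) {
            if (inFlight.size() >= capacity) {
                return false; // send would block: buffers full, and we do not read
            }
            inFlight.add(pending--);
            return true;
        }
        // Only a peer with nothing left to send drains what the other side wrote.
        return other.inFlight.poll() != null;
    }

    /** Runs both peers until neither can progress; true if unsent messages remain. */
    static boolean deadlocks(int capacity, int pendingA, int pendingB) {
        SendFirstPeer a = new SendFirstPeer(capacity, pendingA);
        SendFirstPeer b = new SendFirstPeer(capacity, pendingB);
        boolean progress = true;
        while (progress) {
            // Non-short-circuit '|' so both peers get a turn in every round.
            progress = a.step(b) | b.step(a);
        }
        return a.pending > 0 || b.pending > 0;
    }
}
```

In this model, SendFirstPeer.deadlocks(4, 10, 10) reports a deadlock because both sides fill their buffers and then wait on each other, while SendFirstPeer.deadlocks(4, 3, 10) does not, because one side finishes sending early and starts draining.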
Bottom line: an IOC may be unresponsive but not dead at all. Getting lots of updates should count as a sign of life.
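
As a rough sketch of what "counting updates as a sign of life" could look like on the client side: the class below is hypothetical (CircuitLiveness and its methods are not part of the caj API), and it takes explicit timestamps so the logic can be followed without real sockets. The 5 second value is the echo timeout mentioned above. The circuit is declared unresponsive only if an echo is outstanding and nothing at all has arrived for a full timeout period.

```java
/** Hypothetical liveness tracker for one TCP circuit; not actual caj code. */
final class CircuitLiveness {
    private final long timeoutMillis;
    private long lastTrafficMillis;    // last time ANY message arrived on the circuit
    private long echoSentMillis = -1;  // -1 means no echo request is outstanding

    CircuitLiveness(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastTrafficMillis = nowMillis;
    }

    /** Call for every decoded message, including plain monitor updates. */
    void onTraffic(long nowMillis) {
        lastTrafficMillis = nowMillis;
    }

    /** Call when a beacon anomaly makes the client send an echo request. */
    void onEchoSent(long nowMillis) {
        echoSentMillis = nowMillis;
    }

    /** Call when the echo response arrives; it also counts as traffic. */
    void onEchoReply(long nowMillis) {
        echoSentMillis = -1;
        lastTrafficMillis = nowMillis;
    }

    /** Unresponsive only if the echo timed out AND the circuit has been silent. */
    boolean isUnresponsive(long nowMillis) {
        return echoSentMillis >= 0
                && nowMillis - echoSentMillis >= timeoutMillis
                && nowMillis - lastTrafficMillis >= timeoutMillis;
    }
}
```

With a 5000 ms timeout, an echo sent at t=1000 and an update received at t=5900, isUnresponsive(6000) is false even though the echo itself is overdue. The circuit would only be given up if the updates stop as well.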
Does that sound realistic? We can put up the original captures in an accessible place for download if someone is interested.
Thanks for your help,
~Ralph