At the SNS, we are suddenly experiencing issues with R3.14.7 IOCs under vxWorks.
The symptom: Everything on the IOC looks great, but the IOC allows no new CA connections. "casr" shows several existing CA connections, and they in fact continue to work fine.
Additional suspicious fact: A "caget some_pv" will normally either print the current value of the PV or, after (by default) 2 seconds, give the message "connect timed out: 'some_pv' not found".
When used with a PV from one of the affected IOCs, it prints the error message after 2 seconds, but then hangs, for a total of ~30 seconds. And "casr" on the IOC shows a new connection, even though caget never got a value!
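To check many IOCs for this symptom, the timing of caget could be measured from a script. The following is only a rough sketch, not something we run at the SNS: the function names, the 3-second slack and the 60-second hard limit are made up for illustration, and it assumes that a successful caget prints the PV name followed by the value on stdout.

```python
import subprocess
import time


def looks_like_value(pv, stdout_text):
    """Assumption: a successful caget prints '<pv> <value>' on stdout."""
    for line in stdout_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == pv:
            return True
    return False


def classify_probe(got_value, elapsed_s, timeout_s=2.0, slack_s=3.0):
    """Classify one caget attempt.

    'ok'        : a value was printed
    'not_found' : no value, and caget returned soon after its timeout
    'suspect'   : no value, and caget lingered well past its timeout,
                  which is the behavior seen with the affected IOCs
    """
    if got_value:
        return "ok"
    if elapsed_s > timeout_s + slack_s:
        return "suspect"
    return "not_found"


def probe(pv, timeout_s=2.0, hard_limit_s=60.0):
    """Run 'caget -w <timeout> <pv>' and time how long the process lives."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["caget", "-w", str(timeout_s), pv],
            capture_output=True,
            text=True,
            timeout=hard_limit_s,  # hypothetical upper bound, kills a stuck caget
        )
        got_value = looks_like_value(pv, result.stdout)
    except subprocess.TimeoutExpired:
        got_value = False
    elapsed_s = time.monotonic() - start
    return classify_probe(got_value, elapsed_s, timeout_s), elapsed_s
```

Calling probe("some_pv") for one PV per IOC would then flag as "suspect" any IOC where caget gives up but takes far longer than its timeout to return.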
So far, we have not seen any specific error message on the IOC.
Mostly, in fact, no error messages at all.
There are no suspended tasks, the network interface is still in "full duplex" with 0 in/out errors,
dbcar shows that the IOC keeps all its client connections to other IOCs,
and netStatDataPoolShow and endPoolShow have 0 "failed" entries, ...
The only known fix is an IOC reboot.
This started to happen in April. It has affected at least two different types of IOCs: those running low-level RF (LLRF) and those running power-supply applications. It first affected LLRF IOCs that hadn't changed since January, so our guess is that it is caused by changes to our network or to the CA clients rather than to the IOCs themselves.
What also changed at about that time: We have more IOCs, some CA clients already use R184.108.40.206, and some use CAJ, the pure Java CA client.
Maybe completely unrelated: We are pretty sure that CAJ clients cause occasional IOC error messages of the form
CAS: request from someIP:port => "bad resource ID"
These messages appeared on different IOCs at different times.
They coincide with running a CAJ client,
and the IP address matches the host where the CAJ client runs.
They don't happen every time,
and so far the IOCs which gave the "bad resource ID" message
have not been the ones which had the CA server's no-more-connections problem.
I still considered it a good enough reason for an SNS CAJ witch hunt:
If it's causing this error message, maybe CAJ is capable
of causing more confusion inside the CA server?
Confusion that goes unnoticed and then results in the connection issue?
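To see whether the "bad resource ID" messages really line up with the hosts running CAJ, captured IOC console output could be tallied by client address. This is just a sketch: the regular expression is modeled on the message as quoted above and may need adjusting to the exact console format, and the function name is made up.

```python
import re
from collections import Counter

# Assumed console format, based on the message quoted in this post:
#   CAS: request from <ip-or-host>:<port> => "bad resource ID"
BAD_RESOURCE_ID = re.compile(
    r'CAS: request from (?P<host>[^:\s]+):(?P<port>\d+)\s*=>\s*"?bad resource ID"?'
)


def count_bad_resource_ids(lines):
    """Return a Counter mapping client host -> number of matching messages."""
    hosts = Counter()
    for line in lines:
        match = BAD_RESOURCE_ID.search(line)
        if match:
            hosts[match.group("host")] += 1
    return hosts
```

Feeding it the lines of a saved console log, e.g. count_bad_resource_ids(open("ioc_console.log")) with a hypothetical file name, gives a per-host count that can be compared against the list of machines where CAJ clients were running.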
Has anybody else seen anything similar?
|ANJ, 02 Sep 2010||