Experimental Physics and
Industrial Control System

<[email protected]> · Thu, 1 Nov 2012 11:42:16 +0000

I asked the question quoted below in the middle of the summer, but as the problem is very intermittent (and when it does happen I don't always get my hand on the offending machine), it's been a while since I've been able to follow up.

In brief, the problem is this: I have a vxWorks IOC on an overloaded ethernet connection (it's also running a proprietary and ill-behaved MR1394 driver with a locally written Asyn EPICS driver), and occasionally the connection between the IOC and our gateway gets into a state where the number of PVs reported on the connection grows continually, and contains numerous repeats.  Mysteriously, running `casr 2` on the IOC seems to clear the problem, at least temporarily!

I've been able to take a closer look today as the problem has recurred.  One very interesting detail is that there appear to be *two* connections from the gateway, one that seems well behaved, the other not so:

	TCP 172.23.194.201:60697(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 5 Channels, Priority=0
	TCP 172.23.194.201:60698(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 9641 Channels, Priority=0

Of the 9641 channels reported here, there are only 25 unique different PVs, each duplicated hundreds of times.  Note the two gateway connections with adjacent port numbers!  (I'll pick up on this below.)

From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Johnson
> On 2012-06-11 [email protected] wrote:
> > 1. The server is vxWorks 5.5.1 and EPICS 3.14.11

> > 3. The ethernet connection is horribly horribly overloaded (100MBit link),
> >  EPICS would like to deliver far more image frames than the network will
> >  permit

> > 5. At some random point during operation the number of IOC connections
> >  ($(IOC):CA:CNX as reported by vxStats) starts climbing steadily.
> 
> What is the CPU load on the IOC at this time?  Does the CPU have any
> idle time at all?  Can you run Spy to find out which threads are getting
> any work done?

Surprisingly, vxStats reports only 27% CPU:LOAD.  I ran a few rounds of spy with the following reports (I've deleted all the 0 tick entries from each report to keep the transcript manageable):

	SR01C-DI-IOC-02 -> spy
	value = 469523296 = 0x1bfc5b60
	SR01C-DI-IOC-02 -> 
	NAME          ENTRY         TID   PRI   total % (ticks)  delta % (ticks)
	--------     --------      -----  ---   ---------------  ---------------
	t1394iidc               1ec566a0   45    10% (      50)   10% (      50)
	tNetTask     netTask    1ee314b0   50     2% (      14)    2% (      14)
	CAS-event               1cbf1e50   50     3% (      16)    3% (      16)
	CAC-UDP                 1ed55740  147     0% (       1)    0% (       1)
	CAS-client              1de27e40  179     1% (       5)    1% (       5)
	CAS-event               1d9eb9f0  180     4% (      23)    4% (      23)
	CAC-repeater            1e9cb080  189     0% (       2)    0% (       2)
	KERNEL                                    0% (       1)    0% (       1)
	IDLE                                     77% (     387)   77% (     387)
	TOTAL                                    97% (     499)   97% (     499)

	1 task was created.

	NAME          ENTRY         TID   PRI   total % (ticks)  delta % (ticks)
	--------     --------      -----  ---   ---------------  ---------------
	tSpyTask     spyComTask 1bfc5b60    5     0% (      12)    1% (      12)
	t1394ohci    ohciTask   1ec804b0   40     0% (       1)    0% (       1)
	t1394iidc               1ec566a0   45     9% (     163)    9% (     113)
	tNetTask     netTask    1ee314b0   50     3% (      55)    3% (      41)
	CAS-event               1cbf1e50   50     3% (      50)    2% (      34)
	cbLow                   1eaaabf0  140     0% (       4)    0% (       4)
	CAC-UDP                 1ed55740  147     0% (       2)    0% (       1)
	timerQueue              1ed69090  148     0% (       1)    0% (       1)
	CAS-client              1de27e40  179     0% (      14)    0% (       9)
	CAS-event               1d9eb9f0   50     4% (      71)    4% (      48)
	CAS-UDP                 1ed07a00  183     0% (       3)    0% (       3)
	CAC-repeater            1e9cb080  189     0% (       5)    0% (       3)
	cpuUsageTask            1ea9ae50  255    31% (     510)   44% (     510)
	KERNEL                                    1% (      27)    2% (      26)
	IDLE                                     44% (     723)   29% (     336)
	TOTAL                                    95% (    1641)   94% (    1142)

	NAME          ENTRY         TID   PRI   total % (ticks)  delta % (ticks)
	--------     --------      -----  ---   ---------------  ---------------
	tSpyTask     spyComTask 1bfc5b60    5     0% (      20)    0% (       8)
	t1394ohci    ohciTask   1ec804b0   40     0% (       5)    0% (       4)
	t1394iidc               1ec566a0   45     9% (     276)    9% (     113)
	tNetTask     netTask    1ee314b0   50     3% (      93)    3% (      38)
	CAS-event               1cbf1e50   50     3% (      90)    3% (      40)
	cbLow                   1eaaabf0  140     0% (       7)    0% (       3)
	CAC-UDP                 1ed55740  147     0% (       3)    0% (       1)
	timerQueue              1ed69090  148     0% (       2)    0% (       1)
	save_restore            1ed02390  179     0% (       1)    0% (       1)
	CAS-client              1de27e40  179     1% (      29)    1% (      15)
	CAS-event               1d9eb9f0   50     4% (     112)    3% (      41)
	CAS-UDP                 1ed07a00  183     0% (       6)    0% (       3)
	CAC-repeater            1e9cb080  189     0% (       8)    0% (       3)
	KERNEL                                    1% (      43)    1% (      16)
	IDLE                                     56% (    1576)   74% (     853)
	TOTAL                                    95% (    2781)   94% (    1140)

Nothing catches my eye.

> Also, are you running a name-server for the gateway, or is it relying
> on UDP broadcasts for name resolution?

No, I don't believe we're using an EPICS name server anywhere on our network.

> > 7. Running `casr 2` takes forever (well, several minutes), the connection
> >  is over a 9600 baud serial line.  Of the 8697 channels, the same PV is
> >  reported over and over and over and over again (it's a simple camera
> >  STATUS provided by the asyn driver).
> >
> > 8. After `casr 2` has completed, the bogus channels have gone away:
> 
> > Very odd.  Any thoughts?
> 
> Your 'casr 2' is being run at a much higher priority than the other Channel
> Access activities.  I suspect your CPU is too busy handling higher priority
> operations to spend any time actually running the low-level CA threads that
> communicate with client.
> 
> UDP name lookups normally happen in the lowest priority CA thread, so the IOC
> /should/ stop accepting new client connections if the CPU gets overloaded
> because the name resolver task never gets a chance to run.  However if the
> gateway uses a name-server instead of UDP broadcasts this natural throttling
> will not be happening and I could see your symptoms occurring as a
> result.
> 
> Does this help explain what might be going on?

I think the problem was only between the IOC and the gateway, and in particular other clients on the same network can see the IOC just fine.  I guess this means that the name resolver task is running normally?

From: Hill, Jeff [mailto:[email protected]]

(I've elided thoughts on CPU starvation, don't think that applies.)

> Furthermore, if the IP kernel is low on buffers then activities in the server
> might also be stalled. It's even possible that TCP circuit shutdown activities
> might stall (i.e. socket close might block) when a network buffer isn't available
> depending on the IP kernel's implementation.

Buffer starvation in the vxWorks kernel?  Perish the thought!  Yes, that's a possibility, though we have to take into account the fact that other CA activity was proceeding normally.

> > 6. Running casr reports that practically all of these connections are to
> > the gateway, for example:
> 
> Starting around R3.14.6 I made some changes in the CA client library so that it
> will _not_ disconnect an unresponsive circuit and start a new one in such (heavily
> congested) situations. Instead it disconnects the application but does not
> disconnect the circuit, and simply waits for TCP to recover the circuit using
> its mature built-in capabilities for dealing with congestion.

Ahhh.  Now *that* is an interesting thought.  Again, recall that we're dealing with vxWorks (5.5.1), and talk of "mature" capabilities strikes me as a trifle optimistic in this context.

> However, if for any reason the CA GW, on its own volition, were to destroy the
> channel and create a new one (when the channel was unresponsive) then this would
> circumvent the protections mentioned in the previous paragraph. I didn't write
> the GW, but have looked inside of it, and I don't recall that it does this. I do
> seem to recall that if a channel isn't used for some time in the gateway then it
> will be destroyed and later recreated when a client (of the GW) needs it again, but
> the timeouts are probably long enough that they are not coming into play in your
> situation? Another possibility might be that this gateway was somehow using
> a pre R3.14.6 ca client library.

This makes a lot of sense.  It's worth noting, I think, that the gateway seems to be having real difficulties in communicating with this IOC, and only a fraction of its PVs are visible.  That the gateway might be continually trying to reconnect strikes me as very plausible.

> > 8. After `casr 2` has completed, the bogus channels have gone away:
> 
> As I recall, the casr doesn't perform any cleanup activities. I don't
> claim to have a clear explanation for this behavior. One possible guess
> is that casr, running at higher priority than the vxWorks shell,
> temporarily disrupts the scan tasks from adding more labor to the
> event queue for the server to complete, and thereby allows network
> buffer starvation in the IP kernel to clear out, and maybe this allows
> the server threads to finish their cleanup activities. Maybe a stretch, HTA.

A couple of details are puzzling me here.

1. What on earth is the gateway doing running *two* TCP connections to the IOC?  They have adjacent port numbers (60697, 60698), which suggests they were allocated by the same task and close together in time.  The first connection (60697) appears to be running normally, and now (after "clearing" the problem with a couple of rounds of `casr 2`) all new connections through the gateway go through this port.  The second connection (60698) looks wedged and is the one that grew.

2. I have a number of connections on the IOC for which `casr 2` reports large numbers of "Unprocessed request bytes".  Below I attach a transcript with the stable and health connections elided.  This was taken after successfully clearing everything down with a previous run of `casr 2`.

	SR01C-DI-IOC-02 -> casr 2
	Channel Access Server V4.11
	Connected circuits:
	TCP 172.23.194.201:60697(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 55 Channels, Priority=0
	        Task Id=0x1e05f020, Socket FD=11
	        Secs since last send   0.95, Secs since last receive 213.35
	        Unprocessed request bytes=0, Undelivered response bytes=0
	        State=up jumbo-send-buf
	        12024 bytes allocated

	TCP 172.23.194.201:60698(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 21 Channels, Priority=0
	        Task Id=0x1cbeec70, Socket FD=40
	        Secs since last send   0.10, Secs since last receive 2250.70
	        Unprocessed request bytes=786472, Undelivered response bytes=0
	        State=up jumbo-send-buf
	        5472 bytes allocated

	TCP 172.23.196.3:51554(cs01r-cs-pc-03.pri.diamond.ac.uk): User="ops-ycs63881", V4.11, 5 Channels, Priority=0
	        Task Id=0x1cf0d600, Socket FD=31
	        Secs since last send 771508.40, Secs since last receive 771665.92
	        Unprocessed request bytes=196648, Undelivered response bytes=0
	        State=up jumbo-send-buf
	        1440 bytes allocated

	UDP Server:
	UDP 172.23.194.10:43801(): User="", V4.11, 0 Channels, Priority=0
	        Task Id=0x1ed07a00, Socket FD=17
	        Secs since last send 285.30, Secs since last receive   0.20
	        Unprocessed request bytes=16, Undelivered response bytes=0
	        State=up
	        180 bytes allocated

	There are currently 17213444 bytes on the server's free list
	18 client(s), 9609 channel(s), 437 event(s) (monitors) 0 putNotify(s)
	28 small buffers (16384 bytes ea), and 8 jumbo buffers (2000024 bytes ea)
	The server's resource id conversion table:
	Bucket entries in use = 119 bytes in use = 18308
	Bucket entries/hash id - mean = 0.029053 std dev = 0.167954 max = 1
	The server's array size limit is 2000024 bytes max
	Channel Access Address List
	172.23.207.255:5065

> The best way to find out what _is_ occurring would be to log into that
> IOC and use the "tt <task id>" vxWorks target shell command to determine
> where the CA server's TCP threads happen to be loitering at.

Well, I've had a quick poke around with `tt`, but I think we'll need a better idea of where we're looking.  There are a lot of tasks...

	SR01C-DI-IOC-02 -> i

	  NAME        ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY
	---------- ------------ -------- --- ---------- -------- -------- ------- -----
	tExcTask   excTask      1effcbd0   0 PEND         23385c 1effcab0   3006b     0
	tLogTask   logTask      1effa000   0 PEND         23385c 1eff9ef0       0     0
	tShell     shell        1e0476b0   1 READY        22d188 1e047290   c0002     0
	tRlogind   rlogind      1eda4f40   2 PEND         22782c 1eda4ae0       0     0
	tWdbTask   wdbTask      1ed97f90   3 PEND         22782c 1ed97e60       0     0
	tillRdb    11823c       1e042670  10 PEND         22782c 1e042210  3d0002     0
	t1394ohci  ohciTask     1ec804b0  40 PEND         22782c 1ec803b0       0     0
	t1394stack 1e7f3c44     1ec7e2e0  45 PEND         228058 1ec7e1f0       0     0
	t1394stackR1e7f3c44     1ec7bab0  45 PEND         228058 1ec7b9c0       0     0
	t1394iidc  1e7f3c44     1ec566a0  45 PEND         228058 1ec565b0       0     0
	tAioIoTask1aioIoTask    1efeb120  50 PEND         228058 1efeb020       0     0
	tAioIoTask0aioIoTask    1efe3e70  50 PEND         228058 1efe3d70       0     0
	tNetTask   netTask      1ee314b0  50 PEND         22782c 1ee313c0       0     0
	CAS-event  1e94ef80     1cbf1e50  50 PEND+I       22782c 1cbf1a60       0     0
	CAS-event  1e94ef80     1cf1f7f0  50 PEND+I       22782c 1cf1f400       0     0
	tAioWait   aioWaitTask  1eff23d0  51 PEND         22782c 1eff2260  3d0002     0
	tPortmapd  portmapd     1eda0970  54 PEND         22782c 1eda0710  3d0002     0
	tTelnetd   telnetd      1eda2b60  55 PEND         22782c 1eda29f0       0     0
	NTPTimeSync1e94ef80     1e9cf920 109 PEND+T       22782c 1e9cf7a0  3d0002  3009
	ClockTimeSy1e94ef80     1e9cd3c0 109 PEND+T       22782c 1e9cd240  3d0004  2999
	cbHigh     1e94ef80     1ea30cb0 128 PEND         22782c 1ea30b90       0     0
	timerQueue 1e94ef80     1eab2d00 129 PEND         22782c 1eab27c0  3d0004     0
	scanOnce   1e94ef80     1ed49c70 129 PEND         22782c 1ed49b50       0     0
	scan0.1    1e94ef80     1ed1e680 133 PEND+T       22782c 1ed1e530  3d0004     2
	scan0.2    1e94ef80     1ed23f70 134 PEND+T       22782c 1ed23e20  3d0004     1
	cbMedium   1e94ef80     1eaa3480 135 PEND         22782c 1eaa3360       0     0
	scan0.5    1e94ef80     1ed29860 135 PEND+T       22782c 1ed29710  3d0004     4
	scan1      1e94ef80     1ed2f150 136 PEND+T       22782c 1ed2f000       0    57
	scan2      1e94ef80     1ed34a40 137 PEND+T       22782c 1ed348f0  3d0004    97
	scan5      1e94ef80     1ed3a330 138 PEND+T       22782c 1ed3a1e0  3d0004   140
	timerQueue 1e94ef80     1ecaaed0 139 PEND         22782c 1ecaa990       0     0
	timerQueue 1e94ef80     1eca7840 139 PEND         22782c 1eca7300       0     0
	scan10     1e94ef80     1ed3fc20 139 PEND+T       22782c 1ed3fad0  3d0004    87
	cbLow      1e94ef80     1eaaabf0 140 PEND         22782c 1eaaaad0       0     0
	asCaTask   1e94ef80     1ed13ed0 142 PEND         22782c 1ed13db0       0     0
	CAC-UDP    1e94ef80     1ed55740 147 PEND         22782c 1ed551b0       0     0
	timerQueue 1e94ef80     1ed69090 148 PEND+T       22782c 1ed68b50  3d0004     1
	CAC-event  1e94ef80     1e7aa7a0 148 PEND         22782c 1e7aa690       0     0
	CAC-TCP-sen1e94ef80     1e16ae70 148 PEND         22782c 1e16a860       0     0
	timerQueue 1e94ef80     1eca3b00 149 PEND+T       22782c 1eca35c0  3d0004    22
	dbCaLink   1e94ef80     1ea2b4a0 149 PEND         22782c 1ea2b380       0     0
	poolPoll   1e94ef80     1ed04900 149 DELAY        22caac 1ed04820       0    30
	CAM1       1e94ef80     1e071bc0 149 PEND+T       22782c 1e071a30  3d0004    37
	CAM2       1e94ef80     1e06e520 149 PEND+T       22782c 1e06e390  3d0004    32
	CAM3       1e94ef80     1e06b420 149 PEND+T       22782c 1e06b290  3d0004    27
	CAM4       1e94ef80     1e068320 149 PEND+T       22782c 1e068190  3d0004    22
	CAM5       1e94ef80     1e065220 149 PEND+T       22782c 1e065090  3d0004    17
	CAC-TCP-rec1e94ef80     1e787ba0 150 PEND         22782c 1e7872e0       0     0
	CAS-client 1e94ef80     1d98cf90 175 PEND         22782c 1d98ccc0       0     0
	CAS-client 1e94ef80     1cf07df0 175 PEND         22782c 1cf07b20       0     0
	CAS-event  1e94ef80     1d96c240 176 PEND         22782c 1d96c130       0     0
	CAS-event  1e94ef80     1cef0020 176 PEND         22782c 1ceeff10       0     0
	CAC-event  1e94ef80     1e04d7d0 178 PEND         22782c 1e04d6c0       0     0
	save_restor1e94ef80     1ed02390 179 DELAY        22caac 1ed01d50       0    59
	CAS-client 1e94ef80     1e05f020 179 PEND         22782c 1e05ed50       0     0
	CAS-client 1e94ef80     1cbeec70 179 PEND         22782c 1cbee9a0       0     0
	CAS-client 1e94ef80     1cba2090 179 PEND         22782c 1cba1dc0       0     0
	CAS-client 1e94ef80     1cf0d600 179 PEND         22782c 1cf0d330       0     0
	CAS-client 1e94ef80     1de27e40 179 PEND         22782c 1de27b70       0     0
	CAS-client 1e94ef80     1cf3c160 179 PEND         22782c 1cf3be90       0     0
	CAS-client 1e94ef80     1bfcb370 179 PEND         22782c 1bfcb0a0       0     0
	CAS-client 1e94ef80     1d6d5f80 179 PEND         22782c 1d6d5cb0       0     0
	CAS-client 1e94ef80     1d6defc0 179 PEND         22782c 1d6decf0       0     0
	CAS-client 1e94ef80     1c5cc1a0 179 PEND         22782c 1c5cbed0       0     0
	CAS-client 1e94ef80     1d0ea100 179 PEND         22782c 1d0e9e30       0     0
	CAS-event  1e94ef80     1e062120 180 PEND         22782c 1e062010       0     0
	CAS-event  1e94ef80     1cba5270 180 PEND         22782c 1cba5160       0     0
	CAS-event  1e94ef80     1cee77a0 180 PEND         22782c 1cee7690       0     0
	CAS-event  1e94ef80     1cef37d0 180 PEND         22782c 1cef36c0       0     0
	CAS-event  1e94ef80     1dbe5e50 180 PEND         22782c 1dbe5d40       0     0
	CAS-event  1e94ef80     1dbe2d20 180 PEND         22782c 1dbe2c10       0     0
	CAS-event  1e94ef80     1cea7830 180 PEND         22782c 1cea7720       0     0
	CAS-event  1e94ef80     1d9eb9f0 180 PEND         22782c 1d9eb600       0     0
	CAS-event  1e94ef80     1cede0d0 180 PEND         22782c 1ceddfc0       0     0
	CAS-TCP    1e94ef80     1ed0dca0 181 PEND         22782c 1ed0da00       0     0
	CAS-beacon 1e94ef80     1ed0a9c0 182 DELAY        22caac 1ed0a6f0      2f   515
	CAS-UDP    1e94ef80     1ed07a00 183 PEND         22782c 1ed07670       0     0
	taskwd     1e94ef80     1ecb2e60 189 PEND+T       22782c 1ecb2ce0  3d0004   193
	errlog     1e94ef80     1eac12a0 189 PEND         22782c 1eac1170  3d0002     0
	ipToAsciiPr1e94ef80     1ed6ecf0 189 PEND         22782c 1ed6e8c0       0     0
	CAC-repeate1e94ef80     1e9cb080 189 PEND         22782c 1e9cacf0      30     0
	DS1621     117508       1ea9a750 199 DELAY        22caac 1ea9a6b0       0    38
	cpuUsageTas1e83713c     1ea9ae50 255 PEND         22782c 1ea9ad70       0     0

My best guess is that EPICS CA on the IOC and the gateway (we're running version 2.0.3.0, I'm told) are conspiring to create this issue...

Experimental Physics and Industrial Control System

Experimental Physics and
Industrial Control System