I asked the question quoted below in the middle of the summer, but as the problem is very intermittent (and when it does happen I don't always get my hands on the offending machine), it's been a while since I've been able to follow up.
In brief, the problem is this: I have a vxWorks IOC on an overloaded ethernet connection (it's also running a proprietary and ill-behaved MR1394 driver with a locally written Asyn EPICS driver), and occasionally the connection between the IOC and our gateway gets into a state where the number of PVs reported on the connection grows continually, and contains numerous repeats. Mysteriously, running `casr 2` on the IOC seems to clear the problem, at least temporarily!
I've been able to take a closer look today as the problem has recurred. One very interesting detail is that there appear to be *two* connections from the gateway, one that seems well behaved, the other not so:
TCP 172.23.194.201:60697(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 5 Channels, Priority=0
TCP 172.23.194.201:60698(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 9641 Channels, Priority=0
Of the 9641 channels reported here, there are only 25 distinct PVs, each duplicated hundreds of times. Note the two gateway connections with adjacent port numbers! (I'll pick up on this below.)
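Tallying those duplicates by hand over a 9600 baud line is painful, so a small script run over a captured `casr 2` transcript can do it. This is a sketch only: I'm assuming per-channel lines carry the PV name followed by a "(native type ..." clause, and the exact layout differs between EPICS releases, so the filter may need adjusting; the PV names below are invented for illustration.

```python
from collections import Counter

def duplicate_channels(casr_lines):
    """Tally how often each channel name appears in a `casr 2` transcript.

    Assumes channel detail lines look like 'NAME(native type ...' -- check
    against your own casr output before trusting the counts.
    """
    counts = Counter()
    for line in casr_lines:
        line = line.strip()
        if "(native type" in line:
            # The PV name is everything before the opening parenthesis.
            counts[line.split("(")[0].strip()] += 1
    return counts

# Hypothetical channel lines (names invented for illustration):
sample = [
    'SR01C-DI-CAM-01:STATUS(native type DBF_LONG, 1 el) monitored',
    'SR01C-DI-CAM-01:STATUS(native type DBF_LONG, 1 el) monitored',
    'SR01C-DI-CAM-01:ACQUIRE(native type DBF_ENUM, 1 el)',
]
print(duplicate_channels(sample).most_common(1))
# -> [('SR01C-DI-CAM-01:STATUS', 2)]
```

On the real transcript this immediately shows each of the 25 PVs repeated hundreds of times.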
From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Johnson
> On 2012-06-11 [email protected] wrote:
> > 1. The server is vxWorks 5.5.1 and EPICS 3.14.11
> > 3. The ethernet connection is horribly horribly overloaded (100MBit link),
> > EPICS would like to deliver far more image frames than the network will
> > permit
> > 5. At some random point during operation the number of IOC connections
> > ($(IOC):CA:CNX as reported by vxStats) starts climbing steadily.
>
> What is the CPU load on the IOC at this time? Does the CPU have any
> idle time at all? Can you run Spy to find out which threads are getting
> any work done?
Surprisingly, vxStats reports only 27% CPU:LOAD. I ran a few rounds of spy with the following reports (I've deleted all the 0 tick entries from each report to keep the transcript manageable):
SR01C-DI-IOC-02 -> spy
value = 469523296 = 0x1bfc5b60
SR01C-DI-IOC-02 ->
NAME ENTRY TID PRI total % (ticks) delta % (ticks)
-------- -------- ----- --- --------------- ---------------
t1394iidc 1ec566a0 45 10% ( 50) 10% ( 50)
tNetTask netTask 1ee314b0 50 2% ( 14) 2% ( 14)
CAS-event 1cbf1e50 50 3% ( 16) 3% ( 16)
CAC-UDP 1ed55740 147 0% ( 1) 0% ( 1)
CAS-client 1de27e40 179 1% ( 5) 1% ( 5)
CAS-event 1d9eb9f0 180 4% ( 23) 4% ( 23)
CAC-repeater 1e9cb080 189 0% ( 2) 0% ( 2)
KERNEL 0% ( 1) 0% ( 1)
IDLE 77% ( 387) 77% ( 387)
TOTAL 97% ( 499) 97% ( 499)
1 task was created.
NAME ENTRY TID PRI total % (ticks) delta % (ticks)
-------- -------- ----- --- --------------- ---------------
tSpyTask spyComTask 1bfc5b60 5 0% ( 12) 1% ( 12)
t1394ohci ohciTask 1ec804b0 40 0% ( 1) 0% ( 1)
t1394iidc 1ec566a0 45 9% ( 163) 9% ( 113)
tNetTask netTask 1ee314b0 50 3% ( 55) 3% ( 41)
CAS-event 1cbf1e50 50 3% ( 50) 2% ( 34)
cbLow 1eaaabf0 140 0% ( 4) 0% ( 4)
CAC-UDP 1ed55740 147 0% ( 2) 0% ( 1)
timerQueue 1ed69090 148 0% ( 1) 0% ( 1)
CAS-client 1de27e40 179 0% ( 14) 0% ( 9)
CAS-event 1d9eb9f0 50 4% ( 71) 4% ( 48)
CAS-UDP 1ed07a00 183 0% ( 3) 0% ( 3)
CAC-repeater 1e9cb080 189 0% ( 5) 0% ( 3)
cpuUsageTask 1ea9ae50 255 31% ( 510) 44% ( 510)
KERNEL 1% ( 27) 2% ( 26)
IDLE 44% ( 723) 29% ( 336)
TOTAL 95% ( 1641) 94% ( 1142)
NAME ENTRY TID PRI total % (ticks) delta % (ticks)
-------- -------- ----- --- --------------- ---------------
tSpyTask spyComTask 1bfc5b60 5 0% ( 20) 0% ( 8)
t1394ohci ohciTask 1ec804b0 40 0% ( 5) 0% ( 4)
t1394iidc 1ec566a0 45 9% ( 276) 9% ( 113)
tNetTask netTask 1ee314b0 50 3% ( 93) 3% ( 38)
CAS-event 1cbf1e50 50 3% ( 90) 3% ( 40)
cbLow 1eaaabf0 140 0% ( 7) 0% ( 3)
CAC-UDP 1ed55740 147 0% ( 3) 0% ( 1)
timerQueue 1ed69090 148 0% ( 2) 0% ( 1)
save_restore 1ed02390 179 0% ( 1) 0% ( 1)
CAS-client 1de27e40 179 1% ( 29) 1% ( 15)
CAS-event 1d9eb9f0 50 4% ( 112) 3% ( 41)
CAS-UDP 1ed07a00 183 0% ( 6) 0% ( 3)
CAC-repeater 1e9cb080 189 0% ( 8) 0% ( 3)
KERNEL 1% ( 43) 1% ( 16)
IDLE 56% ( 1576) 74% ( 853)
TOTAL 95% ( 2781) 94% ( 1140)
Nothing catches my eye.
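One sanity check worth automating on reports like these: if the per-task delta percentages plus KERNEL and IDLE fall well short of 100%, the missing ticks were consumed somewhere spy cannot see, chiefly interrupt context (a busy network driver, for instance). A trivial sketch, using the delta column from the second report above:

```python
def unaccounted(task_deltas, idle, kernel, total=100):
    """Percentage of CPU time not attributed to any listed task.

    spy cannot see time spent at interrupt level, and its tick sampling
    rounds each entry down, so a few percent of slack is normal; a large
    gap would point at interrupt-context work.
    """
    return total - (sum(task_deltas) + idle + kernel)

# Delta-% column of the second spy report above, including cpuUsageTask's 44%:
deltas = [1, 0, 9, 3, 2, 0, 0, 0, 0, 4, 0, 0, 44]
print(unaccounted(deltas, idle=29, kernel=2))  # -> 6, matching the 94% TOTAL
```

Six percent of slack is unremarkable, which is consistent with nothing catching the eye here: the CPU really does seem to have headroom.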
> Also, are you running a name-server for the gateway, or is it relying
> on UDP broadcasts for name resolution?
No, I don't believe we're using an EPICS name server anywhere on our network.
> > 7. Running `casr 2` takes forever (well, several minutes), the connection
> > is over a 9600 baud serial line. Of the 8697 channels, the same PV is
> > reported over and over and over and over again (it's a simple camera
> > STATUS provided by the asyn driver).
> >
> > 8. After `casr 2` has completed, the bogus channels have gone away:
>
> > Very odd. Any thoughts?
>
> Your 'casr 2' is being run at a much higher priority than the other Channel
> Access activities. I suspect your CPU is too busy handling higher priority
> operations to spend any time actually running the low-level CA threads that
> communicate with clients.
>
> UDP name lookups normally happen in the lowest priority CA thread, so the IOC
> /should/ stop accepting new client connections if the CPU gets overloaded
> because the name resolver task never gets a chance to run. However if the
> gateway uses a name-server instead of UDP broadcasts this natural throttling
> will not be happening and I could see your symptoms occurring as a
> result.
>
> Does this help explain what might be going on?
I think the problem was only between the IOC and the gateway, and in particular other clients on the same network can see the IOC just fine. I guess this means that the name resolver task is running normally?
From: Hill, Jeff [mailto:[email protected]]
(I've elided thoughts on CPU starvation, don't think that applies.)
> Furthermore, if the IP kernel is low on buffers then activities in the server
> might also be stalled. It's even possible that TCP circuit shutdown activities
> might stall (i.e. socket close might block) when a network buffer isn't available
> depending on the IP kernel's implementation.
Buffer starvation in the vxWorks kernel? Perish the thought! Yes, that's a possibility, though we have to take into account the fact that other CA activity was proceeding normally.
> > 6. Running casr reports that practically all of these connections are to
> > the gateway, for example:
>
> Starting around R3.14.6 I made some changes in the CA client library so that it
> will _not_ disconnect an unresponsive circuit and start a new one in such (heavily
> congested) situations. Instead it disconnects the application but does not
> disconnect the circuit, and simply waits for TCP to recover the circuit using
> its mature built-in capabilities for dealing with congestion.
Ahhh. Now *that* is an interesting thought. Again, recall that we're dealing with vxWorks (5.5.1), and talk of "mature" capabilities strikes me as a trifle optimistic in this context.
> However, if for any reason the CA GW, on its own volition, were to destroy the
> channel and create a new one (when the channel was unresponsive) then this would
> circumvent the protections mentioned in the previous paragraph. I didn't write
> the GW, but have looked inside of it, and I don't recall that it does this. I do
> seem to recall that if a channel isn't used for some time in the gateway then it
> will be destroyed and later recreated when a client (of the GW) needs it again, but
> the timeouts are probably long enough that they are not coming into play in your
> situation? Another possibility might be that this gateway was somehow using
> a pre R3.14.6 ca client library.
This makes a lot of sense. It's worth noting, I think, that the gateway seems to be having real difficulties in communicating with this IOC, and only a fraction of its PVs are visible. That the gateway might be continually trying to reconnect strikes me as very plausible.
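The arithmetic of that reconnect loop is easy to illustrate. The toy model below is emphatically not gateway or rsrv code, and it glosses over *why* creates would be processed while clears are not on the same TCP stream; it only shows how a retry loop over 25 PVs, with channel destruction never taking effect server-side, produces exactly the kind of count we saw (the retry count is invented):

```python
# Toy model of the suggested failure mode: the client keeps destroying and
# recreating the same channels, but the wedged circuit never processes the
# clears, so only the creates take effect and the count climbs without bound.
class WedgedCircuit:
    def __init__(self):
        self.channels = []

    def create_channel(self, name):
        self.channels.append(name)    # creates still take effect...

    def clear_channel(self, name):
        pass                          # ...while clears sit in the backlog

server = WedgedCircuit()
pvs = ["CAM:PV%d" % i for i in range(25)]   # the 25 distinct PVs seen above
for attempt in range(300):                  # hypothetical gateway retry loop
    for pv in pvs:
        server.clear_channel(pv)            # destroy the old channel (lost)
        server.create_channel(pv)           # recreate it (accumulates)
print(len(server.channels))  # -> 7500: hundreds of copies of each PV
```

Three hundred retries gets you to 7500 channels, the same order of magnitude as the 9641 observed.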
> > 8. After `casr 2` has completed, the bogus channels have gone away:
>
> As I recall, the casr doesn't perform any cleanup activities. I don't
> claim to have a clear explanation for this behavior. One possible guess
> is that casr, running at higher priority than the vxWorks shell,
> temporarily disrupts the scan tasks from adding more labor to the
> event queue for the server to complete, and thereby allows network
> buffer starvation in the IP kernel to clear out, and maybe this allows
> the server threads to finish their cleanup activities. Maybe a stretch, HTA.
A couple of details are puzzling me here.
1. What on earth is the gateway doing running *two* TCP connections to the IOC? They have adjacent port numbers (60697, 60698), which suggests they were allocated by the same task and close together in time. The first connection (60697) appears to be running normally, and now (after "clearing" the problem with a couple of rounds of `casr 2`) all new connections through the gateway go through this port. The second connection (60698) looks wedged and is the one that grew.
2. I have a number of connections on the IOC for which `casr 2` reports large numbers of "Unprocessed request bytes". Below I attach a transcript with the stable and healthy connections elided. This was taken after successfully clearing everything down with a previous run of `casr 2`.
SR01C-DI-IOC-02 -> casr 2
Channel Access Server V4.11
Connected circuits:
TCP 172.23.194.201:60697(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 55 Channels, Priority=0
Task Id=0x1e05f020, Socket FD=11
Secs since last send 0.95, Secs since last receive 213.35
Unprocessed request bytes=0, Undelivered response bytes=0
State=up jumbo-send-buf
12024 bytes allocated
TCP 172.23.194.201:60698(cs03r-cs-gate-01.cs.diamond.ac.uk): User="gate", V4.11, 21 Channels, Priority=0
Task Id=0x1cbeec70, Socket FD=40
Secs since last send 0.10, Secs since last receive 2250.70
Unprocessed request bytes=786472, Undelivered response bytes=0
State=up jumbo-send-buf
5472 bytes allocated
TCP 172.23.196.3:51554(cs01r-cs-pc-03.pri.diamond.ac.uk): User="ops-ycs63881", V4.11, 5 Channels, Priority=0
Task Id=0x1cf0d600, Socket FD=31
Secs since last send 771508.40, Secs since last receive 771665.92
Unprocessed request bytes=196648, Undelivered response bytes=0
State=up jumbo-send-buf
1440 bytes allocated
UDP Server:
UDP 172.23.194.10:43801(): User="", V4.11, 0 Channels, Priority=0
Task Id=0x1ed07a00, Socket FD=17
Secs since last send 285.30, Secs since last receive 0.20
Unprocessed request bytes=16, Undelivered response bytes=0
State=up
180 bytes allocated
There are currently 17213444 bytes on the server's free list
18 client(s), 9609 channel(s), 437 event(s) (monitors) 0 putNotify(s)
28 small buffers (16384 bytes ea), and 8 jumbo buffers (2000024 bytes ea)
The server's resource id conversion table:
Bucket entries in use = 119 bytes in use = 18308
Bucket entries/hash id - mean = 0.029053 std dev = 0.167954 max = 1
The server's array size limit is 2000024 bytes max
Channel Access Address List
172.23.207.255:5065
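Picking the suspect circuits out of a long `casr 2` transcript can be scripted: a circuit with a non-zero unprocessed-request backlog *and* a long receive silence is the signature seen above. The regexes below match the line layout in this transcript; other EPICS versions may format these lines differently.

```python
import re

def wedged_circuits(casr_text, stale_secs=60.0):
    """Flag circuits that combine an unprocessed-request backlog with a
    long silence on the receive side -- the pattern seen on port 60698."""
    suspects = []
    # Each circuit's details follow its "TCP <addr>(...)" header line.
    for block in re.split(r"\n(?=TCP )", casr_text):
        peer = re.match(r"TCP (\S+?)\(", block)
        recv = re.search(r"last receive ([\d.]+)", block)
        backlog = re.search(r"Unprocessed request bytes=(\d+)", block)
        if peer and recv and backlog:
            if int(backlog.group(1)) > 0 and float(recv.group(1)) > stale_secs:
                suspects.append((peer.group(1), int(backlog.group(1))))
    return suspects

# Trimmed lines from the transcript above:
sample = """\
TCP 172.23.194.201:60697(cs03r-cs-gate-01): User="gate", 55 Channels
Secs since last send 0.95, Secs since last receive 213.35
Unprocessed request bytes=0, Undelivered response bytes=0
TCP 172.23.194.201:60698(cs03r-cs-gate-01): User="gate", 21 Channels
Secs since last send 0.10, Secs since last receive 2250.70
Unprocessed request bytes=786472, Undelivered response bytes=0
"""
print(wedged_circuits(sample))
# -> [('172.23.194.201:60698', 786472)]
```

Run over the full transcript this also flags the ops-ycs63881 circuit (196648 bytes backlogged, nine days since last receive), which by rights should have been timed out long ago.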
> The best way to find out what _is_ occurring would be to log into that
> IOC and use the "tt <task id>" vxWorks target shell command to determine
> where the CA server's TCP threads happen to be loitering at.
Well, I've had a quick poke around with `tt`, but I think we'll need a better idea of where we're looking. There are a lot of tasks...
SR01C-DI-IOC-02 -> i
NAME ENTRY TID PRI STATUS PC SP ERRNO DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
tExcTask excTask 1effcbd0 0 PEND 23385c 1effcab0 3006b 0
tLogTask logTask 1effa000 0 PEND 23385c 1eff9ef0 0 0
tShell shell 1e0476b0 1 READY 22d188 1e047290 c0002 0
tRlogind rlogind 1eda4f40 2 PEND 22782c 1eda4ae0 0 0
tWdbTask wdbTask 1ed97f90 3 PEND 22782c 1ed97e60 0 0
tillRdb 11823c 1e042670 10 PEND 22782c 1e042210 3d0002 0
t1394ohci ohciTask 1ec804b0 40 PEND 22782c 1ec803b0 0 0
t1394stack 1e7f3c44 1ec7e2e0 45 PEND 228058 1ec7e1f0 0 0
t1394stackR1e7f3c44 1ec7bab0 45 PEND 228058 1ec7b9c0 0 0
t1394iidc 1e7f3c44 1ec566a0 45 PEND 228058 1ec565b0 0 0
tAioIoTask1aioIoTask 1efeb120 50 PEND 228058 1efeb020 0 0
tAioIoTask0aioIoTask 1efe3e70 50 PEND 228058 1efe3d70 0 0
tNetTask netTask 1ee314b0 50 PEND 22782c 1ee313c0 0 0
CAS-event 1e94ef80 1cbf1e50 50 PEND+I 22782c 1cbf1a60 0 0
CAS-event 1e94ef80 1cf1f7f0 50 PEND+I 22782c 1cf1f400 0 0
tAioWait aioWaitTask 1eff23d0 51 PEND 22782c 1eff2260 3d0002 0
tPortmapd portmapd 1eda0970 54 PEND 22782c 1eda0710 3d0002 0
tTelnetd telnetd 1eda2b60 55 PEND 22782c 1eda29f0 0 0
NTPTimeSync1e94ef80 1e9cf920 109 PEND+T 22782c 1e9cf7a0 3d0002 3009
ClockTimeSy1e94ef80 1e9cd3c0 109 PEND+T 22782c 1e9cd240 3d0004 2999
cbHigh 1e94ef80 1ea30cb0 128 PEND 22782c 1ea30b90 0 0
timerQueue 1e94ef80 1eab2d00 129 PEND 22782c 1eab27c0 3d0004 0
scanOnce 1e94ef80 1ed49c70 129 PEND 22782c 1ed49b50 0 0
scan0.1 1e94ef80 1ed1e680 133 PEND+T 22782c 1ed1e530 3d0004 2
scan0.2 1e94ef80 1ed23f70 134 PEND+T 22782c 1ed23e20 3d0004 1
cbMedium 1e94ef80 1eaa3480 135 PEND 22782c 1eaa3360 0 0
scan0.5 1e94ef80 1ed29860 135 PEND+T 22782c 1ed29710 3d0004 4
scan1 1e94ef80 1ed2f150 136 PEND+T 22782c 1ed2f000 0 57
scan2 1e94ef80 1ed34a40 137 PEND+T 22782c 1ed348f0 3d0004 97
scan5 1e94ef80 1ed3a330 138 PEND+T 22782c 1ed3a1e0 3d0004 140
timerQueue 1e94ef80 1ecaaed0 139 PEND 22782c 1ecaa990 0 0
timerQueue 1e94ef80 1eca7840 139 PEND 22782c 1eca7300 0 0
scan10 1e94ef80 1ed3fc20 139 PEND+T 22782c 1ed3fad0 3d0004 87
cbLow 1e94ef80 1eaaabf0 140 PEND 22782c 1eaaaad0 0 0
asCaTask 1e94ef80 1ed13ed0 142 PEND 22782c 1ed13db0 0 0
CAC-UDP 1e94ef80 1ed55740 147 PEND 22782c 1ed551b0 0 0
timerQueue 1e94ef80 1ed69090 148 PEND+T 22782c 1ed68b50 3d0004 1
CAC-event 1e94ef80 1e7aa7a0 148 PEND 22782c 1e7aa690 0 0
CAC-TCP-sen1e94ef80 1e16ae70 148 PEND 22782c 1e16a860 0 0
timerQueue 1e94ef80 1eca3b00 149 PEND+T 22782c 1eca35c0 3d0004 22
dbCaLink 1e94ef80 1ea2b4a0 149 PEND 22782c 1ea2b380 0 0
poolPoll 1e94ef80 1ed04900 149 DELAY 22caac 1ed04820 0 30
CAM1 1e94ef80 1e071bc0 149 PEND+T 22782c 1e071a30 3d0004 37
CAM2 1e94ef80 1e06e520 149 PEND+T 22782c 1e06e390 3d0004 32
CAM3 1e94ef80 1e06b420 149 PEND+T 22782c 1e06b290 3d0004 27
CAM4 1e94ef80 1e068320 149 PEND+T 22782c 1e068190 3d0004 22
CAM5 1e94ef80 1e065220 149 PEND+T 22782c 1e065090 3d0004 17
CAC-TCP-rec1e94ef80 1e787ba0 150 PEND 22782c 1e7872e0 0 0
CAS-client 1e94ef80 1d98cf90 175 PEND 22782c 1d98ccc0 0 0
CAS-client 1e94ef80 1cf07df0 175 PEND 22782c 1cf07b20 0 0
CAS-event 1e94ef80 1d96c240 176 PEND 22782c 1d96c130 0 0
CAS-event 1e94ef80 1cef0020 176 PEND 22782c 1ceeff10 0 0
CAC-event 1e94ef80 1e04d7d0 178 PEND 22782c 1e04d6c0 0 0
save_restor1e94ef80 1ed02390 179 DELAY 22caac 1ed01d50 0 59
CAS-client 1e94ef80 1e05f020 179 PEND 22782c 1e05ed50 0 0
CAS-client 1e94ef80 1cbeec70 179 PEND 22782c 1cbee9a0 0 0
CAS-client 1e94ef80 1cba2090 179 PEND 22782c 1cba1dc0 0 0
CAS-client 1e94ef80 1cf0d600 179 PEND 22782c 1cf0d330 0 0
CAS-client 1e94ef80 1de27e40 179 PEND 22782c 1de27b70 0 0
CAS-client 1e94ef80 1cf3c160 179 PEND 22782c 1cf3be90 0 0
CAS-client 1e94ef80 1bfcb370 179 PEND 22782c 1bfcb0a0 0 0
CAS-client 1e94ef80 1d6d5f80 179 PEND 22782c 1d6d5cb0 0 0
CAS-client 1e94ef80 1d6defc0 179 PEND 22782c 1d6decf0 0 0
CAS-client 1e94ef80 1c5cc1a0 179 PEND 22782c 1c5cbed0 0 0
CAS-client 1e94ef80 1d0ea100 179 PEND 22782c 1d0e9e30 0 0
CAS-event 1e94ef80 1e062120 180 PEND 22782c 1e062010 0 0
CAS-event 1e94ef80 1cba5270 180 PEND 22782c 1cba5160 0 0
CAS-event 1e94ef80 1cee77a0 180 PEND 22782c 1cee7690 0 0
CAS-event 1e94ef80 1cef37d0 180 PEND 22782c 1cef36c0 0 0
CAS-event 1e94ef80 1dbe5e50 180 PEND 22782c 1dbe5d40 0 0
CAS-event 1e94ef80 1dbe2d20 180 PEND 22782c 1dbe2c10 0 0
CAS-event 1e94ef80 1cea7830 180 PEND 22782c 1cea7720 0 0
CAS-event 1e94ef80 1d9eb9f0 180 PEND 22782c 1d9eb600 0 0
CAS-event 1e94ef80 1cede0d0 180 PEND 22782c 1ceddfc0 0 0
CAS-TCP 1e94ef80 1ed0dca0 181 PEND 22782c 1ed0da00 0 0
CAS-beacon 1e94ef80 1ed0a9c0 182 DELAY 22caac 1ed0a6f0 2f 515
CAS-UDP 1e94ef80 1ed07a00 183 PEND 22782c 1ed07670 0 0
taskwd 1e94ef80 1ecb2e60 189 PEND+T 22782c 1ecb2ce0 3d0004 193
errlog 1e94ef80 1eac12a0 189 PEND 22782c 1eac1170 3d0002 0
ipToAsciiPr1e94ef80 1ed6ecf0 189 PEND 22782c 1ed6e8c0 0 0
CAC-repeate1e94ef80 1e9cb080 189 PEND 22782c 1e9cacf0 30 0
DS1621 117508 1ea9a750 199 DELAY 22caac 1ea9a6b0 0 38
cpuUsageTas1e83713c 1ea9ae50 255 PEND 22782c 1ea9ad70 0 0
My best guess is that EPICS CA on the IOC and the gateway (we're running version 2.0.3.0, I'm told) are conspiring to create this issue...
- Replies:
- RE: Runaway connection count on IOC -- Possible Gateway Issue? Hill, Jeff
- References:
- Runaway connection count on IOC michael.abbott
- Re: Runaway connection count on IOC Andrew Johnson