Marty wrote:
> Could you give a little more info about the application?
>
> I think you are saying
>
> iocA is getting data at a 30Hz rate
> iocB has input links to iocA and iocB drives the digital readout display.
> MEDM gets its input from iocA
>
> Is this correct?
Well, sorta. iocA is getting the data at 30Hz. iocB collects that data from iocA
and drives the DROs. This is done through CA in the same subroutine record that
drives all the DROs as well as some other equipment (it is pretending to be the
old DataGeneral computer). There is ONE record in iocB that has a link to iocA.
That data (the calculated value in the record in iocB) is often updated by the
DRO subroutine even when the other readouts freeze. That indicates to me that
iocB is still functioning and still writing the DROs successfully, but that the
CA from the subroutine record to iocA has stalled, even though that very same
task is still able to fetch values locally.
It may be worth noting that my subroutine does not distinguish local and remote
records. It stacks 'em up with ca_search(). Then does ca_pend_io(10.0) for the
list, and then does ca_add_event() for each channel to get a callback. After
doing that all once, it hits this code:
    /* Loop here handling ca_events and ca_puts forever */
    while (1) {
        semTake(ca_sem_id, WAIT_FOREVER);  /* Wait for Process' okay to run */
        status = ca_pend_event(.0001);     /* Do it */
    }
The sub record offers up that semaphore once each time it's processed.
This just in: I made an MEDM display with readouts identical to the DRO panel.
Watching it on the workstation and the DRO panel at the same time, it is obvious
that both copies freeze simultaneously. Does that help narrow the search?
> How are the input links configured? CPP CP ?
Whichever is the default. These databases were in place before those were added.
The one record in iocB that has an input link from iocA lists it as "PP MS".
> How many signals total?
In iocA? <200. In iocB? Just a handful, though one is a subRecord which acts as
a DataGeneral emulator and is more complex than I want to think about again.
Shared between the two and related to the problem at hand? About 20 for sure.
Though I've seen delays looking at other channels on iocA with MEDM.
> How heavily is your network loaded?
> Maybe you can look at tcpstatShow on the ioc regularly. If you could look at it
> before and after an incident it might give some info.
Here goes:
iocDRO162b> tcpstatShow
TCP:
5373166 packets sent
2658284 data packets (86524601 bytes)
190203 data packets (37020634 bytes) retransmitted
2232064 ack-only packets (995946 delayed)
0 URG only packet
0 window probe packet
291537 window update packets
1078 control packets
12295282 packets received
839174 acks (for 86527309 bytes)
605625 duplicate acks
0 ack for unsent data
9817841 packets (1965333241 bytes) received in-sequence
1945 completely duplicate packets (2739425 bytes)
342 packets with some dup. data (195643 bytes duped)
1232104 out-of-order packets (443996099 bytes)
0 packet (0 byte) of data after window
0 window probe
38723 window update packets
0 packet received after close
0 discarded for bad checksum
0 discarded for bad header offset field
0 discarded because packet too short
21 connection requests
1022 connection accepts
1043 connections established (including accepts)
1054 connections closed (including 0 drop)
8 embryonic connections dropped
652076 segments updated rtt (of 841775 attempts)
6775 retransmit timeouts
0 connection dropped by rexmit timeout
0 persist timeout
15 keepalive timeouts
15 keepalive probes sent
0 connection dropped by keepalive
value = 36 = 0x24 = '$'
I did this before, during, and after an event and noticed no significant change.
Only the "retransmit timeouts" increased by more than a few counts and it seems
to keep doing that anyway.
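To put the counters above in proportion, here is a small C sketch that turns a few of them into percentages. The names pct_of and print_tcp_ratios are made up for illustration and are not part of vxWorks or EPICS; the numbers are copied from the tcpstatShow listing above. Which ratios matter, and what counts as too high, is an assumption that depends on the network.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: "part" as a percentage of "whole".
 * Returns 0 when "whole" is 0 so an idle counter does not divide by zero. */
double pct_of(unsigned long part, unsigned long whole)
{
    if (whole == 0)
        return 0.0;
    assert(part <= whole);  /* a share of a total cannot exceed the total */
    return 100.0 * (double)part / (double)whole;
}

/* Prints three ratios using the counters quoted from tcpstatShow above. */
void print_tcp_ratios(void)
{
    /* data packets retransmitted vs. data packets sent */
    printf("retransmitted data packets: %.2f%%\n",
           pct_of(190203UL, 2658284UL));
    /* out-of-order packets vs. all packets received */
    printf("out-of-order packets:       %.2f%%\n",
           pct_of(1232104UL, 12295282UL));
    /* duplicate acks vs. acks received */
    printf("duplicate acks:             %.2f%%\n",
           pct_of(605625UL, 839174UL));
}
```

Calling print_tcp_ratios() from a host program shows that roughly 7% of the data packets sent were retransmissions and about 10% of the packets received arrived out of order.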
> Also look at mbufShow from time to time.
> netHelp shows all the network show commands supplied with vxWorks.
iocDRO162b> mbufShow
type number
--------- ------
FREE : 170
DATA : 26
HEADER : 13
SOCKET : 0
PCB : 47
RTABLE : 2
HTABLE : 0
ATABLE : 0
SONAME : 0
ZOMBIE : 0
SOOPTS : 0
FTABLE : 0
RIGHTS : 0
IFADDR : 2
TOTAL : 260
number of mbufs: 260
number of clusters: 7
number of interface pages: 0
number of free clusters: 7
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0
I did this several times and this one has the lowest number of free mbufs. Most
of the time it hovers around 190 free.
> We have been looking at Ethernet errors such as collisions with
> "ifShow" on vxWorks, and with "netstat -i" on UNIX. You can also
> look for IP level errors with "ipstatShow" on vxWorks, and
> "netstat -s" on UNIX. Are you using a switched Ethernet?
Okay, as luck would have it, I did "ifShow" twice during one of the long
outages without realizing it:
iocDRO162b> ifShow
ei (unit number 0):
Flags: (0x63) UP BROADCAST ARP RUNNING
Internet address: 164.54.250.3
Broadcast address: 164.54.251.255
Netmask 0xffff0000 Subnetmask 0xfffffe00
Ethernet address is 08:00:3e:24:ed:92
Metric is 0
Maximum Transfer Unit size is 1500
11554152 packets received; 4515187 packets sent
21759 input errors; 469370 output errors
27758 collisions
lo (unit number 0):
Flags: (0x69) UP LOOPBACK ARP RUNNING
Internet address: 127.0.0.1
Netmask 0xff000000 Subnetmask 0xff000000
Metric is 0
Maximum Transfer Unit size is 4096
80860 packets received; 80860 packets sent
0 input errors; 0 output errors
0 collisions
value = 18 = 0x12
iocDRO162b> ifShow
ei (unit number 0):
Flags: (0x63) UP BROADCAST ARP RUNNING
Internet address: 164.54.250.3
Broadcast address: 164.54.251.255
Netmask 0xffff0000 Subnetmask 0xfffffe00
Ethernet address is 08:00:3e:24:ed:92
Metric is 0
Maximum Transfer Unit size is 1500
11556988 packets received; 4516733 packets sent
21760 input errors; 469438 output errors
27759 collisions
lo (unit number 0):
Flags: (0x69) UP LOOPBACK ARP RUNNING
Internet address: 127.0.0.1
Netmask 0xff000000 Subnetmask 0xff000000
Metric is 0
Maximum Transfer Unit size is 4096
80898 packets received; 80898 packets sent
0 input errors; 0 output errors
0 collisions
value = 18 = 0x12
Then I was called from the control room and notified of the situation because
they were starting to worry. While I was on the phone, the problem cleared. I
did "ifShow" again:
iocDRO162b> ifShow
ei (unit number 0):
Flags: (0x63) UP BROADCAST ARP RUNNING
Internet address: 164.54.250.3
Broadcast address: 164.54.251.255
Netmask 0xffff0000 Subnetmask 0xfffffe00
Ethernet address is 08:00:3e:24:ed:92
Metric is 0
Maximum Transfer Unit size is 1500
11563366 packets received; 4520084 packets sent
21766 input errors; 469612 output errors
27768 collisions
lo (unit number 0):
Flags: (0x69) UP LOOPBACK ARP RUNNING
Internet address: 127.0.0.1
Netmask 0xff000000 Subnetmask 0xff000000
Metric is 0
Maximum Transfer Unit size is 4096
80978 packets received; 80978 packets sent
0 input errors; 0 output errors
0 collisions
value = 18 = 0x12
Mean anything to you?
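For comparing the three ifShow snapshots above, a small C sketch can subtract the "ei" counters and express the errors as a share of packets sent. The struct if_counters and the functions if_delta and error_pct are hypothetical names used only for illustration; the numbers come from the output above. No timestamps were recorded with the snapshots, so this gives differences between snapshots, not per-second rates.

```c
#include <assert.h>
#include <stdio.h>

/* Counters copied by hand from one "ei" block of ifShow output. */
struct if_counters {
    unsigned long rx_packets;
    unsigned long tx_packets;
    unsigned long input_errors;
    unsigned long output_errors;
    unsigned long collisions;
};

/* Difference between two snapshots.
 * Assumes the counters only grow and did not wrap in between. */
struct if_counters if_delta(struct if_counters earlier, struct if_counters later)
{
    struct if_counters d;

    assert(later.rx_packets >= earlier.rx_packets);
    assert(later.tx_packets >= earlier.tx_packets);
    assert(later.input_errors >= earlier.input_errors);
    assert(later.output_errors >= earlier.output_errors);
    assert(later.collisions >= earlier.collisions);

    d.rx_packets    = later.rx_packets    - earlier.rx_packets;
    d.tx_packets    = later.tx_packets    - earlier.tx_packets;
    d.input_errors  = later.input_errors  - earlier.input_errors;
    d.output_errors = later.output_errors - earlier.output_errors;
    d.collisions    = later.collisions    - earlier.collisions;
    return d;
}

/* Errors as a percentage of packets; 0 when no packets were counted. */
double error_pct(unsigned long errors, unsigned long packets)
{
    if (packets == 0)
        return 0.0;
    return 100.0 * (double)errors / (double)packets;
}

/* Prints one line summarizing the change between two snapshots. */
void print_if_delta(const char *label, struct if_counters d)
{
    printf("%s: +%lu sent, +%lu output errors (%.2f%%), +%lu collisions\n",
           label, d.tx_packets, d.output_errors,
           error_pct(d.output_errors, d.tx_packets), d.collisions);
}
```

With the values above, the first pair of snapshots differs by 1546 packets sent and 68 output errors, and the second pair by 3351 packets sent and 174 output errors.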
Please let me know if any of this helps or if there's anything else I can do to
help diagnose the problem. In the meantime, I'm still trying to rope in the
network sniffer. If I find out anything else, I'll let you know.
Thanks for the help.
Garrett Rinehart
Intense Pulsed Neutron Source
Argonne National Laboratory
9700 S. Cass Ave
Argonne, IL 60439
(630)252-6561