Experimental Physics and
Industrial Control System

"Garrett D. Rinehart" <[email protected]> · Thu, 9 Dec 1999 11:29:34 -0600 (CST)

Marty wrote:

> Could you give a little more info about the application?
> 
> I think you are saying
> 
> iocA is getting data at a 30Hz rate
> iocB has input links to iocA and iocB drives the digital readout display.
> MEDM gets its input from iocA
> 
> Is this correct?

Well, sorta. iocA is getting the data at 30Hz. iocB collects that data from iocA 
and drives the DROs. This is done through CA in the same subroutine record that 
drives all the DROs as well as some other equipment (it is pretending to be the 
old DataGeneral computer). There is ONE record in iocB that has a link to iocA. 
That data (the calculated value in the record in iocB) is often updated by the 
DRO subroutine even when the other readouts freeze. That indicates to me that 
iocB is still functioning and still writing the DROs successfully, but that the 
CA from the subroutine record to iocA has stalled, even though that very same 
task is still able to fetch values locally. 

It may be worth noting that my subroutine does not distinguish local and remote 
records. It queues them all up with ca_search(), then calls ca_pend_io(10.0) for 
the list, and then calls ca_add_event() for each channel to get a callback. After 
doing all of that once, it enters this loop:
/* Loop here handling ca_events and ca_puts forever */
  while (1) {
    semTake(ca_sem_id, WAIT_FOREVER); /* Wait for Process' okay to run */
    status = ca_pend_event(.0001);    /* Do it */
  }
The sub record offers up that semaphore once each time it's processed.

This just in: I made an MEDM display with readouts identical to the DRO panel. 
Watching it on the workstation and the DRO panel at the same time, it is obvious 
that both copies freeze simultaneously. Does that help narrow the search?

> How are the input links configured? CPP CP ?

Whichever is the default. These databases were in place before those were added. 
The one record in iocB that has an input link from iocA lists it as "PP MS".

> How many signals total?

In iocA? <200. In iocB? Just a handful, though one is a subRecord which acts as 
a DataGeneral emulator and is more complex than I want to think about again. 
Shared between the two and related to the problem at hand? About 20 for sure. 
Though I've seen delays looking at other channels on iocA with MEDM.

> How heavily is your network loaded?
> Maybe you can look at tcpstatShow on the ioc regularly. If you could look at it
> before and after an incident it might give some info.

Here goes:
iocDRO162b> tcpstatShow
TCP:
        5373166 packets sent
                2658284 data packets (86524601 bytes)
                190203 data packets (37020634 bytes) retransmitted
                2232064 ack-only packets (995946 delayed)
                0 URG only packet
                0 window probe packet
                291537 window update packets
                1078 control packets
        12295282 packets received
                839174 acks (for 86527309 bytes)
                605625 duplicate acks
                0 ack for unsent data
                9817841 packets (1965333241 bytes) received in-sequence
                1945 completely duplicate packets (2739425 bytes)
                342 packets with some dup. data (195643 bytes duped)
                1232104 out-of-order packets (443996099 bytes)
                0 packet (0 byte) of data after window
                0 window probe
                38723 window update packets
                0 packet received after close
                0 discarded for bad checksum
                0 discarded for bad header offset field
                0 discarded because packet too short
        21 connection requests
        1022 connection accepts
        1043 connections established (including accepts)
        1054 connections closed (including 0 drop)
        8 embryonic connections dropped
        652076 segments updated rtt (of 841775 attempts)
        6775 retransmit timeouts
                0 connection dropped by rexmit timeout
        0 persist timeout
        15 keepalive timeouts
                15 keepalive probes sent
                0 connection dropped by keepalive
value = 36 = 0x24 = '$'

I did this before, during, and after an event and noticed no significant change. 
Only the "retransmit timeouts" increased by more than a few counts and it seems 
to keep doing that anyway.

> Also look at mbufShow from time to time. 
> netHelp shows all the network show commands supplied with vxWorks.

iocDRO162b> mbufShow
type        number
---------   ------
FREE    :    170
DATA    :     26
HEADER  :     13
SOCKET  :      0
PCB     :     47
RTABLE  :      2
HTABLE  :      0
ATABLE  :      0
SONAME  :      0
ZOMBIE  :      0
SOOPTS  :      0
FTABLE  :      0
RIGHTS  :      0
IFADDR  :      2
TOTAL   :    260
number of mbufs: 260
number of clusters: 7
number of interface pages: 0
number of free clusters: 7
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0

I did this several times and this one has the lowest number of free mbufs. Most 
of the time it hovers around 190 free.

> We have been looking at Ethernet errors such as collisions with
> "ifShow" on vxWorks, and with "netstat -i" on UNIX. You can also
> look for IP level errors with "ipstatShow" on vxWorks, and
> "netstat -s" on UNIX. Are you using a switched Ethernet?

Okay, as luck would have it, I did "ifShow" twice during one of the long 
outages without realizing it:

iocDRO162b> ifShow
ei (unit number 0):
     Flags: (0x63) UP BROADCAST ARP RUNNING 
     Internet address: 164.54.250.3
     Broadcast address: 164.54.251.255
     Netmask 0xffff0000 Subnetmask 0xfffffe00
     Ethernet address is 08:00:3e:24:ed:92
     Metric is 0
     Maximum Transfer Unit size is 1500
     11554152 packets received; 4515187 packets sent
     21759 input errors; 469370 output errors
     27758 collisions
lo (unit number 0):
     Flags: (0x69) UP LOOPBACK ARP RUNNING 
     Internet address: 127.0.0.1
     Netmask 0xff000000 Subnetmask 0xff000000
     Metric is 0
     Maximum Transfer Unit size is 4096
     80860 packets received; 80860 packets sent
     0 input errors; 0 output errors
     0 collisions
value = 18 = 0x12
iocDRO162b> ifShow
ei (unit number 0):
     Flags: (0x63) UP BROADCAST ARP RUNNING 
     Internet address: 164.54.250.3
     Broadcast address: 164.54.251.255
     Netmask 0xffff0000 Subnetmask 0xfffffe00
     Ethernet address is 08:00:3e:24:ed:92
     Metric is 0
     Maximum Transfer Unit size is 1500
     11556988 packets received; 4516733 packets sent
     21760 input errors; 469438 output errors
     27759 collisions
lo (unit number 0):
     Flags: (0x69) UP LOOPBACK ARP RUNNING 
     Internet address: 127.0.0.1
     Netmask 0xff000000 Subnetmask 0xff000000
     Metric is 0
     Maximum Transfer Unit size is 4096
     80898 packets received; 80898 packets sent
     0 input errors; 0 output errors
     0 collisions
value = 18 = 0x12

Then I was called from the control room and notified of the situation because 
they were starting to worry. While I was on the phone, the problem cleared. I 
did "ifShow" again:

iocDRO162b> ifShow
ei (unit number 0):
     Flags: (0x63) UP BROADCAST ARP RUNNING 
     Internet address: 164.54.250.3
     Broadcast address: 164.54.251.255
     Netmask 0xffff0000 Subnetmask 0xfffffe00
     Ethernet address is 08:00:3e:24:ed:92
     Metric is 0
     Maximum Transfer Unit size is 1500
     11563366 packets received; 4520084 packets sent
     21766 input errors; 469612 output errors
     27768 collisions
lo (unit number 0):
     Flags: (0x69) UP LOOPBACK ARP RUNNING 
     Internet address: 127.0.0.1
     Netmask 0xff000000 Subnetmask 0xff000000
     Metric is 0
     Maximum Transfer Unit size is 4096
     80978 packets received; 80978 packets sent
     0 input errors; 0 output errors
     0 collisions
value = 18 = 0x12

Mean anything to you?

Please let me know if any of this helps or if there's anything else I can do to 
help diagnose the problem. In the meantime, I'm still trying to rope in the 
network sniffer. If I find out anything else, I'll let you know.

Thanks for the help.

Garrett Rinehart
Intense Pulsed Neutron Source
Argonne National Laboratory
9700 S. Cass Ave
Argonne, IL  60439
(630)252-6561
