EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: Inter-IOC link problems
From: "Jeff Hill" <[email protected]>
To: "'Shepherd, EL \(Emma\)'" <[email protected]>
Cc: [email protected]
Date: Thu, 12 Mar 2009 10:12:56 -0600

Emma,

> There doesn't seem to be any other obvious problems that I can see (CPU
> usage very low) - I've attached some of the console output.  I did a "tt"
> on the dbca link thread but wasn't sure where to go from there - is there
> anything else I should try before I reboot the IOC?

First, try to determine whether this is an IP kernel issue. If it is, you
should see some aspects of TCP/UDP failing with protocols other than CA as
well. Do telnet (verifying TCP) and ping (verifying IP) work with the IOC
when it is in this state? If your vxWorks system has an echo server
(listening on port 7), you could test UDP with that.
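
As a rough sketch of the UDP check, assuming the IOC really does run an echo
service on port 7 (many vxWorks images do not include one), something like
the following could be run from a host on the same subnet. The function name
udp_echo_probe, the payload, and the default timeout are arbitrary choices
for illustration, and Python is used only for convenience.

```python
import socket


def udp_echo_probe(host, port=7, payload=b"epics-udp-probe", timeout=2.0):
    """Send one datagram and report whether the same bytes come back.

    Returns True if an identical reply arrives within `timeout` seconds,
    False on timeout, a mismatched reply, or any socket error.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            reply, _addr = sock.recvfrom(len(payload) + 64)
        except OSError:  # socket.timeout is a subclass of OSError
            return False
    return reply == payload
```

A False result only says that no echo came back. It does not distinguish a
missing echo service from a UDP problem, so it is worth running the same
probe against a healthy IOC for comparison.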

Here is a talk by Dave Thomson (which has some info on diagnosing vxWorks
buffer starvation related issues).

http://www.diamond.ac.uk/CMSWeb/Downloads/diamond/Events/EPICS/MBUF_Problems.ppt

This one might help also.

http://www.xs4all.nl/~borkhuis/vxworks/troubleshooting.txt

And here is some info on how to configure vxWorks to run EPICS.

http://www.aps.anl.gov/epics/base/tornado.php?format=printer


The output from ifShow, endPoolShow("name", 0), netStackDataPoolShow(),
netStackSysPoolShow(), and maybe also udpStatShow is most likely to provide
some hints at the cause of your problems if you are experiencing troubles
with the vxWorks IP kernel (or below). The output from ifShow can be very
interesting if there are low-level media transmission errors.

Look at the output from inetStatShow. In particular, look for TCP circuits
that consistently show the same large number of bytes pending in their
buffers (across multiple samples dumped with inetStatShow). Pending output
bytes can indicate congestion problems with the IP kernel, the network, the
routing system, and/or the server (possibly a CA server, whether gateway or
IOC, that this IOC is connected to). Pending input bytes usually indicate
issues with the code consuming bytes from the socket (in this case the CA
client library).
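
To make the comparison across samples less error prone, the captured console
text could be post-processed on a host. The sketch below assumes each socket
row has the columns PCB, Proto, Recv-Q, Send-Q, Local Address, Foreign
Address, (state). That layout is an assumption, so adjust the field indices
to match your own capture. The function names and the 1000 byte threshold
are arbitrary choices for illustration.

```python
def parse_inet_stat(text):
    """Return {(proto, local, foreign): (recv_q, send_q)} for one sample.

    Assumed row layout: PCB Proto Recv-Q Send-Q Local Foreign (state).
    Header and separator lines are skipped because they fail the checks.
    """
    rows = {}
    for line in text.splitlines():
        f = line.split()
        if (len(f) >= 6 and f[1] in ("TCP", "UDP")
                and f[2].isdigit() and f[3].isdigit()):
            rows[(f[1], f[4], f[5])] = (int(f[2]), int(f[3]))
    return rows


def stuck_circuits(samples, threshold=1000):
    """Find circuits whose queue sizes are large and identical in every sample.

    `samples` is a list of console captures taken some seconds apart.
    Returns {(proto, local, foreign): (recv_q, send_q)}.
    """
    parsed = [parse_inet_stat(s) for s in samples]
    if len(parsed) < 2:
        return {}  # a single sample cannot show that a queue is not draining
    stuck = {}
    for key, first in parsed[0].items():
        unchanged = all(p.get(key) == first for p in parsed[1:])
        if unchanged and max(first) >= threshold:
            stuck[key] = first
    return stuck
```

A circuit reported with a large second number (Send-Q) points toward the
output side, and one with a large first number (Recv-Q) points toward the
code reading the socket.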

> > I would also look very closely at the output from dbcar at higher
> > interest levels. As the interest level increases you should be able to
> > see if CA thinks that the channel is connected or not (the output from
> > void nciu::show ()). Of particular interest would be any situations
> > where CA thinks the channel is connected, but the DB CA link code does
> > not.
> > Also look for situations where the DB CA Link code thinks that it's a
> > CA link, but the CA channel hasn't been created (yet).

I would definitely dump the output of dbcar at a very high interest level
(a level of 1000 should be sufficient) so that you see all of the gory
details. We need to isolate the fault, so look for three situations. First,
the CA client library marks a particular channel as connected, but the db ca
link facility marks it as disconnected. Second, a channel hasn't been
created in the CA client library, but the db ca link facility considers the
link to be a CA link. Third, the channel exists in the CA client library,
and both the CA client library and the db ca link facility consider it to
be disconnected.
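
When working through a long dbcar dump by hand, it may help to record three
yes/no observations per link and sort them into the cases above. The sketch
below is only a bookkeeping aid. The flag names and the returned labels are
hypothetical and are not anything dbcar prints, and the last branch covers a
combination that is not one of the cases listed above.

```python
def classify_link(channel_created, ca_connected, link_connected):
    """Map one CA link's observed state onto the cases described above.

    channel_created: a channel for this link exists in the CA client library
    ca_connected:    the CA client library reports the channel as connected
    link_connected:  the db ca link facility reports the link as connected
    """
    if not channel_created:
        return "CA link with no CA channel created"
    if ca_connected and not link_connected:
        return "CA connected, DB CA link disconnected"
    if not ca_connected and not link_connected:
        return "channel exists, both sides disconnected"
    if ca_connected and link_connected:
        return "consistent: both sides connected"
    # Not one of the listed cases; kept so every combination gets a label.
    return "CA disconnected, DB CA link connected"
```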

If you can somehow capture the entire output from dbcar at interest level
1000 and send it to me in an email, I would be happy to have a look. One
possibility would be to redirect the output of the vxWorks command to a file.
Also send the names of the channels that should be connected but aren't.

It will be time consuming, but you might also capture a tt from the thread
running the db ca link facility, and hopefully also from all of the threads
managing the CA client context created for the db ca link facility. If you
could send that information I might be able to determine what has happened.
The Tornado host-based debugging system might help to automate the stack
trace collection process.

Jeff

> -----Original Message-----
> From: Shepherd, EL (Emma) [mailto:[email protected]]
> Sent: Tuesday, March 10, 2009 9:12 AM
> To: Jeff Hill
> Subject: RE: Inter-IOC link problems
> 
> Hi Jeff,
> 
> You may remember this problem I reported on tech-talk a little while ago.
> It has occurred again, and I have managed to do a little more debugging.
> I loaded a standalone CA client as you suggested and it works fine, so it
> appears that it is not a global CA issue.
> 
> There doesn't seem to be any other obvious problems that I can see (CPU
> usage very low) - I've attached some of the console output.  I did a "tt"
> on the dbca link thread but wasn't sure where to go from there - is there
> anything else I should try before I reboot the IOC?
> 
> Thanks again for your help,
> 
> Emma
> 
> Emma Shepherd
> Software Systems Engineer
> Beamline Controls - I06, I07, I24
> 
> +44 (0)1235-778235
> http://www.diamond.ac.uk
> 
> > -----Original Message-----
> > From: Jeff Hill [mailto:[email protected]]
> > Sent: 20 October 2008 17:24
> > To: Shepherd, EL (Emma); [email protected]
> > Subject: RE: Inter-IOC link problems
> >
> >
> > Presumably, the IP stack on this IOC is operating correctly when this
> > happens - as verified by {telnet, ping, ifShow, ...}?
> >
> > When this occurs, you might try running a small standalone CA client
> > that you have dynamically loaded into vxWorks. It's best to spawn this
> > type of client so that a CA context will not end up getting attached
> > to the vxWorks shell. The intent of course would be to isolate between
> > a global CA issue, and one that is isolated to the CA client / DB CA
> > Link code combination.
> >
> > I would also look very closely at the output from dbcar at higher
> > interest levels. As the interest level increases you should be able to
> > see if CA thinks that the channel is connected or not (the output from
> > void nciu::show ()). Of particular interest would be any situations
> > where CA thinks the channel is connected, but the DB CA link code does
> > not.
> > Also look for situations where the DB CA Link code thinks that it's a
> > CA link, but the CA channel hasn't been created (yet).
> >
> > Also, do a "tt" on the DBCA Link thread, and the satellite threads for
> > its CA context. Look for any situations where threads are hanging
> > around in unusual places which might indicate some form of deadlock.
> > If you see anything out of the ordinary please send the tt output and
> > I will have a look. In lightly loaded situations, "out of the
> > ordinary" usually means a thread that isn't parked in the normal
> > place (as seen by snapshots with tt) for an extended length of
> > time. One of course
> > needs to compare tt output from when the IOC is normal to tt output
> > from when the IOC is misbehaving.
> > Needless to say, a CPU starvation situation on this IOC would also
> > cause issues (could be the cause of your issue).
> >
> > In the past, quite some years back actually, I have seen UDP issues if
> > there were too many machines on a network with the wrong subnet mask
> > configuration. I think that there used to be some issues in particular
> > with HP workstations because they would reply with "ICMP network
> > unreachable" if their network mask was set incorrectly and this could
> > cause the IOC's search response to be discarded off the end of the
> > finite length UDP input queue (depending on which response got there
> > first and how many bogus ICMP messages are sent in response to each
> > search request). ICMP traffic can be seen with Ethernet snoopers like
> > wireshark or tcpdump. However, on modern switched networks, it may be
> > best to be on the same hub (not a switch) with the IOC so that you can
> > see unicast traffic that the switch sends only between the IOC and its
> > message peers. Admittedly, this is perhaps contraindicated based on
> > your not seeing any search traffic from the IOC in casnooper.
> >
> > You might have a look at the output from udpStatShow (presuming that
> > something is wrong with UDP and not IP).
> > Also, have a look at ifShow and verify that the broadcast address
> > remains correctly configured, and that there are not high error rates.
> >
> > Jeff
> >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]]
> > > On Behalf Of Shepherd, EL (Emma)
> > > Sent: Friday, October 17, 2008 9:12 AM
> > > To: [email protected]
> > > Subject: RE: Inter-IOC link problems
> > >
> > > I've done a little more investigation and I think that in this case
> > > the gateway is not to blame.  It seems that other CA links on this
> > > IOC are also not working, and they are not all going through the
> > > gateway (some are on other IOCs on the same network).
> > >
> > > I set up caSnooper to monitor connection requests on one of the PVs
> > > my IOC is failing to link to.  When I change the link to a constant
> > > and change it back again, caSnooper does not report any new requests
> > > for the PV.  However when I do the same on a 'healthy' IOC which has
> > > working links, I see the new request on caSnooper when I put the
> > > link back.
> > >
> > > I'm not sure what that tells me except that it looks like the IOC
> > > has somehow stopped broadcasting search requests..?
> > >
> > > Emma
> > >
> > > > -----Original Message-----
> > > > From: [email protected]
> > > > [mailto:[email protected]] On Behalf Of Shepherd, EL
> > > > (Emma)
> > > > Sent: 17 October 2008 12:28
> > > > To: Ralph Lange
> > > > Cc: [email protected]
> > > > Subject: RE: Inter-IOC link problems
> > > >
> > > >
> > > > Hi there,
> > > >
> > > > Thanks for the replies, it seems that the 'undefined' entry might
> > > > have been a red herring.
> > > >
> > > > The IOC I am looking at is the client of the PV connection, and the
> > > > IP address listed is the server side of the CA gateway.  There are
> > > > in fact two gateways on this machine - one for each direction as
> > > > you suggested. The configuration is really very simple, it is set
> > > > up to allow read access for all PVs.  Do you need to know anything
> > > > more specific?
> > > >
> > > > Cheers,
> > > >
> > > > Emma
> > > >
> > > > > -----Original Message-----
> > > > > From: Ralph Lange [mailto:[email protected]]
> > > > > Sent: 17 October 2008 08:52
> > > > > To: Shepherd, EL (Emma)
> > > > > Cc: [email protected]
> > > > > Subject: Re: Inter-IOC link problems
> > > > >
> > > > >
> > > > > Hi Emma,
> > > > >
> > > > > I would need a bit more information about your setup to be able to
> > > > > fully understand your report.
> > > > >
> > > > > You are looking at the CA client side of an IOC. When you are
> > > > > losing connections between IOCs, is the IOC you're looking at the
> > > > > server or the client of that PV connection?
> > > > > It seems there are no beacons coming from the CA Gateway
> > > > > (172.23.106.35). Is that the client side or the server side of the
> > > > > CA Gateway? Are two (or more) Gateway processes running on that
> > > > > machine (i.e. one for each direction)? What is the CA configuration
> > > > > for the Gateway(s) on that machine?
> > > > >
> > > > > CA configuration of a Gateway is difficult and subtle. There are a
> > > > > lot of environment variables for CA server and client (see the CA
> > > > > Manual) which influence the behaviour of a CA application. Some
> > > > > variables are using other variables' values as default, which
> > > > > simplifies configuration of pure CA client or server apps, but may
> > > > > lead to unwanted behaviour for a CA Gateway (which is one of the
> > > > > few apps that is both a CA server and a client). E.g., it is quite
> > > > > easy to create a setup where the Gateway is sending out beacons on
> > > > > the wrong (i.e. client) side.
> > > > >
> > > > > Cheers,
> > > > > Ralph
> > > > >
> > > > >
> > > > > Shepherd, EL (Emma) wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > We still seem to suffer quite a bit from problems with database
> > > > > > links between IOCs, particularly when a gateway is involved.  For
> > > > > > some reason the links can become disconnected and a reboot is
> > > > > > usually necessary to get them working again.  I have just had an
> > > > > > opportunity to do some diagnosis on one such problem and found a
> > > > > > clue in the CA beacon hashtable part of the dbcar report.  The
> > > > > > entry for the gateway (172.23.106.35) is 'undefined', although the
> > > > > > gateway itself seems to be working just fine and I can use caget
> > > > > > through the gateway as normal.
> > > > > >
> > > > > > Any ideas what could cause this to happen, or how to fix it when
> > > > > > it does?  None of the tasks are suspended, CPU usage is low and
> > > > > > everything else looks fine.
> > > > > >
> > > > > > CA beacon hash entry for 172.23.106.32:5064 with period estimate 15.000521
> > > > > >         beacon number 168436, on THU OCT 16 2008 14:27:46
> > > > > > CA beacon hash entry for 172.23.106.35:5064 <no period estimate>
> > > > > >         beacon number 0, on <undefined>
> > > > > > CA beacon hash entry for 172.23.106.97:5064 with period estimate 14.988265
> > > > > >         beacon number 76356, on THU OCT 16 2008 14:27:52
> > > > > > CA beacon hash entry for 172.23.106.96:5064 with period estimate 14.988637
> > > > > >         beacon number 39491, on THU OCT 16 2008 14:27:53
> > > > > > CA beacon hash entry for 172.23.106.98:5064 with period estimate 14.980477
> > > > > >         beacon number 58989, on THU OCT 16 2008 14:27:47
> > > > > > CA beacon hash entry for 172.23.106.102:5064 with period estimate 14.990867
> > > > > >         beacon number 39993, on THU OCT 16 2008 14:27:53
> > > > > > CA beacon hash entry for 172.23.106.32:5064 with period estimate 15.000521
> > > > > >         beacon number 168436, on THU OCT 16 2008 14:27:46
> > > > > > CA beacon hash entry for 172.23.106.35:5064 <no period estimate>
> > > > > >         beacon number 0, on <undefined>
> > > > > > CA beacon hash entry for 172.23.106.97:5064 with period estimate 14.988265
> > > > > >         beacon number 76356, on THU OCT 16 2008 14:27:52
> > > > > > CA beacon hash entry for 172.23.106.96:5064 with period estimate 14.988637
> > > > > >         beacon number 39491, on THU OCT 16 2008 14:27:53
> > > > > > CA beacon hash entry for 172.23.106.98:5064 with period estimate 14.980477
> > > > > >         beacon number 58989, on THU OCT 16 2008 14:27:47
> > > > > > CA beacon hash entry for 172.23.106.102:5064 with period estimate 14.990867
> > > > > >         beacon number 39993, on THU OCT 16 2008 14:27:53
> > > > > > entries per bucket: mean = 0.011719 std dev = 0.107617 max = 1
> > > > > >
> > > > > >
> > > > > > Thanks in advance....
> > > > > >
> > > > > > Emma
> > > > > >
> > > > >
> > > > This e-mail and any attachments may contain confidential, copyright
> > > > and or privileged material, and are for the use of the intended
> > > > addressee only. If you are not the intended addressee or an
> > > > authorised recipient of the addressee please notify us of receipt by
> > > > returning the e-mail and do not use, copy, retain, distribute or
> > > > disclose the information in or attached to the e-mail. Any opinions
> > > > expressed within this e-mail are those of the individual and not
> > > > necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd.
> > > > cannot guarantee that this e-mail or any attachments are free from
> > > > viruses and we cannot accept liability for any damage which you may
> > > > sustain as a result of software viruses which may be transmitted in
> > > > or with the message. Diamond Light Source Limited (company no.
> > > > 4375679). Registered in England and Wales with its registered office
> > > > at Diamond House, Harwell Science and Innovation Campus, Didcot,
> > > > Oxfordshire, OX11 0DE, United Kingdom

