EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Inter-IOC link problems
From: Andrew Johnson <[email protected]>
To: [email protected]
Date: Thu, 12 Mar 2009 10:42:35 -0600

Jeff,

Could this be the beacon port network order problem that you fixed recently?  
That might explain why she's seeing no beacons from the gateway.

- Andrew

On Thursday 12 March 2009 11:12:56 Jeff Hill wrote:
> Emma,
>
> > There doesn't seem to be any other obvious problems that I can see (CPU
> > usage very low) - I've attached some of the console output.  I did a "tt"
> > on the dbca link thread but wasn't sure where to go from there - is there
> > anything else I should try before I reboot the IOC?
>
> First try to determine if it's an IP kernel related issue (you should see
> some aspects of TCP/UDP that are not working using protocols that are not
> CA if it's an IP kernel issue). Does telnet (verifying TCP) and ping
> (verifying IP) work with the IOC when it is in this state? If your vxWorks
> system has an echo server (listening on port seven) you could test UDP with
> that.
>
> Here is a talk by Dave Thomson (which has some info on diagnosing vxWorks
> buffer starvation related issues).
>
> http://www.diamond.ac.uk/CMSWeb/Downloads/diamond/Events/EPICS/MBUF_Problems.ppt
>
> This one might help also.
>
> http://www.xs4all.nl/~borkhuis/vxworks/troubleshooting.txt
>
> And here is some info on how to configure vxWorks to run EPICS.
>
> http://www.aps.anl.gov/epics/base/tornado.php?format=printer
>
>
> The output from ifShow, endPoolShow("name", 0), netStackDataPoolShow(),
> netStackSysPoolShow(), and maybe also udpStatShow are probably most likely
> to provide some hints at the cause of your problems if you are experiencing
> troubles with the vxWorks IP kernel (or below). The output from ifShow can
> be very interesting if there are low level media transmission errors.
>
> Look at the output from inetStatShow. In particular, look at TCP circuits
> that consistently indicate the same large number of bytes pending in their
> buffers (in multiple samples dumped with inetStatShow). Pending output
> bytes can indicate congestion problems with the IP kernel, network, routing
> system, and/or the server (possibly a CA server (GW or IOC) this IOC is
> connected to). Pending input bytes usually indicate issues with the code
> consuming bytes from the socket (in this case the CA client library).
>
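
The inetStatShow comparison described above can be sketched in Python on a host machine, working from console output captured to text. This is only a rough sketch: the column layout in the regular expression (PCB, Proto, Recv-Q, Send-Q, Local Address, Foreign Address) is an assumption about what inetStatShow prints and should be checked against your vxWorks version. The sample dumps, the IOC address 172.23.106.40, and the 1024-byte threshold are invented for illustration.

```python
import re

# Assumed row layout (netstat style), not taken from this thread:
# PCB      Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
ROW = re.compile(r"^\s*[0-9a-fA-F]+\s+(TCP|UDP)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)")

def parse_sample(text):
    """Return {(local, foreign): (recv_q, send_q)} for the TCP rows of one dump."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m and m.group(1) == "TCP":
            rows[(m.group(4), m.group(5))] = (int(m.group(2)), int(m.group(3)))
    return rows

def stuck_circuits(samples, min_bytes=1024):
    """Circuits showing the same large pending byte count in every sample."""
    parsed = [parse_sample(s) for s in samples]
    if not parsed:
        return []
    stuck = []
    for key in parsed[0]:
        if not all(key in p for p in parsed):
            continue  # circuit was not present in every dump
        recv = {p[key][0] for p in parsed}
        send = {p[key][1] for p in parsed}
        if len(send) == 1 and min(send) >= min_bytes:
            stuck.append((key, "send", min(send)))
        if len(recv) == 1 and min(recv) >= min_bytes:
            stuck.append((key, "recv", min(recv)))
    return stuck

# Invented sample dumps, taken some time apart.
HEADER = (
    "PCB      Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)\n"
    "-------- ----- ------ ------  ------------------ ------------------ -------\n"
)
SAMPLE_1 = HEADER + (
    "1f3a2c0  TCP        0   8192  172.23.106.40.1027 172.23.106.35.5064 ESTABLISHED\n"
    "1f3a1b8  TCP        0    300  172.23.106.40.1026 172.23.106.32.5064 ESTABLISHED\n"
)
SAMPLE_2 = HEADER + (
    "1f3a2c0  TCP        0   8192  172.23.106.40.1027 172.23.106.35.5064 ESTABLISHED\n"
    "1f3a1b8  TCP        0      0  172.23.106.40.1026 172.23.106.32.5064 ESTABLISHED\n"
)

if __name__ == "__main__":
    for circuit, direction, nbytes in stuck_circuits([SAMPLE_1, SAMPLE_2]):
        print(circuit, direction, nbytes)
```

A circuit flagged as "send" points at the congestion side of the description above (IP kernel, network, routing, or the server), while "recv" points at the code that should be consuming bytes from the socket.
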
> > > I would also look very closely at the output from dbcar at higher
> > > interest levels. As the interest level increases you should be able to
> > > see if CA thinks that the channel is connected or not (the output from
> > > void nciu::show ()). Of particular interest would be any situations
> > > where CA thinks the channel is connected, but the DB CA link code does
> > > not.
> > > Also look for situations where the DB CA Link code thinks that it's a
> > > CA link, but the CA channel hasn't been created (yet).
>
> I would definitely dump the output of dbcar when specifying a very high
> magnitude interest level (a level of 1000 should be sufficient) so that you
> see all of the gory details. We need to fault isolate so look for
> situations where the CA client library marks a particular channel as being
> connected, but the db ca link facility marks this channel as being
> disconnected. Also look for situations where a channel hasn't been created
> in the CA client library, but the db ca link facility considers the link to
> be a CA link, and of course the third possibility would be that the channel
> exists in the CA client library and both the CA client library and the DB
> CA link facility consider the channel to be disconnected.
>
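
The cases listed above can be written down as a small decision table, which may help when working through a long dbcar dump link by link. This is a minimal sketch: the function name and its three boolean inputs are hypothetical, and the values have to be read off the dbcar output by hand, since nothing here parses it.

```python
def suspect(ca_channel_exists, ca_connected, link_connected):
    """Classify one link from three facts read off the dbcar output by hand."""
    if not ca_channel_exists:
        return "no CA channel created for a link that db ca treats as a CA link"
    if ca_connected and not link_connected:
        return "CA client says connected, db ca link says disconnected"
    if not ca_connected and not link_connected:
        return "both agree the channel is disconnected"
    if ca_connected and link_connected:
        return "both agree the channel is connected"
    # Not one of the cases named in the message, listed for completeness.
    return "db ca link says connected, CA client says disconnected"

if __name__ == "__main__":
    print(suspect(True, True, False))
```
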
> If you can somehow capture the entire output from dbcar at interest level
> 1000, and send it to me in an email, I would be happy to have a look. One
> possibility would be to forward the output of the vxWorks command to a
> file. Also send the name of the channels that should be connected, but
> aren't.
>
> It will be time consuming, but you might also capture a tt from the thread
> running the db ca link facility, and hopefully also all of the threads
> managing the CA client context created for the db ca link facility. If you
> could send that information I might be able to determine what has happened.
> The tornado, host based debugging system, might help to automate the stack
> trace collection process.
>
> Jeff
>
> > -----Original Message-----
> > From: Shepherd, EL (Emma) [mailto:[email protected]]
> > Sent: Tuesday, March 10, 2009 9:12 AM
> > To: Jeff Hill
> > Subject: RE: Inter-IOC link problems
> >
> > Hi Jeff,
> >
> > You may remember this problem I reported on tech-talk a little while ago.
> > It has occurred again, and I have managed to do a little more debugging.
> > I loaded a standalone CA client as you suggested and it works fine, so it
> > appears that it is not a global CA issue.
> >
> > There doesn't seem to be any other obvious problems that I can see (CPU
> > usage very low) - I've attached some of the console output.  I did a "tt"
> > on the dbca link thread but wasn't sure where to go from there - is there
> > anything else I should try before I reboot the IOC?
> >
> > Thanks again for your help,
> >
> > Emma
> >
> > Emma Shepherd
> > Software Systems Engineer
> > Beamline Controls - I06, I07, I24
> >
> > +44 (0)1235-778235
> > http://www.diamond.ac.uk
> >
> > > -----Original Message-----
> > > From: Jeff Hill [mailto:[email protected]]
> > > Sent: 20 October 2008 17:24
> > > To: Shepherd, EL (Emma); [email protected]
> > > Subject: RE: Inter-IOC link problems
> > >
> > >
> > > Presumably, the IP stack on this IOC is operating correctly when this
> > > happens - as verified by {telnet, ping, ifShow, ...}?
> > >
> > > When this occurs, you might try running a small standalone CA client
> > > > that you have dynamically loaded into vxWorks. It's best to spawn this
> > > type of client so that a CA context will not end up getting attached
> > > to the vxWorks shell. The intent of course would be to isolate between
> > > a global CA issue, and one that is isolated to the CA client / DB CA
> > > Link code combination.
> > >
> > > I would also look very closely at the output from dbcar at higher
> > > interest levels. As the interest level increases you should be able to
> > > see if CA thinks that the channel is connected or not (the output from
> > > void nciu::show ()). Of particular interest would be any situations
> > > where CA thinks the channel is connected, but the DB CA link code does
> > > not.
> > > Also look for situations where the DB CA Link code thinks that it's a
> > > CA link, but the CA channel hasn't been created (yet).
> > >
> > > Also, do a "tt" on the DBCA Link thread, and the satellite threads for
> > > its CA context. Look for any situations where threads are hanging
> > > around in unusual places which might indicate some form of deadlock.
> > > If you see anything out of the ordinary please send the tt output and
> > > > I will have a look. In lightly loaded situations, "out of the
> > > > ordinary" usually means a thread that isn't parked in the normal
> > > > place (as seen by snapshots with tt) for an extended length of time.
> > > > One of course needs to compare tt output from when the IOC is normal
> > > > to tt output from when the IOC is misbehaving.
> > > Needless to say, a CPU starvation situation on this IOC would also
> > > cause issues (could be the cause of your issue).
> > >
> > > In the past, quite some years back actually, I have seen UDP issues if
> > > there were too many machines on a network with the wrong subnet mask
> > > configuration. I think that there used to be some issues in particular
> > > with HP workstations because they would reply with "ICMP network
> > > unreachable" if their network mask was set incorrectly and this could
> > > cause the IOC's search response to be discarded off the end of the
> > > finite length UDP input queue (depending on which response got there
> > > first and how many bogus ICMP messages are sent in response to each
> > > search request). ICMP traffic can be seen with Ethernet snoopers like
> > > wireshark or tcpdump. However, on modern switched networks, it may be
> > > best to be on the same hub (not a switch) with the IOC so that you can
> > > see unicast traffic that the switch sends only between the IOC and its
> > > message peers. Admittedly, this is perhaps contraindicated based on
> > > your not seeing any search traffic from the IOC in casnooper.
> > >
> > > > You might have a look at the output from udpStatShow (presuming that
> > > something is wrong with UDP and not IP).
> > > Also, have a look at ifShow and verify that the broadcast address
> > > remains correctly configured, and that there are not high error rates.
> > >
> > > Jeff
> > >
> > > > -----Original Message-----
> > > > From: [email protected]
> > > > [mailto:[email protected]]
> > > > On Behalf Of Shepherd, EL (Emma)
> > > > Sent: Friday, October 17, 2008 9:12 AM
> > > > To: [email protected]
> > > > Subject: RE: Inter-IOC link problems
> > > >
> > > > > I've done a little more investigation and I think that in this case
> > > > > the gateway is not to blame.  It seems that other CA links on this IOC
> > > > > are also not working, and they are not all going through the gateway
> > > > > (some are on other IOCs on the same network).
> > > > >
> > > > > I setup caSnooper to monitor connection requests on one of the PVs my
> > > > > IOC is failing to link to.  When I change the link to a constant and
> > > > > change it back again, caSnooper does not report any new requests for
> > > > > the PV.  However when I do the same on a 'healthy' IOC which has
> > > > > working links, I see the new request on caSnooper when I put the link
> > > > > back.
> > > > >
> > > > > I'm not sure what that tells me except that it looks like the IOC has
> > > > > somehow stopped broadcasting search requests..?
> > > >
> > > > Emma
> > > >
> > > > > -----Original Message-----
> > > > > From: [email protected]
> > > > > [mailto:[email protected]] On Behalf Of Shepherd, EL
> > > > > (Emma)
> > > > > Sent: 17 October 2008 12:28
> > > > > To: Ralph Lange
> > > > > Cc: [email protected]
> > > > > Subject: RE: Inter-IOC link problems
> > > > >
> > > > >
> > > > > Hi there,
> > > > >
> > > > > Thanks for the replies, it seems that the 'undefined' entry might
> > > > > have been a red herring.
> > > > >
> > > > > > The IOC I am looking at is the client of the PV connection, and the
> > > > > > IP address listed is the server side of the CA gateway.  There are
> > > > > > in fact two gateways on this machine - one for each direction as you
> > > > > > suggested. The configuration is really very simple, it is setup to
> > > > > > allow read access for all PVs.  Do you need to know anything more
> > > > > > specific?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Emma
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Ralph Lange [mailto:[email protected]]
> > > > > > Sent: 17 October 2008 08:52
> > > > > > To: Shepherd, EL (Emma)
> > > > > > Cc: [email protected]
> > > > > > Subject: Re: Inter-IOC link problems
> > > > > >
> > > > > >
> > > > > > Hi Emma,
> > > > > >
> > > > > > > I would need a bit more information about your setup to be able to
> > > > > > > fully understand your report.
> > > > > > >
> > > > > > > You are looking at the CA client side of an IOC. When you are
> > > > > > > losing connections between IOCs, is the IOC you're looking at the
> > > > > > > server or the client of that PV connection?
> > > > > > > It seems there are no beacons coming from the CA Gateway
> > > > > > > (172.23.106.35). Is that the client side or the server side of the
> > > > > > > CA Gateway? Are two (or more) Gateway processes running on that
> > > > > > > machine (i.e. one for each direction)? What is the CA configuration
> > > > > > > for the Gateway(s) on that machine?
> > > > > >
> > > > > > > CA configuration of a Gateway is difficult and subtle.  There are a
> > > > > > > lot of environment variables for CA server and client (see the CA
> > > > > > > Manual) which influence the behaviour of a CA application. Some
> > > > > > > variables are using other variables' values as default, which
> > > > > > > simplifies configuration of pure CA client or server apps, but may
> > > > > > > lead to unwanted behaviour for a CA Gateway (which is one of the few
> > > > > > > apps that is both CA server and client). E.g., it is quite easy to
> > > > > > > create a setup where the Gateway is sending out beacons on the wrong
> > > > > > > (i.e. client) side.
> > > > > >
> > > > > > Cheers,
> > > > > > Ralph
> > > > > >
> > > > > > Shepherd, EL (Emma) wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > > We still seem to suffer quite a bit from problems with database
> > > > > > > > links between IOCs, particularly when a gateway is involved.  For
> > > > > > > > some reason the links can become disconnected and a reboot is
> > > > > > > > usually necessary to get them working again.  I have just had an
> > > > > > > > opportunity to do some diagnosis on one such problem and found a
> > > > > > > > clue in the CA beacon hashtable part of the dbcar report.  The
> > > > > > > > entry for the gateway (172.23.106.35) is 'undefined', although the
> > > > > > > > gateway itself seems to be working just fine and I can use caget
> > > > > > > > through the gateway as normal.
> > > > > > > >
> > > > > > > > Any ideas what could cause this to happen, or how to fix it when it
> > > > > > > > does?  None of the tasks are suspended, CPU usage is low and
> > > > > > > > everything else looks fine.
> > > > > > >
> > > > > > > > CA beacon hash entry for 172.23.106.32:5064 with period estimate 15.000521
> > > > > > > >         beacon number 168436, on THU OCT 16 2008 14:27:46
> > > > > > > > CA beacon hash entry for 172.23.106.35:5064 <no period estimate>
> > > > > > > >         beacon number 0, on <undefined>
> > > > > > > > CA beacon hash entry for 172.23.106.97:5064 with period estimate 14.988265
> > > > > > > >         beacon number 76356, on THU OCT 16 2008 14:27:52
> > > > > > > > CA beacon hash entry for 172.23.106.96:5064 with period estimate 14.988637
> > > > > > > >         beacon number 39491, on THU OCT 16 2008 14:27:53
> > > > > > > > CA beacon hash entry for 172.23.106.98:5064 with period estimate 14.980477
> > > > > > > >         beacon number 58989, on THU OCT 16 2008 14:27:47
> > > > > > > > CA beacon hash entry for 172.23.106.102:5064 with period estimate 14.990867
> > > > > > > >         beacon number 39993, on THU OCT 16 2008 14:27:53
> > > > > > > > CA beacon hash entry for 172.23.106.32:5064 with period estimate 15.000521
> > > > > > > >         beacon number 168436, on THU OCT 16 2008 14:27:46
> > > > > > > > CA beacon hash entry for 172.23.106.35:5064 <no period estimate>
> > > > > > > >         beacon number 0, on <undefined>
> > > > > > > > CA beacon hash entry for 172.23.106.97:5064 with period estimate 14.988265
> > > > > > > >         beacon number 76356, on THU OCT 16 2008 14:27:52
> > > > > > > > CA beacon hash entry for 172.23.106.96:5064 with period estimate 14.988637
> > > > > > > >         beacon number 39491, on THU OCT 16 2008 14:27:53
> > > > > > > > CA beacon hash entry for 172.23.106.98:5064 with period estimate 14.980477
> > > > > > > >         beacon number 58989, on THU OCT 16 2008 14:27:47
> > > > > > > > CA beacon hash entry for 172.23.106.102:5064 with period estimate 14.990867
> > > > > > > >         beacon number 39993, on THU OCT 16 2008 14:27:53
> > > > > > > > entries per bucket: mean = 0.011719 std dev = 0.107617 max = 1
> > > > > > >
> > > > > > >
> > > > > > > Thanks in advance....
> > > > > > >
> > > > > > > Emma
> > > > >
> > > > > <DIV><FONT size="1" color="gray">This e-mail and any
> > >
> > > attachments may
> > >
> > > > > contain confidential, copyright and or privileged
> > >
> > > material, and are
> > >
> > > > > for the use of the intended addressee only. If you are not the
> > > > > intended addressee or an authorised recipient of the addressee
> > > > > please notify us of receipt by returning the e-mail and
> > >
> > > do not use,
> > >
> > > > > copy, retain, distribute or disclose the information in
> > >
> > > or attached
> > >
> > > > > to the e-mail. Any opinions expressed within this e-mail are those
> > > > > of the individual and not necessarily of Diamond Light Source Ltd.
> > > > > Diamond Light Source Ltd. cannot guarantee that this e-mail or any
> > > > > attachments are free from viruses and we cannot accept liability
> > > > > for any damage which you may sustain as a result of software
> > > > > viruses which may be transmitted in or with the message. Diamond
> > > > > Light Source Limited (company no. 4375679).
> > > > > Registered in England and Wales with its registered office at
> > > > > Diamond House, Harwell Science and Innovation Campus, Didcot,
> > > > > Oxfordshire, OX11 0DE, United Kingdom </FONT></DIV>
> > > >
> > > > <DIV><FONT size="1" color="gray">This e-mail and any
> > >
> > > attachments may
> > >
> > > > contain confidential, copyright and or privileged material, and are
> > > > for
> > >
> > > the
> > >
> > > > use of the intended addressee only. If you are not the intended
> > > > addressee or an authorised recipient of the addressee
> > >
> > > please notify us
> > >
> > > > of receipt by returning the e-mail and do not use, copy, retain,
> > > > distribute or disclose the information in or attached to
> > >
> > > the e-mail.
> > >
> > > > Any opinions expressed within this e-mail are those of the
> > >
> > > individual
> > >
> > > > and not necessarily of Diamond Light Source Ltd. Diamond
> > >
> > > Light Source
> > >
> > > > Ltd. cannot guarantee that this e-mail or any attachments are free
> > > > from viruses and we cannot accept liability for any damage
> > >
> > > which you
> > >
> > > > may sustain as a result of software viruses which may be
> > >
> > > transmitted
> > >
> > > > in or with the message. Diamond Light Source Limited (company no.
> > > > 4375679). Registered in England and Wales with its
> > >
> > > registered office
> > >
> > > > at Diamond House, Harwell Science and Innovation Campus, Didcot,
> > > > Oxfordshire, OX11 0DE, United Kingdom </FONT></DIV>
> >
> > <DIV><FONT size="1" color="gray">This e-mail and any attachments may
> > contain confidential, copyright and or privileged material, and are for
> > the use of the intended addressee only. If you are not the intended
> > addressee or an authorised recipient of the addressee please notify us of
> > receipt by returning the e-mail and do not use, copy, retain, distribute
> > or disclose the information in or attached to the e-mail.
> > Any opinions expressed within this e-mail are those of the individual and
> > not necessarily of Diamond Light Source Ltd.
> > Diamond Light Source Ltd. cannot guarantee that this e-mail or any
> > attachments are free from viruses and we cannot accept liability for any
> > damage which you may sustain as a result of software viruses which may be
> > transmitted in or with the message.
> > Diamond Light Source Limited (company no. 4375679). Registered in England
> > and Wales with its registered office at Diamond House, Harwell Science
> > and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
> > </FONT></DIV>
> >
> > --
> >
> > Scanned by iCritical.



-- 
The best FOSS code is written to be read by other humans -- Harold Welte
