EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Channel Access monitoring tools
From: Bill Lavender <[email protected]>
To: [email protected], [email protected]
Date: Mon, 14 May 2012 18:50:26 -0500
On Mon, May 14, 2012 at 03:29:41PM +0000, Hill, Jeff wrote:
> Hi,
> 
> > 2.  Call ca_pend_io() to make sure the request is sent on its way.
> 
> ca_pend_io() also does an implicit flush, but it is not typically
> used only for that purpose if a flush is all that is needed. It also turns out
> that the UDP-based search messages are sent periodically, and CA internally
> takes care of flushing them out (when you periodically call ca_poll()).

OK.  I will do ca_flush_io() at that point instead.
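
As a rough illustration of the connect-and-wait sequence with a plain flush in place of ca_pend_io(), here is a minimal Python sketch of the control flow only. It does not call Channel Access: the five callables are hypothetical stand-ins for ca_create_channel(), ca_flush_io(), ca_poll(), the flag set by the connection handler, and ca_clear_channel(), and the timeout, attempt count, and poll interval are assumed values.

```python
import time
from typing import Callable


def connect_with_retries(
    create_channel: Callable[[], None],  # stand-in for ca_create_channel()
    flush: Callable[[], None],           # stand-in for ca_flush_io()
    poll: Callable[[], None],            # stand-in for ca_poll()
    is_connected: Callable[[], bool],    # reads the flag set by the connection handler
    clear_channel: Callable[[], None],   # stand-in for ca_clear_channel()
    timeout: float = 1.0,                # assumed per-attempt timeout in seconds
    max_attempts: int = 3,               # assumed outer retry limit
    poll_interval: float = 0.01,         # assumed sleep between polls in seconds
) -> bool:
    """Return True once connected, False after max_attempts timeouts."""
    for _ in range(max_attempts):
        create_channel()
        flush()  # send the request on its way
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            poll()  # lets the client library run its callbacks
            if is_connected():
                return True
            time.sleep(poll_interval)
        # This attempt timed out: discard the unconnected channel and retry.
        clear_channel()
        flush()
    return False
```

Using time.monotonic() for the deadline keeps the timeout unaffected by wall-clock adjustments on the client machine.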
 
> > At present, the only client platform that I have that has these timeouts
> > is Debian 6.0 Linux (Squeeze).  The same hardware running Debian 5.0
> > did not have these problems.  The clients are using EPICS Base 3.14.10
> > and the IOC is using 3.14.12.1.
> 
> Are both installations the same word size (i.e. both 32-bit or both 64-bit)?
> Is gcc at a significantly different version on the two OSes? There might also
> be something changing in the NIC driver between Debian versions.

Yes, they are all running 32-bit operating systems.  Here is how they
are configured:

  Distribution          Kernel    GCC    Ethernet card  Kernel driver
Debian 6.0 (Squeeze)    2.6.32   4.4.5   nVidia CK804     forcedeth
Debian 5.0 (Lenny)      2.6.26   4.3.2    VIA VT6102      via-rhine

The two Debian 6.0 machines are the user control workstations for two
different APS beamlines.  The Debian 5.0 machine was set up specifically
for this test.  One thing that just occurred to me is that I do not know
whether the current beamline control computers are old beamline control
computers upgraded to Debian 6.0 or whether they are brand new computers.
I am not involved in administering those computers, so I will have to ask
the people who do that, since that may be the important difference here.

> > I am assuming that what I need to do here is to monitor the network
> > traffic between the Debian 6.0 machine and the IOC and compare it
> > to the traffic between a Debian 5.0 machine and the same IOC.  
> 
> Do you see these same issues when changing EPICS_CA_AUTO_ADDR_LIST to
> NO and then manually configuring the server's address(es) in
> EPICS_CA_ADDR_LIST? (This would fault-isolate an issue with broadcasting.)
> It might be interesting to monitor ICMP (IP error diagnostic) traffic 
> on your network with Wireshark or some other Ethernet sniffer. You could
> look at all ICMP messages and also only the ICMP messages that have
> the CA client's host for a destination address. I have seen problems here 
> getting a client to connect reliably when there were too many hosts
> on the LAN with misconfigured subnet masks. There were so many ICMP
> error responses to each search broadcast that the finite length UDP 
> input queue was saturating. 

I do intend to pursue this using Wireshark.  In fact, my original reason
for asking a question was to find out whether or not there are other tools
(perhaps EPICS-specific ones?) besides Wireshark.  One possible glitch is
that the beamline computers are supposed to use a netmask of 255.255.255.128.
If some computers are mistakenly configured with 255.255.255.0, then that
might produce the ICMP errors you are talking about.
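
To make the netmask point concrete, here is a small sketch using Python's standard ipaddress module. It only shows that the two masks give the same host different broadcast addresses; it does not show that ICMP errors actually result. The address 192.0.2.10 is a hypothetical documentation address, not one of the beamline machines.

```python
import ipaddress


def broadcast_for(host: str, netmask: str) -> ipaddress.IPv4Address:
    """Return the broadcast address a host would use with the given netmask."""
    iface = ipaddress.IPv4Interface(f"{host}/{netmask}")
    return iface.network.broadcast_address


HOST = "192.0.2.10"  # hypothetical address, for illustration only

correct = broadcast_for(HOST, "255.255.255.128")  # intended beamline netmask
wrong = broadcast_for(HOST, "255.255.255.0")      # mistaken netmask

print(correct)  # 192.0.2.127
print(wrong)    # 192.0.2.255
```

A host with the mistaken mask would send its broadcasts to an address that correctly configured hosts on the same /25 subnet do not treat as their broadcast address.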

The odd thing is that currently we cannot reproduce the timeout problems.
If we run tests now, everything seems to run just fine.  That makes it
rather hard to diagnose the problem.

However, the APS is currently in a shutdown period between user runs.
Given that we had timeout problems when the APS was running, but do not
have timeout problems during the current APS shutdown, my current wild
guess is that the network traffic profile is different on the local subnet
during an APS run compared to what it is during a shutdown.  Which is why
I am investigating network monitoring tools.
 
> I assume that other types of network traffic work (i.e. ping, ssh, telnet,
> http) without issues on this network?
> 
> Jeff

As far as I know, all other network protocols seem to be working fine.

Thanks.

Bill Lavender

> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > On Behalf Of Bill Lavender
> > Sent: Friday, May 11, 2012 2:21 PM
> > To: [email protected]
> > Subject: Channel Access monitoring tools
> > 
> > One of the installations I am responsible for is having trouble with
> > unexplained PV connection failures.  What I have is a program that needs
> > to do things in a serialized fashion.  In other words, for each of
> > several PVs, it needs to start a connection to a PV, wait for the
> > connection to complete, and then immediately use that PV.
> > 
> > At present, I am doing something like this:
> > 
> > 1.  Invoke ca_create_channel() with a connection state change handler.
> >     The connection state change handler sets a flag to indicate that
> >     the connection has completed.
> > 
> > 2.  Call ca_pend_io() to make sure the request is sent on its way.
> > 
> > 3.  I then wait inside a loop periodically calling ca_poll() until
> >     my connection flag has been set.
> > 
> > 4.  If the loop has been looping for too long, I declare a timeout
> >     and call ca_clear_channel() to get rid of the existing unconnected
> >     PV and then call ca_pend_io() to send that request on the way.
> > 
> >     My code did not originally have this ca_clear_channel() call,
> >     but I added it to see if it would help and in the name of
> >     preventing memory leaks.  It didn't help.
> > 
> > 5.  I then go back to step one to create a new channel.
> > 
> > 6.  If I execute the outer loop from step 1 to step 5 too many times
> >     then I give up and tell the user that I have timed out.
> > 
> > Increasing the timeouts has not helped.  For debugging, I have tried
> > timeouts as long as 10 seconds and have not seen a change in the frequency
> > of connection timeouts.
> > 
> > At present, the only client platform that I have that has these timeouts
> > is Debian 6.0 Linux (Squeeze).  The same hardware running Debian 5.0
> > did not have these problems.  The clients are using EPICS Base 3.14.10
> > and the IOC is using 3.14.12.1.
> > 
> > I am assuming that what I need to do here is to monitor the network
> > traffic between the Debian 6.0 machine and the IOC and compare it
> > to the traffic between a Debian 5.0 machine and the same IOC.  I see
> > that there is a Channel Access plugin for Wireshark that I hope will
> > be helpful.  Are there other things that I should be trying?
> > 
> > The Channel Access code is wrapped in some code of my own, so it will
> > not look quite the same as raw Channel Access code, but if you want
> > to look at it anyway, look at the function mx_epics_pv_connect()
> > in this file
> > 
> >   http://svn.csrri.iit.edu/mx/trunk/modules/epics/mx_epics.c
> > 
> > Thanks.
> > 
> > Bill Lavender
> > [email protected]
> 

Replies:
Re: Channel Access monitoring tools Ralph Lange
References:
Channel Access monitoring tools Bill Lavender
RE: Channel Access monitoring tools Hill, Jeff
