Experimental Physics and
Industrial Control System

"Jeff Hill" <[email protected]> · Mon, 12 Mar 2007 10:11:02 -0600

> One more clue:
> I removed the currently down nodes from the EPICS_CA_ADDR_LIST and all
> seems to be okay. I don't get any error messages like "host is down".
> It seems to me, there is a problem in the code with the unicasts: if
> there are nodes on the list which do not respond (in time?) some other
> nodes (which are alive) are considered to be dead???
> Burkhard

The IP kernel attaches a finite-length input queue to each UDP port. The
"host is down" diagnostic is communicated using ICMP messages. I suspect
that enough ICMP messages are arriving to temporarily saturate this input
queue. The saturation clears as soon as CA gets an opportunity to read from
the UDP socket, but any legitimate UDP search protocol responses from certain
IOCs that reach the queue while it is still saturated end up being discarded.
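
To illustrate the kind of loss I mean, here is a minimal Python sketch. It is
not CA code, and the function name, datagram count, payload size, and buffer
size are all arbitrary assumptions. It fills the input queue of an unread UDP
socket over loopback and then counts how many datagrams are still there when
the reader finally gets around to draining it.

```python
import socket
import time


def count_delivered(n_sent=2000, payload_size=512, rcvbuf=4096):
    """Send n_sent datagrams to an unread UDP socket; return how many survive."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask for a small input queue; the kernel may round this value up.
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    rx.bind(("127.0.0.1", 0))

    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * payload_size
    for _ in range(n_sent):
        try:
            tx.sendto(payload, rx.getsockname())
        except OSError:
            pass  # treat a failed send as one more lost datagram

    time.sleep(0.1)  # give the kernel a moment to queue what it accepted

    # Only now does the "client" read, as CA would after being busy.
    rx.setblocking(False)
    delivered = 0
    while True:
        try:
            rx.recv(65535)
        except BlockingIOError:
            break  # queue is empty
        delivered += 1

    rx.close()
    tx.close()
    return delivered


if __name__ == "__main__":
    print("delivered", count_delivered(), "of 2000")
```

On most systems the count printed is far below the number sent, and the
sender sees no error for the datagrams that were dropped. The exact number
depends on the kernel.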

That may explain the behavior you see, but of course it isn't a solution to
your troubles. I would first attempt to better understand your situation by
monitoring all ICMP traffic received by this host. On Linux this can be done
by typing "tcpdump icmp" or perhaps "tcpdump dst host xxx and icmp". (The
longer form "ip proto icmp" needs a backslash before icmp, because tcpdump
treats icmp as a keyword there.) You will need to change to root first on
most systems, and you must run tcpdump on the host that is receiving the ICMP
messages (the host with the CA client). You will also need to monitor the
ICMP traffic while the CA client that is experiencing troubles is running. I
recommend this because the raw ICMP traffic can tell us something about what
might actually be occurring.

See http://en.wikipedia.org/wiki/Internet_Control_Message_Protocol. 

The solution might simply be to fix the hosts that have some sort of unusual
configuration that the routing system is objecting to. The problem might
originate in ordinary hosts or possibly also in some of the routers in your
system.

> >> CAC: error = "Host is down" sending UDP msg to 140.181.98.50:5064

There is an underlying ICMP message that causes this. Knowing which one it
is might be quite useful when attempting to eliminate the extra traffic.
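
If you would rather decode the messages yourself than read tcpdump output,
something along these lines could work. This is only a sketch: describe_icmp
is a hypothetical helper name, it assumes you already have the raw IPv4 packet
bytes (for example from a raw ICMP socket opened as root), and the tables
cover only a few common types plus the first six destination unreachable
codes from RFC 792.

```python
import struct

# A few common ICMP message types (RFC 792).
ICMP_TYPES = {
    0: "Echo Reply",
    3: "Destination Unreachable",
    5: "Redirect",
    8: "Echo Request",
    11: "Time Exceeded",
}

# Codes for type 3, Destination Unreachable (RFC 792).
UNREACH_CODES = {
    0: "net unreachable",
    1: "host unreachable",
    2: "protocol unreachable",
    3: "port unreachable",
    4: "fragmentation needed and DF set",
    5: "source route failed",
}


def describe_icmp(ip_packet: bytes) -> str:
    """Name the ICMP type (and unreachable code) carried in an IPv4 packet."""
    # The low nibble of the first byte is the IP header length in 32-bit words.
    ihl = (ip_packet[0] & 0x0F) * 4
    # The ICMP header starts right after the IP header: type, then code.
    icmp_type, icmp_code = struct.unpack_from("!BB", ip_packet, ihl)
    name = ICMP_TYPES.get(icmp_type, f"type {icmp_type}")
    if icmp_type == 3:
        detail = UNREACH_CODES.get(icmp_code, f"code {icmp_code}")
        return f"{name} ({detail})"
    return name


if __name__ == "__main__":
    # Hand-built packet: minimal 20-byte IP header, then ICMP type 3 code 1.
    packet = bytes([0x45]) + bytes(19) + bytes([3, 1, 0, 0])
    print(describe_icmp(packet))
```

The type and code together identify which condition the sending host or
router is complaining about.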

Jeff

> -----Original Message-----
> From: Burkhard Kolb [mailto:[email protected]]
> Sent: Friday, March 09, 2007 3:07 AM
> To: Jeff Hill
> Cc: [email protected]
> Subject: Re: Darwin and EPICS_CA_AUTO_ADDR_LIST issue
> 
> One more clue:
> I removed the currently down nodes from the EPICS_CA_ADDR_LIST and all
> seems to be okay. I don't get any error messages like "host is down".
> It seems to me, there is a problem in the code with the unicasts: if
> there are nodes on the list which do not respond (in time?) some other
> nodes (which are alive) are considered to be dead???
> Burkhard
> 
> Jeff Hill wrote:
> >
> > What is the response to "ping -s 140.181.98.50"?
> >
> > PS: Behavior is known to vary between IP kernels for such unicast
> > addresses if there are two or more daemons (in this case two or more CA
> > UDP servers) listening to the same {IP address, port} tuple on the same
> > host. The typical behavior difference is that on certain IP kernels only
> > one daemon listening on UDP port 5064 will receive a UDP message with a
> > unicast destination address, but in contrast on other IP kernels all
> > daemons listening on UDP port 5064 will receive such messages. In my
> > experience UDP frames with broadcast address destinations always go to
> > all registered listeners listening on UDP port 5064 - a uniform behavior
> > across IP kernel implementations.
> >
> > PPS: One possible workaround to the above dilemma is to use directed
> > broadcast addresses (router forwarded broadcast addresses) in the
> > EPICS_CA_ADDR_LIST. This sometimes requires enabling of "broadcast
> > forwarding" features in your routers. We might also add support for
> > multicasting to future versions of CA.
> >
> > Jeff
> >
> >> -----Original Message-----
> >> From: Burkhard Kolb [mailto:[email protected]]
> >> Sent: Thursday, March 08, 2007 9:28 AM
> >> To: '[email protected]'
> >> Subject: Darwin and EPICS_CA_AUTO_ADDR_LIST issue
> >>
> >> On my MAC (OSX 10.4.8), EPICS base-3.14.9, medm 3.11:
> >>
> >> When I set EPICS_CA_AUTO_ADDR_LIST to NO and have the EPICS_CA_ADDR_LIST
> >> pointing to the list of IOCs medm does not find all PVs. Sometimes the
> >> connections work for a short time then several IOCs are not seen anymore.
> >> If I set it to YES, medm finds them always.
> >>
> >> In the xterm window I get some error messages from really not existing
> >> IOCs:
> >> CAC: error = "Host is down" sending UDP msg to 140.181.98.50:5064
> >> But the lost ones are not reported!
> >>
> >> I use the same list of IOCs on other linux boxes and have no problem
> >> there.
> >> Any idea?
> >>
> 
> --
> -------------------------------------------------------------------
> Dr. Burkhard Kolb
> GSI mbH  |   KP1   |  Planckstr. 1  |  D-64291 Darmstadt
> Email: [email protected]                |  Tel.: +49 (0)6159 / 71 2667
> -------------------------------------------------------------------
