Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: CA gatway runs away when zero length PV name in UDP search request
From: Anton Derbenev <aaderbenev@gmail.com>
To: tech-talk@aps.anl.gov
Date: Fri, 29 Sep 2017 17:36:12 -0400
Hello all,

it seems that we are experiencing similar issues with NSLS-II gateway instances (thanks to Zhijian Yin for discovering this thread). On multiple occasions (separated by weeks and months), a few gateways spontaneously became stuck, with their processes hanging and consuming an abnormal share of CPU.

The problem was obscured by several factors:

- it appeared on rare occasions with no known reason;
- seemingly no distinguishable log error message about the cause;
- even for clues which were present (e.g. client host:port in logs), backtracking didn't yield much information.
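
For the host:port clue, a minimal sketch of pulling client endpoints out of a gateway log and counting how often each one shows up. It assumes request lines of the form "CAS Request: ? on host:port: cmd=..." as they appear in our logs; the function name count_clients is hypothetical and not part of any EPICS tool.

```python
import re
from collections import Counter

# Assumed line shape: "CAS Request: ? on <host>:<port>: cmd=<n> ..."
REQUEST_RE = re.compile(
    r"CAS Request: \S+ on (?P<host>[^\s:]+):(?P<port>\d+): cmd=(?P<cmd>\d+)"
)

def count_clients(log_lines):
    """Count 'CAS Request' error lines per (host, port) client endpoint."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts
```

Calling count_clients(open(path)) on a log file (path being whatever file the gateway logs to) gives a Counter keyed by client endpoint, which at least shows whether the same host keeps reappearing.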

In the context of the clues provided in this discussion, some things are not clear to me:

- if there is a relevant bug in CAS code, shouldn't all servers receive the invalid zero-length name request and be affected, and not just gateways?
- when the issue appears, why are not all gateways affected but only some?
- why does it look like some gateways choke right away, while other instances can withstand multiple occurrences of the same problem?

From one of our logs:

Sep 18 11:52:48 !!! Errlog message received (message is above)
zero length PV name in UDP search request?

Sep 18 11:52:48 !!! Errlog message received (message is above)
CAS Request: ? on box64-1.cs.nsls2.local:37525: cmd=6 cid=712 typ=5 cnt=11 psz=32 avail=2c8
CAS Request: ? on box64-1.cs.nsls2.local:37525: cmd=6 cid=712 typ=5 cnt=11 psz=32 avail=2c8
CAS: 
Sep 18 11:52:48 !!! Errlog message received (message is above)

...
<a dozen errors like that>
...

Sep 27 10:54:40 !!! Errlog message received (message is above)
CAS Request: ? on box64-1.cs.nsls2.local:59857: cmd=6 cid=433 typ=5 cnt=11 psz=32 avail=1b1
CAS: 
Sep 27 18:02:19 !!! Errlog message received (message is above)
zero length PV name in UDP search request?

Sep 27 18:02:19 !!! Errlog message received (message is above)
@@@ Restarting child

So it was a while before users complained and the gateway was restarted. Is it possible that, when the issue occurs, existing connections persist but new ones are not established?

For now, it looks likely that a gateway restart will be required should any client send such an invalid query...
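
To make the log fields more concrete, here is a minimal sketch of decoding a CA search message and flagging an empty name. It assumes the standard 16-byte big-endian CA header (command, payload size, data type, data count, two 32-bit parameters) and that cmd, psz, typ, cnt and cid in the log correspond to those fields; that mapping is my reading, not something confirmed from the CAS source. The function names are hypothetical.

```python
import struct

# Assumed CA header layout: command, payload size, data type,
# data count (all uint16), then parameter 1 and parameter 2 (uint32).
CA_HEADER = struct.Struct(">HHHHII")
CA_PROTO_SEARCH = 6  # matches cmd=6 in the log lines

def parse_message(buf):
    """Decode one CA message at the start of buf into (fields, pv_name)."""
    cmd, psz, typ, cnt, param1, _param2 = CA_HEADER.unpack_from(buf, 0)
    payload = buf[CA_HEADER.size:CA_HEADER.size + psz]
    # The name is NUL-terminated and padded; keep only the part before the first NUL.
    name = payload.split(b"\0", 1)[0].decode("ascii", "replace")
    fields = {"cmd": cmd, "psz": psz, "typ": typ, "cnt": cnt, "cid": param1}
    return fields, name

def is_zero_length_search(buf):
    """True if buf starts with a search request whose PV name is empty."""
    fields, name = parse_message(buf)
    return fields["cmd"] == CA_PROTO_SEARCH and name == ""
```

Under these assumptions, a message with psz=32 whose payload starts with a NUL byte would be reported as a zero-length search, which is the shape the log lines above seem to describe.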

Regards,
Anton. 
