EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: flaky IOC problems at Jefferson Lab
From: [email protected]
To: [email protected]
Date: Mon, 06 Jan 97 11:05:49 -0500
EPICS experts:

Many of you are aware that Jefferson Lab (formerly CEBAF) has had problems with
IOCs mysteriously crashing, and in fact with multiple IOCs crashing because
one crashed. Well, I think I may have a better understanding of the problem
now, so here's a little more info and an analysis of the problem.  Note
that I do not yet have the fix -- perhaps Jeff Hill can comment?

Symptoms of the problem: 
	1) one IOC is in a funny state; screens which are up continue
	   to work for signals NOT on that IOC
	2) any MEDM or DM session which attempts to connect to ANY signal on
	   that IOC will become COMPLETELY UNUSABLE, and will in fact not
	   be able to connect to ANY signal on ANY IOC; existing screens
	   continue to function for signals on other IOCs.

Further investigation reveals:
	1) a scan task is consuming all available CPU resources

Analysis:
	1) the CAMAC serial highway is probably at fault, and the error
	   handling for a fault is expensive enough that if all I/O to
	   the highway fails, there are not enough spare CPU cycles
	   to keep up with the increased load due to error handling
	   (this needs further study to verify and fix, probably by taking
	   the offending serial highway offline and forcing all further
	   I/O to fail immediately with no handler invocation).
	
	(IMPORTANT ASIDE: As some of you may remember, we run our name
	 resolution task at an elevated priority so that when we bring up
	 a screen with 2000 channels on it, it resolves in an acceptable
	 amount of time. Without this adjustment in priorities, that
	 screen would take 5 minutes or more to completely resolve. This
	 is due to the fact that some IOCs are running 75-80% busy in
	 steady state, and the remaining 20-25% is not enough CPU time to
	 resolve 2000 names before channel access times out.)

	2) the name resolution task is at a higher priority than the scan
	   task now using all available CPU cycles, so name resolution
	   can respond, BUT the channel access client tasks are completely
	   starved

	3) So, here's the scenario: client C broadcasts to target IOC T a list
	   of names. T responds with the subset that it is willing to serve.
	   C does the next step in connecting to those channels, but T does
	   not respond. C waits forever.
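
A toy model of the starvation described in points 1) to 3), offered only as a sketch: it simulates a strict fixed-priority scheduler in which the highest-priority ready task always gets the next time slice. The task names, the priority numbers (lower number means higher priority, as on vxWorks) and the "one search request every 10 slices" load are all hypothetical and are not taken from the real IOC configuration.

```python
def run_fixed_priority(tasks, ticks):
    """Simulate a strict fixed-priority scheduler for `ticks` time slices.

    `tasks` maps a task name to (priority, is_ready). A lower priority
    number wins, and is_ready(tick) says whether the task wants the CPU
    in that slice. Returns the number of slices each task actually ran.
    """
    ran = {name: 0 for name in tasks}
    for tick in range(ticks):
        ready = [(prio, name)
                 for name, (prio, is_ready) in tasks.items()
                 if is_ready(tick)]
        if ready:
            _, winner = min(ready)  # highest priority = smallest number
            ran[winner] += 1
    return ran


# Hypothetical task set; priorities are illustrative only.
tasks = {
    "name_resolution": (10, lambda t: t % 10 == 0),  # occasional search requests
    "scan":            (20, lambda t: True),         # stuck in error handling
    "ca_client":       (30, lambda t: True),         # wants to finish connecting
}

usage = run_fixed_priority(tasks, 1000)
print(usage)
```

In this model the name resolution task still gets every slice it asks for, the scan task takes all the rest, and the CA client task never runs, which matches the observed behaviour: searches are answered but connections never complete.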

	It appears that the channel access client library never times out
	on this error.  Most sites never see this, because the odds of an
	IOC dying between the name resolution response and the connection
	establishment are vanishingly small, and if the IOC crashes all the
	way and reboots, the CA library probably unhangs and reconnects OK.
	So the error is only revealed if the IOC hangs or dies without
	rebooting between the name resolution reply and the connection
	establishment.
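
The kind of hang described above can be reproduced with plain TCP sockets, and bounding the wait avoids it. The following is a minimal sketch, not the channel access library or its protocol: `request_with_timeout`, the payload and the 0.5 second limit are hypothetical, and the "hung IOC" is just a listening socket on localhost that nobody ever services, so the kernel completes the TCP handshake but no reply is ever sent.

```python
import socket


def request_with_timeout(address, payload, timeout_s):
    """Send payload and wait for a reply, giving up after timeout_s seconds.

    Returns the reply bytes, or None if the peer stayed silent.
    """
    # The timeout given here also applies to the later send/recv calls.
    with socket.create_connection(address, timeout=timeout_s) as sock:
        sock.sendall(payload)
        try:
            return sock.recv(4096)
        except socket.timeout:
            return None  # peer accepted the connection but never answered


# Stand-in for the hung IOC: the socket listens, but no task ever
# accepts the connection or answers the request.
hung_ioc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hung_ioc.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
hung_ioc.listen(1)

reply = request_with_timeout(hung_ioc.getsockname(), b"connect channel", 0.5)
print("reply:", reply)
hung_ioc.close()
```

Without the timeout the `recv` call would block indefinitely, which is the client-side behaviour the analysis suspects; with it, the caller gets control back and can mark the channel disconnected and retry later.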

	It may also be that this is the reason that one IOC brings another
	down: IOC A hangs, B attempts to reconnect and its CA library hangs,
	causing another IOC to attempt to reconnect to B and hang, causing ...
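
The suspected chain reaction can be pictured as a walk over "who is a CA client of whom". This is a sketch under a strong assumption, namely that every IOC holding channels to a hung IOC blocks forever on reconnect; the IOC names and the `clients_of` map are hypothetical.

```python
from collections import deque


def cascade(clients_of, first_hung):
    """Return IOCs in the order they hang, assuming each client of a
    hung IOC blocks forever when it tries to reconnect."""
    order = [first_hung]
    seen = {first_hung}
    queue = deque([first_hung])
    while queue:
        hung = queue.popleft()
        for client in clients_of.get(hung, []):
            if client not in seen:
                seen.add(client)
                order.append(client)
                queue.append(client)
    return order


# Hypothetical topology: B has channels on A, C has channels on B,
# D talks to nobody in this chain.
clients_of = {"A": ["B"], "B": ["C"], "C": [], "D": []}
hung_order = cascade(clients_of, "A")
print(hung_order)  # ['A', 'B', 'C']
```

Under that assumption a single hung IOC takes down everything downstream of it, while IOCs with no channels into the chain (D here) keep running.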

	Fixes available: 

	(1) improve the CAMAC driver so starvation does
	not occur; this does not really fix the problem, since anything that
	causes complete CPU starvation on one IOC will affect our whole control
	system if it persists long enough
	
	(2) reduce the priority of the name resolution task to the default;
	this is simply not acceptable -- EPICS would be slow as a dog for
	operators

	(3) change the channel access library to correctly time out in this
	scenario;  this assumes that I have analyzed the problem correctly
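
For fix (1), the idea from the analysis of taking the offending highway offline and failing all further I/O immediately, with no handler invocation, can be sketched as a small guard around the I/O call. This is not the CAMAC driver: `HighwayGuard`, `dead_highway` and the threshold of 3 consecutive faults are hypothetical, chosen only to illustrate the technique.

```python
class HighwayGuard:
    """Wrap an I/O function and take it offline after repeated faults."""

    def __init__(self, do_io, on_fault, max_faults=3):
        self.do_io = do_io            # performs one transaction, may raise OSError
        self.on_fault = on_fault      # the expensive error handler
        self.max_faults = max_faults  # hypothetical threshold
        self.faults = 0               # consecutive faults seen so far
        self.offline = False

    def transact(self, request):
        if self.offline:
            return None  # fail immediately: no I/O attempt, no handler
        try:
            result = self.do_io(request)
        except OSError as err:
            self.faults += 1
            self.on_fault(err)
            if self.faults >= self.max_faults:
                self.offline = True
            return None
        self.faults = 0  # a success resets the consecutive-fault count
        return result


handled = []


def dead_highway(request):
    """Stand-in for a highway on which every transaction fails."""
    raise OSError("no response from crate")


guard = HighwayGuard(dead_highway, handled.append, max_faults=3)
results = [guard.transact(n) for n in range(100)]
print(len(handled), guard.offline)  # 3 True
```

Out of 100 failing transactions only the first 3 pay for error handling; the other 97 return at once, so the scan task's cost per failed request stays small. A real driver would also need some way to bring the highway back online, which this sketch omits.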




Regards,

Chip


-----------------------------------------------------------------------------
Chip Watson
Internet: [email protected]    Thomas Jefferson National Accelerator Facility *
Tel: (757) 269-7101          12000 Jefferson Avenue, MS 12A2
FAX: (757) 269-5024          Newport News, VA 23606
WWW: http://www.jlab.org/~watson/
* (formerly CEBAF, the Continuous Electron Beam Accelerator Facility)



Replies:
Re: flaky IOC problems at Jefferson Lab Rolf Keitel
Re: flaky IOC problems at Jefferson Lab Jeff Hill
