Experimental Physics and
Industrial Control System

Jeff Hill <[email protected]> · Tue, 07 Jan 1997 15:00:56 -0700

Chip,

First, let me say that I think we are all relieved now that we better understand
the nature of the problems that have been occurring at CEBAF. In summary,
we have most likely not seen this problem because:

1) We have not experienced the CAMAC driver failure that is occurring
at CEBAF. To the best of my knowledge CEBAF is using a different CAMAC driver
than is used at other sites. This is perhaps also related to the average available 
idle time. The CEBAF IOCs are (according to Chip) 75-80% busy on average. 
Therefore I expect that peak CPU consumption is much higher. 
It appears that a high CAMAC IO error rate caused the driver to consume all available
CPU (perhaps spinning on a transaction completion flag). Others may disagree;
however, IMHO it is a failure of the driver (or the system design) if a driver is allowed
to consume all available CPU at high priority when the hardware is failing.
When a driver fails in this fashion at high priority we can easily imagine 
that any number of critical functions will fail (because of CPU starvation).
When writing drivers it is always best to avoid consuming significantly
more CPU in off-normal situations than is required to process the record.
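
A minimal sketch of that guideline, written in Python for brevity rather than
in driver C. It is not the CEBAF CAMAC driver or any existing EPICS code; the
names wait_for_completion, is_done, timeout_s and poll_interval_s are
hypothetical. It shows a wait on a completion flag that is bounded by a
deadline and gives up the CPU between polls, so a failing device costs a
known, small amount of CPU instead of all of it.

```python
import time


def wait_for_completion(is_done, timeout_s=0.05, poll_interval_s=0.001):
    """Poll is_done() until it returns True or timeout_s elapses.

    Returns True if the transaction completed, False on timeout.
    Sleeping between polls yields the CPU to other tasks, and the
    deadline bounds the total time spent waiting on failed hardware.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False  # caller reports an IO error and moves on
        time.sleep(poll_interval_s)
```

On timeout the caller would presumably mark the record as being in an alarm
state and return, rather than retrying immediately at the same priority. In a
real driver an interrupt plus a semaphore take with a timeout would likely be
preferable to polling; that is an assumption about the design, not a
description of any particular driver.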

2) CEBAF has elevated the priority of one of the tasks in the CA server. 

3) The IP kernel TCP/IP virtual circuit "connect()" timeout parameter 
may be different at CEBAF.

Chip wrote:
> 
>         (IMPORTANT ASIDE: As some of you may remember, we run our name
>          resolution task at an elevated priority so that when we bring up
>          a screen with 2000 channels on it, it resolves in an acceptable
>          amount of time. Without this adjustment in priorities, that
>          screen would take 5 minutes or more to completely resolve. This
>          is due to the fact that some IOC's are running 75-80% busy in
>          steady state, and the remaining 20% is not enough CPU time to
>          resolve 2000 names before channel access times out)
> 

Note that the efficiency of the name resolution activity was improved
in 3.12 at some point (by changing a time constant). Perhaps Marty's numbers 
do not agree with what CEBAF has observed because he is using a more recent 
version of the code.

Marty wrote:
> What happened at TJNAF is the following:
>  
> 1)TJNAF raised the priority of CA UDP above that of the scan tasks
> 2)CAMAC failed causing a scan task to use all available cpu time.
> This caused all tasks of lower priority to be starved.
> 3)A CA client issued search requests.
> 4)CA UDP received the request and sent a reply to the client.
> 5)The client sent a message to CA TCP and waited forever for a response.
> 6)CA TCP never got a chance to process the message.

This is a correct summary of the CA activity occurring except that 
I must clarify steps 5 and 6. When the CA client library receives a 
search response over UDP from a new IOC it attempts to establish a TCP/IP
virtual circuit to the IOC using the "connect()" call in the
socket library. This call has no timeout parameter and the kernel
default timeout is quite long. Many of you may have experienced this long
timeout when you typed "telnet xxxx" when "xxxx" was not present on 
the net (you may have typed ^C instead of waiting the full duration
of the timeout). The default connect timeout on our Sun systems is about 80 sec.
Note that the CA client lib _WILL_ recover in the rare circumstances when 
this occurs if the operator is willing to wait for the full duration of 
the timeout. 

There is no portable call for establishing a TCP/IP circuit other than
"connect()". The vxWorks OS does supply "connectWithTimeout()" however.

I am not using non-blocking IO at the time that "connect()" is called.
If I did, it would add some (perhaps substantial) complication to the client
lib, but it would also avoid stalling the CA client library for the duration
of the timeout. The net effect would be faster connects for clients that
connect to multiple IOCs when one of the IOCs has failed under the rare
circumstances described above. Of course the client would still never
connect to any IOCs that have failed in the way that Chip has described.
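
For those unfamiliar with the technique, here is a minimal sketch of a
non-blocking connect with a caller-chosen timeout, in Python because its
socket module mirrors the BSD socket calls. It is not the CA client library
code and not a proposed patch; the function name connect_with_deadline and
its parameters are hypothetical, and POSIX socket behavior is assumed.

```python
import errno
import select
import socket


def connect_with_deadline(host, port, timeout_s):
    """Return a connected TCP socket, or None on failure or timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    # In non-blocking mode connect_ex() returns at once, normally with
    # EINPROGRESS, instead of waiting for the kernel's default timeout.
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        return None
    if err != 0:
        # The socket becomes writable when the attempt finishes,
        # whether it succeeded or failed.
        _, writable, _ = select.select([], [sock], [], timeout_s)
        if not writable:
            sock.close()  # timed out: give up on this server
            return None
        # SO_ERROR holds the final result of the connect attempt.
        if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) != 0:
            sock.close()
            return None
    sock.setblocking(True)
    return sock
```

A client talking to several IOCs could keep many such in-progress sockets in
one select() call, so that one unresponsive IOC does not delay connections to
the others. That bookkeeping is where most of the added complication would
come from.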

I will be looking at the level of effort required to install 
this (non-blocking connect) change. It would be interesting to hear from all 
of the sites that consider this to be an important issue (and therefore would 
like to see non-blocking connect() installed).

I am also examining the situations reported by Rolf Keitel and Bill Brown
in more detail.

Chip wrote:
>         It may also be that this is the reason that 1 ioc brings another
>         down: ioc A hangs, B attempts to reconnect and its ca library hangs,
>         causing ioc to attempt to reconnect to B and hang, causing ...
> 

I doubt that this is occurring unless the vxWorks "connect()" implementation is
consuming too much CPU (we have not seen this).

>         (2) reduce the priority of the name resolution task to the default;
>         this is simply not acceptable -- EPICS would be slow as a dog for
>         operators
> 

Perhaps newer versions of 3.12 (or a faster CPU that is less than 80% loaded)
will connect 2000 channels faster. No doubt the CA connect
algorithm could also be optimized (this would perhaps remove some
idle delays when the CPU isn't working, but I suspect it would not reduce
the total CPU consumed by much).

Jeff

-- 
______________________________________________________________________
Jeffrey O. Hill                 Internet        [email protected]
LANL MS H820                    Voice           505 665 1831
Los Alamos, NM 87545 USA        FAX             505 665 5107
