Experimental Physics and
Industrial Control System

"Ned D. Arnold" <[email protected]> · Wed, 27 Nov 2002 09:14:09 -0600

Marty,

Ken has an excellent test bed for the PV Gateway (and Portable Channel
Access Server). The PV Gateway on Hydra is used heavily and is a great
test for HEAVY LOAD conditions.

His observation of the increased CPU usage (his e-mail to Jeff on
10-8-02 is included below) seems to be much more significant than
typical "resource creep". The loop rate for the R3.13 version was over
100 Hz. When this was stopped and the R3.14 version was started, the loop
rate dropped to under 10 Hz. Same hardware, same number of PV requests,
same version of Solaris, different version of PCAS.

Since the response time was noticeably slower for the users, we backed
out of the 3.14 version just before the user run began.

Ken and I thought this was a significant discovery that may affect many
applications attempting to move to R3.14. Now (before the R3.14 release)
is the time to be thorough and investigate whether this is a typical
case or not. Such a performance degradation will affect numerous systems,
and buying faster hardware is not always a solution.

According to recent talks at JLAB, the PCAS is used extensively, and in
many situations performance is critical (imaging systems, LabVIEW
server, etc.).

	Ned

Marty Kraimer wrote:
> At the EPICS Core Working Group meeting at JLAB the problem of running
> the gateway on 3.14 was discussed. My understanding is that when we
> build the gateway against 3.14 it uses so much cpu time that it doesn't
> work correctly. Someone mentioned that on 3.13 it already uses 75% of
> the cpu.
>
> Some questions.
>
> Is this true? Ken should know the answers. Do you have some actual
> performance numbers?
>
> If this is true then it sounds like only a matter of time until even the
> 3.13 version will fail.
>
> I assume this only applies to the gateway for ASD, not the gateways for
> the CATS.
>
> For the ASD gateway can't we have another solution?
>
> Some possibilities.
>
> Run separate gateways for phoebus and oxygen.
> Get a more powerful gateway machine.
>
> Marty
>

Re: Gateway Status 10-8-02
Kenneth Evans, Jr. wrote:

Jeff,

     We have been running the latest Gateway 2.0 built with Base 3.14 on
Hydra as our main Gateway since Oct. 1.  This is the version that doesn't
print the many errlog messages, though there are quite a few left.  It
crashed (only) once on Oct. 8 with "Pure virtual function called".  Otherwise,
it seems to be working properly.  It is doing what a Gateway is supposed to
do as far as I can tell.

     The problem is that it is inefficient and using too much CPU.  The
Gateway CPU has consistently been at around 95%, and the loop rate has been
just above 10 Hz, the limit if ca_poll() is to be called once every 100 ms.
This is on a 440 MHz UltraSparc-IIi with 1 processor.  It is "on the edge".
There are complaints of slow response, and if you try to do anything on
Hydra, the response is slow (as would be expected for a machine using 100%
CPU).  We did not feel we could continue to run it, as user operations
recommence tomorrow.

     The attached StripTool plot shows what happened when we changed back to
1.3.3.4, the latest Gateway 1.3 version.  The CPU goes down and the loop
rate goes up.  It is no longer "on the edge" and has quite a bit of
headroom.  The graph to the left of where it was changed is typical of the
load over the last week.  I have been watching it, and it has pretty much
looked like that during the whole period.  It is now handling the same load
but using fewer resources.  Note that the loop rate is now over 100 Hz.
fdManager is called with a 10 ms timeout, so this means fdManager is
returning early.  (The loop consists of calls to fdManager, then ca_poll,
then Gateway stuff).  Note that both versions are "keeping up" in that the
ServerEventRate is equal to the ServerPostRate.  The threshold where this no
longer happens is much higher.
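
The loop-rate arithmetic in the paragraph above can be sketched as follows.
This is only an illustrative model, not Gateway code: the function names are
hypothetical, and it assumes each pass of the loop is one wait in fdManager
followed by some amount of work (ca_poll plus Gateway processing).  Only the
10 ms timeout and the 100 ms ca_poll interval come from the message.

```python
# Figures quoted in the message.
FD_TIMEOUT_S = 0.010      # fdManager timeout (10 ms)
CA_POLL_PERIOD_S = 0.100  # ca_poll() should run at least every 100 ms


def loop_rate_hz(wait_s, work_s):
    """Passes per second if each pass waits wait_s, then works for work_s.

    Hypothetical model of the loop: fdManager wait, then ca_poll and
    Gateway processing lumped together as work_s.
    """
    return 1.0 / (wait_s + work_s)


def returns_early(rate_hz):
    """True if the mean pass is shorter than the fdManager timeout.

    A pass that always waited the full timeout could not average less
    than FD_TIMEOUT_S, so a shorter mean implies early returns.
    """
    return (1.0 / rate_hz) < FD_TIMEOUT_S


# Waiting the full 10 ms every pass with negligible work caps the rate
# at about 100 Hz.
full_timeout_rate = loop_rate_hz(FD_TIMEOUT_S, 0.0)

# Lowest loop rate that still calls ca_poll() once every 100 ms.
min_rate = 1.0 / CA_POLL_PERIOD_S

if __name__ == "__main__":
    print(full_timeout_rate, min_rate)
    print(returns_early(120.0), returns_early(10.5))
```

Under this model, a rate above 100 Hz is only possible if fdManager comes
back before its timeout expires, and a rate near 10 Hz leaves no margin for
the 100 ms ca_poll interval.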

     It now runs on Linux, and the behavior, while better, seems commensurate
given that the Linux box has 2 Pentium IIIs at 930 MHz each.  It also runs
on Windows, but the performance seems much worse there (even though it is 1
Pentium at 800 MHz).  It appears to use very little CPU on WIN32, even when
loaded.  It just stops "keeping up".  In addition, the threshold for
"keeping up" is lower than for the other two.  That is, it doesn't seem to
be utilizing the available CPU.

     It needs to be fixed before we can use it for production.

	-Ken
