At the EPICS Core Working Group meeting at JLAB the problem of running
the gateway on 3.14 was discussed. My understanding is that when we
build the gateway against 3.14 it uses so much CPU time that it doesn't
work correctly. Someone mentioned that on 3.13 it already uses 75% of
the CPU.
[KE] It works correctly. It uses more CPU than the
old one. Possibly up to a factor of 2 more.
Some questions.
Is this true? Ken should know the answers. Do you have some actual
performance numbers?
[KE] The only real-life test is for our Hydra
Gateway. It typically uses about 75% CPU with the old version, and the CPU is
saturated at about 95% with the new one. All other programs tend
to be sluggish when it is running, and the machine is too close to the
limit. It ran for a week or two during shutdown and crashed only once,
compared to much more frequently for the old one. It also has some
useful new features and bug fixes. We (including Ned) decided to not use
it further. The performance is, of course, time dependent. I
have quite a lot of StripTool plots if you want more information. The
dramatic ones are the ones taken when we changed versions. Only the CPU
changes.
[KE] This could all be fixed by a faster machine, but replacing all our
Gateway machines would be expensive. There is no reason 3.14 should be
using more CPU. It is a problem that should be fixed now. This is
probably the last of a series of problems with CAS in 3.14. Jeff has
fixed the others as the Gateway has uncovered them. The long time spent
getting the 3.14 Gateway running was not from the conversion, which was
done in a few days, but from fixing the succession of problems in CAS
for 3.14. I trust Jeff can fix this one, too, if he works on it.
If this is true then it sounds like only a matter of time until even
the 3.13 version will fail.
[KE] It's been doing OK for some time. The load
doesn't appear to be increasing much with time. You should verify
this with Marty Smith.
I assume this only applies to the gateway for ASD, not the gateways for
the CATs.
[KE] I would guess Hydra is the most used. Some of the CATs probably
have a light load. Marty Smith or Mohan would be the person to contact
about this. All the stats are available from
http://www.aps4.anl.gov/user_operations/index.html.
For the ASD gateway can't we have another solution?
[KE] Yes, we can buy a faster computer, perhaps even a Linux one.
Saturn does much better than Hydra in my tests. But the problem
probably affects all portable servers. It may actually be in CA, as the
routine using all the time is tcp_recv_thread in CA. Hence it may
affect CA clients. It may be owing to thread scheduling, which is used
in CA but not in CAS, according to my understanding. Why would we not
investigate and try to fix it?
[KE] A long-standing problem with the Gateway is that it first calls
fdManager, then ca_poll. Neither returns while it still has work to do.
Hence the other one cannot get time to empty its queues, etc., while
one is running. Filled queues in turn affect the one that is running,
and the problem compounds and accelerates. It would be nice to
multiplex them in some way as a long-term solution.
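
To make the multiplexing idea concrete, here is a minimal sketch of
one possible approach, not the Gateway's actual code. It does not call
fdManager or ca_poll. The Service class, its process(budget) method
and run_multiplexed are hypothetical names made up for illustration,
and the two queues merely stand in for the server-side and client-side
work. The point is only that each side handles a bounded amount of
work per turn and then returns, so the other side always gets a chance
to drain its queue.

```python
from collections import deque


class Service:
    """Hypothetical stand-in for one side of the Gateway (server or client)."""

    def __init__(self, name, pending):
        self.name = name
        # Each queued item represents one unit of pending work.
        self.queue = deque(range(pending))

    def process(self, budget):
        """Handle at most `budget` queued items, then return control."""
        done = 0
        while self.queue and done < budget:
            self.queue.popleft()
            done += 1
        return done


def run_multiplexed(services, budget):
    """Alternate between services, giving each a bounded slice per turn.

    Returns the sequence of (service name, items handled) slices that
    actually did work.
    """
    order = []
    while any(s.queue for s in services):
        for s in services:
            handled = s.process(budget)
            if handled:
                order.append((s.name, handled))
    return order


if __name__ == "__main__":
    server = Service("server", 5)
    client = Service("client", 3)
    for name, handled in run_multiplexed([server, client], budget=2):
        print(name, handled)
```

With a small budget the two sides interleave, so neither queue can
grow unchecked while the other side is busy. Whether the real calls
can be bounded in this way is an open question here, not something
this sketch demonstrates.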
Some possibilities.
Run separate gateways for phoebus and oxygen.
Get a more powerful gateway machine.
Marty