EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: How to write a python-based backup tool?
From: Matt Newville <[email protected]>
To: EPICS tech-talk <[email protected]>
Date: Fri, 16 Sep 2011 14:07:59 -0500
Hi Emmanuel,

On Fri, Sep 16, 2011 at 11:58 AM,  <[email protected]> wrote:
>
> Hello all,
>
> I was looking at the existing backup utilities (burt, casr, sddscasr, etc.).
>
> To all those utilities you supply a request file (a list of PVs to save).
> Now what do those utilities do?
> Do they do a SEQUENTIAL 'caget' equivalent of all the PVs,
> or do they do some kind of 'smarter' operation, like getting all the PVs at once?
>
> Now I am interested in storing the PVs in hdf5 or sqlite files.
> (assume 1 backup per file)
> I could of course modify any of the existing backup tools.
> Most of them are written in C, so the code execution is pretty fast.
> (Is any one of them faster than the others? If so, why?)
>
>
> Now I am contemplating writing the code in Python.
> How would this Python code cope with 50K+ PVs?
> (In other words, is there a limitation in pyepics, or a performance issue?)

In principle, I think this is possible.  I run an archive-to-mysql
application with 5K PVs, and I'm sure that going to 10K would be no
problem.  I think that going to 10M PVs would be challenging, but I
can't say I really know where the breaking point would be.

> <open container file>
> for pvName in pvNames :
>    pv = epics.PV(pvName)
>    val = pv.get()
>    <put in result in container file>
> <write container file to disk>

I think the main performance issue here would be what happens if some
of the PVs are unconnected or take a while to connect.  A slightly
better approach might be to first create all the PVs without waiting
for connection or timing out (as pv.get() does), and only then go on
to get all the values:

pvlist = []
for pvName in pvNames:
    pvlist.append(epics.PV(pvName))

for pv in pvlist:
    val = pv.get()
    <store value>
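As a runnable sketch of that two-pass pattern (create everything first, read later), here is a version with a stand-in PV class, so it runs without a CA server.  FakePV and the example PV names are illustrative only; in real use you would pass epics.PV, whose constructor likewise issues the connection request without blocking:

```python
# A runnable sketch of the two-pass pattern above (create first, read later).
# FakePV is a stand-in for epics.PV so this runs without a CA server; a real
# epics.PV likewise issues its connection request in __init__ without blocking.

class FakePV:
    def __init__(self, pvname):
        self.pvname = pvname      # a real epics.PV starts connecting here
    def get(self):
        return len(self.pvname)  # placeholder for the fetched value

def backup(pv_names, pv_class=FakePV):
    # Pass 1: create all PV objects up front so connections proceed in parallel.
    pvlist = [pv_class(name) for name in pv_names]
    # Pass 2: read values; by now most channels would already be connected.
    return {pv.pvname: pv.get() for pv in pvlist}

snapshot = backup(["EX:S1:current", "EX:S2:current"])
```

The point of the split is that all the (slow, network-bound) connections overlap in time, instead of each pv.get() paying its own connection wait sequentially.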

With this sort of approach, I typically see on the order of 10ms per
PV connection on startup.  That is, if I create and connect to 5K
PVs, it takes ~50 seconds (meaning 40 to 80 seconds) to get initial
values.  I believe that is mostly the CA library, not the Python part,
and I believe it would scale, suggesting that any save/restore process
that runs once and then quits would take ~10 minutes for 50K PVs.
OTOH, once connected to all those PVs, writing all the values to disk
a second time would be very fast.

> I may be interested in using Cython for C compilation of the Python code above.
> Not sure of the performance gain, but is any one of you using Cython?

My understanding from a little playing with Cython is that it is very
good at turning the slow parts of Python into C-like performance.  I
think the "slow Python parts" for this code are:
 a) creating a list of 50K names / PVs -- not that big a cost.

 b) creating the 50K PV objects -- shouldn't be that big either, as
each creates a CHID and asks for a connection callback, which will not
prevent you from going on to the next Python statement.

 c) the Python parts of the actual PV.get(), which shouldn't be that
big, especially for scalar values, and especially assuming that the
connection has already been established.

I think these are not very slow, though there might be some Python GIL
issues with having all those different bits of code run from the CA
connection callbacks -- I haven't noticed a problem, but I haven't
pushed hard on it either.  I think most of the real time would be
spent in the initial network connection part of the CA library --
which is to say that I'd guess the speed of this loop would not be
much slower than the C equivalent.
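To make the non-blocking connection-callback behaviour concrete without a running IOC, here is a small stand-in sketch.  threading.Timer plays the role of the CA library thread that fires the connection callback some time after channel creation; SketchPV and its names are illustrative, not pyepics API:

```python
import threading

class SketchPV:
    # Mimics the behaviour described above: construction returns at once,
    # and a "connection callback" fires later from a library-owned thread
    # (threading.Timer stands in for the CA library's callback thread).
    def __init__(self, pvname, delay=0.01):
        self.pvname = pvname
        self.connected = threading.Event()
        threading.Timer(delay, self._on_connect).start()  # non-blocking

    def _on_connect(self):
        # Runs on the timer thread, much as a CA connection callback does;
        # the GIL serializes it with the main thread's Python bytecode.
        self.connected.set()

# Create many PVs quickly, then wait for all the connections in a second pass.
pvs = [SketchPV("EX:PV:%03d" % i) for i in range(50)]
for pv in pvs:
    pv.connected.wait(timeout=2.0)
```

The creation loop finishes long before the callbacks do, which is exactly why deferring the gets to a second pass wins: the waiting happens once, for all channels at the same time.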

Hope that helps,

--Matt


Replies:
Re: How to write a python-based backup tool? emmanuel_mayssat
References:
How to write a python-based backup tool? emmanuel_mayssat
