Experimental Physics and
Industrial Control System

Benjamin Franksen <[email protected]> · Fri, 24 Jun 2005 23:38:39 +0200

On Friday 24 June 2005 16:37, Kay-Uwe Kasemir wrote:
> On Jun 23, 2005, at 19:36, Benjamin Franksen wrote:
> > Do we really need this?
> > Isn't this more like exposing internal details, unnecessarily
> > complicating the client interface?
> > What about hiding most of the name service questions behind
> > protocol and
> > just give client code the answer(s) they want about PV in question?
>
> We need this type of API both for 'hello world' and the real world
> in the form of matlab scripts:
>
>    pulselen = caget("DTL4:LLRF:PulseLen.value");
>
> Some CA clients want to be more involved in
> what's happening internally:
> 1 - select directory server
> 2 - query it for "DTL4:LLRF:PulseLen"
> 3 - CAS = the CA server that's marked 'primary'
> 4 - create a channel for "DTL4:LLRF:PulseLen"
> 5 - connect to CAS
> 6 - build a request for the "value" property
> 7 - send it
> 8 - wait for response
> 9 - extract the value from the response
> In addition, there's error handling.
>
> The V3 client API hid the discovery
> (configurable via EPICS_CA*ADDR* but not in the API),
> then exposed most from 4 down.
>
> Doug asked me yesterday why we have the user 'connect'.
> Ben suggests hiding as much as possible.

Let me explain the rationale.

Consider the first three steps above:

1 - select directory server
2 - query it for "DTL4:LLRF:PulseLen"
3 - CAS = the CA server that's marked 'primary'

There are two hidden preconditions in this 'code':

(a) some directory server is running
(b) there is one CA server marked 'primary' among all that serve the PV 
in question

What if these preconditions are not fulfilled?

ad (a) The directory server might be down. Does this cause the whole 
program to fall down? Do we want such a server to be a single point of 
failure for all clients? I thought it was agreed that we want to fall 
back to broadcasting in case name servers are down.
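
The fallback in (a) could be sketched roughly as follows, in Python for
brevity. All names here (resolve, DirectoryUnavailable, the query and
broadcast callables) are made up for illustration and are not part of
any existing CA API.

```python
class DirectoryUnavailable(Exception):
    """Raised by a (hypothetical) directory query when the server is down."""


def resolve(pv_name, directory_queries, broadcast_search):
    """Return the servers offering pv_name.

    directory_queries: callables pv_name -> list of servers, tried in order.
    broadcast_search:  callable pv_name -> list of servers, used as fallback.
    """
    for query in directory_queries:
        try:
            servers = query(pv_name)
        except DirectoryUnavailable:
            continue  # this directory is down: try the next one
        if servers:
            return servers
    # No directory gave an answer: fall back to broadcasting.
    return broadcast_search(pv_name)
```

With something like this inside the library, a dead directory server
costs a slower lookup but does not take the client down.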

ad (b) How do you ensure that for several servers serving a certain PV, 
there is always exactly one marked as 'primary'? This seems to imply 
maintaining an absolute ordering relation among CA servers. Is this 
ordering configurable for each PV or only for each server? Is the 
configuration static or can it change during runtime and what does that 
mean for already connected channels? Note that in a complex control 
system servers may go down and come up in unpredictable patterns. So 
even if the place in the ordering for servers is configured statically 
for each server, for clients it will be something that might change at 
runtime.

My suggestion is to hide the complexity of these questions behind a good 
abstraction, so that the client programmer cannot inadvertently break 
invariants, and so that client code is as robust as possible.

> Jeff's API notes include the question to abandon the
> channelStateNotify and instead allow subscription on channel
> creation.
>
> I agree completely that connection should be hidden.
> Instead of a 'connected' notification you get the value.
> A 'disconnect' is simply an error for blocking 'gets' & 'puts'
> or a special type of notification for async calls and subscriptions.

What you propose here is not hiding. As you said, you /will/ get an 
error (for the blocking calls) if connection is lost and you /will/ get 
a different sort of notification for async calls and subscriptions. 
Nothing is hidden. Really hiding the connection state is impossible 
under typical control system requirements. It is a fact we cannot 
ignore that connections get lost, that PVs appear, disappear and 
re-appear, possibly on a different server. The question is how you 
structure the API so that client code can handle these situations 
properly, depending on the client's requirements. Your proposal may be 
better or worse in this regard (I haven't formed an opinion yet) but 
the matter is completely separate from actually hiding stuff (like 
server selection) from the client program.

> I do see a need to select the directory server
> (there might be different ones for the linac, test network, ...).

See above: what if it isn't running? Do you want the application to 
fail, then?

> Sometimes I also want to decide which PV gets used in case
> there's a primary, backup, ...

Why? IMO, backup should be selected automatically, whenever primary 
fails. Yes, you definitely do want to be notified whenever this 
happens, but this is an orthogonal matter.

Server duplication for backup should IMHO be completely transparent to 
CA clients.

Let me sketch how I would solve the multiple PVs problem in the most 
general manner:

Each PV on each server has an associated property that indicates its 
place in the global selection order. Let us for the moment name this 
property 'precedence'. It could be an integral number, for instance. 
There is a default precedence (for instance zero), but servers can 
change the default (for subsequent creations) and are also able to 
change precedences bulk-wise for a given subset (including all) of the 
PVs they serve. Whenever a server creates a new PV with a given 
precedence, it first checks if the (PV-name, precedence) pair already 
exists in the system. If yes, this is an error and the PV does not get 
created.
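
A minimal sketch of the uniqueness rule just described, in Python. The
class and method names are hypothetical, and a single in-memory table
stands in for whatever system-wide mechanism would actually perform the
check.

```python
class DuplicatePrecedence(Exception):
    """The (PV-name, precedence) pair already exists in the system."""


class PVRegistry:
    DEFAULT_PRECEDENCE = 0  # default suggested in the text

    def __init__(self):
        self._entries = {}  # (pv_name, precedence) -> server

    def register(self, pv_name, server, precedence=DEFAULT_PRECEDENCE):
        key = (pv_name, precedence)
        if key in self._entries:
            # Error case: the PV does not get created.
            raise DuplicatePrecedence(f"{pv_name} at precedence {precedence}")
        self._entries[key] = server

    def servers_for(self, pv_name):
        """Map precedence -> server for every instance of pv_name."""
        return {prec: srv for (name, prec), srv in self._entries.items()
                if name == pv_name}
```

Because no two instances of a PV can share a precedence, "the server
with the highest precedence" is always well defined.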

For each PV, CA protocol automatically selects the server with the 
highest precedence. For an already connected channel, if a PV appears 
in the system with the same name but higher precedence, two things 
happen:

- client code is notified
- connection is re-routed to the new server

Thus, in the typical scenario with a primary server and a secondary 
backup (synchronized via some as yet unspecified method), the PVs on 
the primary server will all be configured to have higher precedence 
than the ones on the secondary. As soon as the primary IOC goes 
down, the secondary takes over without a fuss. Clients /will/ get 
notified, though, so maintenance can try to get the broken server 
running again. As soon as it comes up again, it will take over, again 
without a fuss. Clients keep running and working, no need to restart 
anything, no need to reconfigure anything.
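
The selection rule and the primary/backup scenario above can be
sketched like this, seen from the client library. This is only an
illustration: Channel, server_appeared, server_disappeared and the
on_reroute callback are invented names, and real connection handling
is left out entirely.

```python
class Channel:
    def __init__(self, pv_name, on_reroute):
        self.pv_name = pv_name
        self.on_reroute = on_reroute  # called as on_reroute(pv, old, new)
        self.offers = {}              # server -> precedence
        self.current = None           # server the channel is routed to

    def _reselect(self):
        # Highest precedence wins; precedences are assumed to be unique.
        best = max(self.offers, key=self.offers.get) if self.offers else None
        if best != self.current:
            old, self.current = self.current, best
            self.on_reroute(self.pv_name, old, best)  # client is notified

    def server_appeared(self, server, precedence):
        self.offers[server] = precedence
        self._reselect()

    def server_disappeared(self, server):
        self.offers.pop(server, None)
        self._reselect()
```

If the backup is offered at precedence 0 and the primary at 10, the
channel moves to the primary when it appears, to the backup when the
primary goes away, and back again when the primary returns. The
callback fires on each move, so none of them goes unnoticed.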

Second scenario: Testing in a production system. Let's assume you want 
to test the (formerly) primary server before it gets back online. You 
reconfigure it so that its PVs get default precedences lower than those 
of the secondary one. You start a new CA client with a restricted set of 
servers in its EPICS_CA_ADDR_LIST, i.e. excluding the backup server (now 
working as primary). This could be improved: we might have an 
additional variable EPICS_CA_EXCLUDE_ADDR_LIST to make this easier and 
more fool-proof. This client then never sees the PVs from the 
server you excluded, and thus connects to the one you just brought online. 
Note: it is extremely important that each client is notified whenever a 
channel gets re-routed. This minimizes the probability of such a change 
taking place accidentally and going unnoticed, which would be very bad. 
Such events should also be logged.
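
How such an exclude list might be applied, as a Python sketch. Note that
EPICS_CA_EXCLUDE_ADDR_LIST is only the variable proposed above, not an
existing one, and visible_servers is a made-up helper. Automatic
broadcast addresses (EPICS_CA_AUTO_ADDR_LIST) are ignored here to keep
the example short.

```python
import os


def visible_servers(environ=os.environ):
    """Addresses from EPICS_CA_ADDR_LIST minus the proposed exclude list."""
    include = environ.get("EPICS_CA_ADDR_LIST", "").split()
    # Hypothetical variable, proposed in this mail only.
    exclude = set(environ.get("EPICS_CA_EXCLUDE_ADDR_LIST", "").split())
    return [addr for addr in include if addr not in exclude]
```

The test client would then be started with the backup server's address
in the exclude list, leaving the rest of its environment untouched.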

I think the solution sketched above is straightforward and simple to 
work with. What's not so simple is to implement all this at the protocol 
level and in the client libraries.

Note: This in no way precludes directory services, even multiple 
redundant ones. It only means that locating such services and using 
them would not be visible to CA client programs. It also does not mean 
that we cannot have a low-level API, separate from the CA client API, 
that exposes all the details of service location and channel routing 
and allows manipulating them. But it /should/ be a completely 
separate interface, and the client API should not depend on it.

> Lastly I like the idea that e.g. ZeroC ICE has where I can use
> a directory but also specify the data source directly (IP & port #).
> That way I can work without a directory,
> override the directory for testing or use another directory
> mechanism.

But isn't the current way to do this, i.e. /outside/ the program, a lot 
more flexible? Do we really want to hard-code server IP addresses in CA 
clients and recompile them whenever an IP address changes? I wouldn't 
even hard-code domain names.

> So how about exposing these steps:
>
> Directory:
> * select directory server
>    Optional. There's a default directory.
> * query it for a PV
>    Returns list of PVs: CA server, type (primary, ...)
>
> Channel:
> * constructor(name, Directory = default)
> * constructor(name, CA server info)
>    If you want full control over the data source
> [...]

I cannot imagine an application that needs this kind of full control. 
Which do you have in mind?

Ben
