Experimental Physics and
Industrial Control System

Andrew Johnson <[email protected]> · Mon, 22 Nov 2010 17:23:39 -0500

Hi Ben,

On Sunday 21 November 2010 18:12:32 Ben Franksen wrote:
>
> I have made more experiments with ca_put_callback and discovered what I
> think is a serious bug.

There is a bug, but it's not where you think it is.  It's actually in the bo 
record's implementation of the HIGH delay, and fixing it may cause screams 
from some database maintainers.  I'll get back to this topic at the end of 
this message.

The code that implements putNotify is complicated, but it does still pass 
Marty Kraimer's regression tests and I believe it is still behaving as it was 
originally supposed to.

> If the two calls target different records, but the processing chains (or
> networks) overlap, then what happens is that the second call
> *immediately* gets a *successful* callback.

Actually I don't believe that's what's happening, but there is a very long-
standing non-standard behavior in the bo record that is causing what you're 
seeing.

The AppDevGuide section 15.4.4 says:

    6. In general a set of records may be processed as a result of a single
       dbPutNotify. If a record in the set is found to be active, either
       because PACT is true or because a putNotify already owns the record,
       then that record is not made part of the set of records that must
       complete before the putNotify request completes.

This is important as it helps explain the behavior you're seeing.

> This is not what I expected. I had assumed that the CA server would send
> back an error to the second client, saying that there is already
> another request pending. Indeed, there is an error code in caerr.h,
> namely
>
> #define ECA_PUTCBINPROG     DEFMSG(CA_K_ERROR,     45)
>
> that I assumed is there for this case. Not so.

There is code in src/rsrv that generates this error under some circumstances, 
but I don't think I need to try and describe what causes it.

> The problem is that this defeats the whole purpose of the
> ca_put_callback resp. putNotify which is to make sure that the
> processing really has completed. If another client, *any* client
> anywhere on the control network happens to issue a call that interferes
> with mine it might be that I get a success callback even though nothing
> has completed yet.

As I said above, that's not what's happening; your second ca_put_callback() 
_does_ complete the bo record processing, but the OUT link points to the seq 
record that has PACT=TRUE so as the AppDevGuide passage above says, the 
putNotify code drops the seq record from the list of records to wait for and 
replies almost immediately.  You could forward link the second bo to another 
seq record to prove that.

If your overall system design allows for overlapping putNotify operations 
you need to treat that as a resource conflict and protect (or duplicate) the 
records in the overlapping sets.

> This is a bug, IMO. If a client specifically asks for a callback to make
> sure something it wishes to happen in the server has actually happened,
> then the server must either wait for completion, however long it takes,
> or else respond with an error.

The promise that putNotify and ca_put_callback make is that all processing on 
the local IOC that was directly caused by your put operation has finished when 
you receive the callback.  Unfortunately it is not hard for individual record 
types or device support to break the requirements that putNotify depends on.

One of those requirements is that the call to recGblFwdLink(prec) in a 
record's process() routine come after _all_ processing.  The long-standing 
implementation of the bo record breaks that rule though, and I believe that's 
partly what is causing the behavior that you're seeing, see below...

> BTW, the scenario can be easily tested, this is my test db file (which I
> run with 'softIoc -d <dbfile>'):

Thanks that helps me understand your complaint.  The thing about the bo 
records is that using the HIGH delay field of the bo record does not trigger 
asynchronous completion; the record code calls recGblFwdLink(prec) at the end 
of processing the VAL=1 phase.  The act of processing the later VAL=0 phase is 
run by a private timer callback, and will not be seen by the putNotify code as 
being caused by your initial ca_put_callback().

> This is the test command line and the output:
>
> ben@sarun[3]: .../seq/work > time caput -c -w5 pvPutAsync1 1 &; sleep 1;
> time caput -c -w5 pvPutAsync2 1
> [1] 32140
> Old : pvPutAsync1                    Low
> Old : pvPutAsync2                    Low
> New : pvPutAsync2                    High
> caput -c -w5 pvPutAsync2 1  0,00s user 0,01s system 22% cpu 0,036 total
> New : pvPutAsync1                    Low
> caput -c -w5 pvPutAsync1 1  0,00s user 0,00s system 0% cpu 2,034 total
> [1]  + done       time caput -c -w5 pvPutAsync1 1

Your second caput is completing immediately because its bo record tries to 
process the OUT link but finds the seq.PACT=TRUE, so it gives up and tells 
putNotify that everything it started has finished.  A second later the HIGH 
delay kicks in and reprocesses the record, but as I said above that's not seen 
by the putNotify since it has already notified you.

You have several delays in your test which are interacting and may be making 
it hard for you to follow what's actually going on; try setting HIGH to 5 and 
doing a single caput -c to see what I meant about the bo reprocessing being 
disconnected from the putNotify.  It's also very instructive to set TPRO=1 on 
all your records as that tells you if the IOC managed to process a record 
"<task>: Process <rec>" or was if it was unable to because it was already busy 
"<task>: Active <rec>".

Ok, so there's a bug in the bo record; the simple fix is to only process the 
forward link of a bo record that has HIGH set at the end of the HIGH delay.  I 
don't think that would be very popular though, and there might be sites that 
depend on the existing ca_put_callback() to a bo record behavior, so even the 
slightly more complex fix that I can see isn't necessarily something we want 
to do.

Discuss...

HTH,

- Andrew

PS: It's thanksgiving week here in the US, I may not respond to email for the 
rest of this week.
-- 
If a man is offered a fact which goes against his instincts, he will
scrutinize it closely, and unless the evidence is overwhelming, he will
refuse to believe it.  If, on the other hand, he is offered something
which affords a reason for acting in accordance to his instincts, he
will accept it even on the slightest evidence.  -- Bertrand Russell

Experimental Physics and Industrial Control System

Experimental Physics and
Industrial Control System