EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: streamdevice + asyn stuck
From: Dirk Zimoch <[email protected]>
To: Mark Rivers <[email protected]>
Cc: Eric Norum <[email protected]>, "Zhang, Dehong" <[email protected]>, EPICS Tech Talk <[email protected]>
Date: Fri, 15 Jul 2011 09:26:28 +0200

Mark, Dehong,

It is true that StreamDevice and asyn are used a lot without problems. However, with every new version of asyn I have to add new hacks to StreamDevice because the asyn behavior and API change quite often. It may well be that one of my recent modifications broke something. Furthermore, it is almost impossible for me to test every combination of StreamDevice, asyn and EPICS base.

Nevertheless I certainly consider this problem a StreamDevice bug. StreamDevice should never hang up, regardless of any potentially misbehaving hardware. I will try to reproduce this problem as quickly as possible, but I am quite busy at the moment.

Dehong, what might help me is:
* StreamDevice debug output. Run "var streamDebug,1" in the IOC shell.
* A stack trace of the hanging IOC. Use pstack with the PID of the IOC process.
* Your copy of the StreamDevice source code (to see which patches have been applied).

Dirk


Mark Rivers wrote:
Hi Dehong,

> I try very hard not to set the records to "SCAN ** second", but let them listen
> to certain events and process at different phases, so that I can be sure they
> don't ever scan at the same moment.

That should not be a problem at all.  Your devices are all talking to an underlying asyn port driver.  You said you are using a Digi, so I assume you are using the drvAsynIPPort driver?  If so, it takes care of serialization so that it is fine for multiple records to be talking to the IP port "at the same moment".  The only thing you need to worry about is that you use device support that guarantees that write/reads are atomic.  The standard asyn device support does this, and I am sure that streamDevice does as well.
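
A minimal sketch of the serialization idea described above, not the asyn implementation. The names ToyPort, enqueue, processAll and runThreeRecords are hypothetical. The sketch only models the point that each queued transaction (a write followed by its read) runs to completion before the next one starts, so replies cannot be mixed up between records.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model of a port with a request queue. This is NOT the
// asyn API; it only illustrates why queued transactions do not interleave.
class ToyPort {
public:
    using Transaction = std::function<void(ToyPort &)>;

    // A record queues a whole transaction instead of touching the device.
    void enqueue(Transaction t) { queue_.push_back(std::move(t)); }

    // The port runs one transaction at a time, start to finish.
    void processAll() {
        while (!queue_.empty()) {
            Transaction t = std::move(queue_.front());
            queue_.pop_front();
            t(*this);
        }
    }

    void write(const std::string &cmd) { lastCommand_ = cmd; }

    // The toy device simply echoes the last command it received.
    std::string read() const {
        assert(!lastCommand_.empty());  // a read must follow a write
        return "reply:" + lastCommand_;
    }

private:
    std::deque<Transaction> queue_;
    std::string lastCommand_;
};

// Three "records" each queue a write/read pair; returns the replies in order.
std::vector<std::string> runThreeRecords() {
    ToyPort port;
    std::vector<std::string> replies;
    const std::vector<std::string> commands = {"A?", "B?", "C?"};
    for (const std::string &cmd : commands) {
        port.enqueue([cmd, &replies](ToyPort &p) {
            p.write(cmd);                 // write ...
            replies.push_back(p.read());  // ... and read in the same transaction
        });
    }
    port.processAll();
    return replies;
}
```

Because every record hands the port a complete write/read pair, it does not matter that all three were queued "at the same moment": each reply matches its own command.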

streamDevice with asyn has been used pretty extensively for systems like yours, without such problems.  If the network or Digi is sometimes messing up then you should just get an asyn timeout.

I am beginning to suspect what Eric Norum suggested, i.e. that some other component in the system is perhaps running amok and causing this problem as a side-effect.

Here is a suggestion:  make a bare-bones IOC which only runs this device.  Then you can rule out other software as causing the problem.

I have looked through your messages and I don't see mention of what OS you are using?  Linux, RTEMS, etc?

Also you are running asyn R4-10 which is 3 years old, and base 3.14.9 which is 4.5 years old.  You could try base 3.14.12.1 and asyn 4-16 to see if the problem persists.

Mark

________________________________

From: Zhang, Dehong [mailto:[email protected]]
Sent: Thu 7/14/2011 8:11 PM
To: Eric Norum; [email protected]; Mark Rivers
Cc: EPICS Tech Talk
Subject: RE: streamdevice + asyn stuck



Hi Dirk, Eric and Mark,

The effect of my hack is unknown -- the problem is still happening.
Here is what I did (my system does not use any wait command):

bool StreamCore::
startProtocol(StartMode startMode)
{
added:

    // start a global timeout watch-dog
    flags |= WaitPending;
    startTimer(10000);

right before:
    return evalCommand();
}


void StreamCore::
finishProtocol(ProtocolResult status)
{
added:
    flags &= ~WaitPending;

in the very beginning, before:
    if (flags & BusPending)
}


bool StreamCore::
evalCommand()
{
changed the first line to:
    if ((flags & ~WaitPending) & BusPending)
}


void StreamCore::
timerCallback()
{
commented out:
//      error("StreamCore::timerCallback(%s) called unexpectedly\n",
//          name());

changed the last line:
//  evalCommand();

    // it has been too long, abort the protocol
    finishProtocol(Fault);
}

Please check this.  If my implementation is correct, then there is only one suspect
left: the callbackRequest in

void Stream::
protocolFinishHook(ProtocolResult result)

Is this handled by Epics base?  The parameter callbackQueueSize is rather large,
at 2000.  Should I try to increase it?  Guess I need to do it before iocInit.

Another thing, the callback priority seems to be set to low.  Should I raise it?

I try very hard not to set the records to "SCAN ** second", but let them listen
to certain events and process at different phases, so that I can be sure they
don't ever scan at the same moment.  After all this, there should not be much
chance of resource contention.  I feel that the problem is happening less
often, but it is still there.

Any ideas?

Thanks much for helping!  Best regards,
Dehong




________________________________________
From: Eric Norum [[email protected]]
Sent: Wednesday, July 13, 2011 6:46 PM
To: Zhang, Dehong
Cc: EPICS Tech Talk
Subject: Re: streamdevice + asyn stuck

Intermittent strange behaviour sometimes results from a runaway pointer elsewhere in the program, stack overflow, or accessing malloc'd memory beyond its range or after it's been freed.  All of which can be tricky to track down...

On Jul 13, 2011, at 6:26 PM, Zhang, Dehong wrote:

Hi Dirk and Mark,

The problem is not the mutex.  I followed Mark's suggestion and did not see
any mutex stuck -- very often got an empty list.

The problem does not happen very often.  It seems the (random) frequency
depends on the time/environment?  I am having a hard time catching a case
before the log file reaches the 2 GB limit.

Honestly, I really admire Dirk's effort to put together such a big piece of work!
It is so useful!  And so flexible of course!  I did not know anything about it, and
can only be amazed the more I read the documents and code.  I don't think
there is anything wrong.  It is only the situation we are dealing with: the
computer, the network, the Digi, then all kinds of hardware of different quality.
I strongly believe the problem is caused by losing communication, due to
network glitches, or the Digi dropping messages (I have definitive proof of this),
or the firmware of the actual hardware.

Since the package already has a timer, and one of my systems is not using
(yet) the wait command, I hacked the code slightly, to implement a global
timeout -- if the protocol does not finish within like 10 seconds, the timer will
abort the protocol.  This should "catch-all".  Let's watch it for a few days!
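
A minimal sketch of the "catch-all" timeout idea, assuming a simulated clock instead of a real timer. ToyProtocol, Result, onReply and onTick are hypothetical names and are not part of StreamDevice. The sketch only shows the intended behavior: a run that never receives its callback ends in a fault at the deadline, and a late timer tick does not override a run that already finished.

```cpp
#include <cassert>

// Hypothetical outcome codes; not StreamDevice's ProtocolResult.
enum class Result { Idle, Running, Success, Fault };

// Toy protocol run with a global deadline, driven by a simulated clock.
class ToyProtocol {
public:
    explicit ToyProtocol(long timeoutMs) : timeoutMs_(timeoutMs) {
        assert(timeoutMs > 0);
    }

    // Start a run and arm the watchdog.
    void start(long nowMs) {
        deadlineMs_ = nowMs + timeoutMs_;
        result_ = Result::Running;
    }

    // Called when the driver delivers the expected callback.
    void onReply() {
        if (result_ == Result::Running) result_ = Result::Success;
    }

    // Called by the timer; aborts the run once the deadline has passed.
    void onTick(long nowMs) {
        if (result_ == Result::Running && nowMs >= deadlineMs_)
            result_ = Result::Fault;
    }

    Result result() const { return result_; }

private:
    long timeoutMs_;
    long deadlineMs_ = 0;
    Result result_ = Result::Idle;
};

// The callback is lost: the run ends in Fault instead of hanging forever.
Result runWithLostCallback() {
    ToyProtocol p(10000);
    p.start(0);
    p.onTick(5000);    // still waiting
    p.onTick(10000);   // deadline reached, no reply ever came
    return p.result();
}

// The callback arrives in time: a later tick must not change the result.
Result runWithReply() {
    ToyProtocol p(10000);
    p.start(0);
    p.onTick(5000);
    p.onReply();
    p.onTick(20000);
    return p.result();
}
```

The watchdog only acts while the run is still in progress, which is why finishing a run has to disarm it (here by leaving the Running state).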

Any new ideas?

Best regards,
Dehong




________________________________________
From: Dirk Zimoch [[email protected]]
Sent: Wednesday, July 13, 2011 12:47 AM
To: Zhang, Dehong
Cc: [email protected]; [email protected]
Subject: Re: streamdevice + asyn stuck

Hello Dehong,

Zhang, Dehong wrote:
> Hi Dirk and Mark,
>
> Thank you very much for helping!
>
> I tried to change the SCAN to "10 second", it did not help.  I also tried the
> epicsThreadShowAll command as Dirk suggested, it shows a few timerQueue
> threads, all "OK"

Some time ago I had the problem that I crashed the timerQueue thread in
a timeout callback and thus never completed record processing. However,
this seems to be a different issue because you don't see crashed threads.

> From what I can see from the source code, I believe that we somehow
> lose callbacks from asyn, that breaks the evalCommand() chain and we
> never get to finishProtocol().

Yes. That seems to be the case. At several places, StreamDevice gives
control to asyn or to EPICS base and relies on callbacks. If the
callbacks don't come, StreamDevice is lost.

I have to check if I did anything wrong there. Basically I give control
to asynDriver using queueRequest() and expect a callback either to
handleRequest() or handleTimeout(). From there I call the individual
event handlers. So maybe either my queueRequest() setup is wrong or my
handlers are buggy or asynDriver never calls back. (all code in
AsynDriverInterface.cc)

Later I give control to EPICS base via callbackRequest() to finish
processing the record. But I don't think that there is something wrong.
(code in StreamEpics.cc)

I will try to add some diagnostics to see what a record is currently
waiting for. But that may take a while.

> Another possibility is the mutex.  If it is not released properly, the
> Stream::process() method will be stuck waiting.

The mutex is automatically released (via destructor) when the function
that contains the lock returns. Thus the only way to keep the lock is to
get stuck in a function while holding the lock. But because I don't
think I have unbounded loops in any function, that can only happen when
a thread crashes or a function I call does not return. I don't think
that happens.
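
A minimal sketch of release-via-destructor locking as described above, assuming a toy lock with a visible flag in place of a real mutex. ToyMutex, Guard, process and lockAlwaysReleased are hypothetical names and are not taken from StreamDevice. The sketch shows that the lock is released on every way out of the function, including an early return and an exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical toy lock with an observable state; stands in for a real mutex.
struct ToyMutex {
    bool locked = false;
    void lock() {
        assert(!locked);  // non-recursive: must not already be held
        locked = true;
    }
    void unlock() { locked = false; }
};

// RAII guard: locks in the constructor, unlocks in the destructor.
class Guard {
public:
    explicit Guard(ToyMutex &m) : m_(m) { m_.lock(); }
    ~Guard() { m_.unlock(); }
    Guard(const Guard &) = delete;
    Guard &operator=(const Guard &) = delete;

private:
    ToyMutex &m_;
};

// Holds the lock for the whole call, whichever way the function exits.
int process(ToyMutex &m, int value) {
    Guard guard(m);
    if (value < 0) throw std::invalid_argument("negative value");
    if (value == 0) return 0;  // early return
    return value * 2;          // normal return
}

// Returns true if the lock is free again after every exit path.
bool lockAlwaysReleased() {
    ToyMutex m;
    process(m, 0);
    bool ok = !m.locked;
    process(m, 21);
    ok = ok && !m.locked;
    try {
        process(m, -1);
    } catch (const std::invalid_argument &) {
        // the guard's destructor ran during stack unwinding
    }
    return ok && !m.locked;
}
```

With this pattern the only way to keep the lock is never to leave the function, which matches the reasoning above: a crashed thread or a call that does not return.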

Please switch on debugging (var streamDebug,1). Better remove or disable
all other StreamDevice records or the output will be overwhelming. What
are the last messages you get from a record before it gets stuck?


Dirk




________________________________________
From: Dirk Zimoch [[email protected]]
Sent: Tuesday, July 12, 2011 12:21 AM
To: Zhang, Dehong
Cc: [email protected]; [email protected]
Subject: Re: streamdevice + asyn stuck

Hello Dehong,

Do you have a hung up timerQueue thread? Try epicsThreadShowAll.

Dirk

Zhang, Dehong wrote:
Hi Dirk, Mark and fellow EPICSers,

Sorry to bother you with this.  I tried to chase the source code and search the tech-talk,
but could not find many hints.

We use base 3.14.9, streamdevice 2.4 and asyn 4.10, and have been experiencing this stuck
problem for some time.  The symptom is that randomly, a mbbi/ai/bi... (input) record would
stop updating and no longer does FLNK, with no error/warning messages printed in the log
file.  And dbpr *** 3 would show that:
LCNT=11
SEVR=INVALID
STAT=SCAN

And manually writing to the PROC field also does nothing.

It seems like the record is stuck waiting for a callback from streamdevice (and asyn), and
the EPICS framework just ignores any "PROC" requests.

In the protocol file we do have the timeouts etc:
locktimeout  = 5000;
terminator   = CR;
replytimeout = 1000;
readtimeout  = 1000;
extrainput   = Ignore;

While chasing the problem, I noticed that "asynReport 10 ***" would show:
multiDevice:No canBlock:Yes autoConnect:Yes
   enabled:Yes connected:Yes numberConnects 1

For some the numberConnects can be 2.

Is this "canBlock:Yes" related to my problem?  Should the numberConnects be <=1?
We use one port to talk to one device.

Rebooting the IOC does fix the problems, but then they would come back, randomly.

Please advise.  Thank you very much.

Best regards,
Dehong




--
Eric Norum
[email protected]








