EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: GPIB + vxStats hangs IOC
From: Benjamin Franksen <[email protected]>
To: EPICS Techtalk <[email protected]>
Date: Thu, 01 Aug 2002 16:55:06 +0200
Hello,

At BESSY we currently have two IOCs that perform GPIB I/O over an HP
E2050A LAN/GPIB gateway, using the EPICS GPIB support, currently R0-3,
and EPICS base 3.13.6.

Both IOCs occasionally hang and need to be rebooted.

The problem is at least one or two years old and was already present
with the old lanGpib2.4 and EPICS R3.13.2. It just occurs more often
now, probably because the latest additions to the databases do more
GPIB I/O.

Each of the IOCs controls exactly one gateway and therefore one GPIB
segment. We already stripped them of any other task. The only records
not GPIB related are the ones that monitor the IOC using the devVxStats
device support (part of base).

I am not aware of anyone using the EPICS GPIB support who reported a
similar problem.

When the IOC goes down, we see this:

iocIOC2X250C> 
Access Fault
Program Counter: 0x00069a16
Status Register: 0x3000
Access Address : 0x5f3f046f
Special Status : 0x0525
Task: 0xeff324 "cbLow"
filename="../taskwd.c" line number=175
task eff324 cbLow suspended
CAS: request from 192.168.21.99:51810 => "put call back time out"

At this point the IOC hangs so completely that the shell no longer
executes commands. Interrupting a command with Ctrl-C gives the following:

iocIOC2X250C> memShow

(nothing happens, so I enter Ctrl-C)

 7865c _vxTaskEntry   +10 : _shell (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
 41538 _shell         +138: 41556 ([1, 0, 0, 4135c, 0])
 416d8 _shell         +2d8: _execute (d248de)
 417fc _execute       +ac : _yyparse ([0, 1, 0, d248de, 0])
 45568 _yyparse       +16 : _malloc (960)
 69476 _malloc        +e  : _memPartAlloc ([a72ea, 960, d248a0, 4556a, 960])
 6910a _memPartAlloc  +4a : _semMTake ([a72ea, 960, 4, d24854, 6947a])
tShell restarted.

IMHO, this can only mean that the suspended cbLow task still holds the
mutex that protects the VxWorks memory allocator's internal structures.
A very unfortunate situation: I can see no way to delete the task and
free the semaphore without shell interaction, which in turn needs to
allocate memory and thus waits forever... It seems that every effort to
diagnose the situation post mortem is doomed to fail. I can't even get a
stack trace for the cbLow task!
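
The reasoning above can be sketched as a small model in C. This is only an illustration and does not use any VxWorks API: the types `Task` and `Mutex` and the function `is_stuck` are hypothetical names. A pended task is considered stuck if the chain of mutex owners it waits on ends at a suspended task (or loops back on itself).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task states (not the VxWorks definitions). */
typedef enum { TASK_READY, TASK_PENDED, TASK_SUSPENDED } TaskState;

typedef struct Task Task;

/* A mutex is modelled only by its current owner (NULL if free). */
typedef struct Mutex {
    const Task *owner;
} Mutex;

struct Task {
    const char  *name;
    TaskState    state;
    const Mutex *waiting_on;   /* mutex this task is pended on, or NULL */
};

enum { MAX_HOPS = 64 };        /* bound on chain length; exceeding it means a cycle */

/* Return 1 if task t can never make progress, 0 otherwise. */
int is_stuck(const Task *t)
{
    int hops = 0;
    assert(t != NULL);
    while (t != NULL && hops++ < MAX_HOPS) {
        if (t->state == TASK_SUSPENDED)
            return 1;              /* nobody will ever release what it holds */
        if (t->waiting_on == NULL)
            return 0;              /* task is not waiting on anything */
        t = t->waiting_on->owner;  /* follow the chain to the mutex owner */
    }
    return t != NULL;              /* hop limit reached: treat as a wait cycle */
}
```

With a suspended `cbLow` owning the allocator mutex and `tShell` pended on that mutex, `is_stuck` reports `tShell` as stuck, which matches the stack trace ending in `_semMTake`.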

BTW, although there are no memory leaks in a running IOC, "memShow"
reports a steadily and quickly rising number (~1400 bytes/sec) for the
*accumulated* allocation. This is due to the RPC library calls, where
structures are continually allocated and deallocated to support
conversion between host and network format. AFAIK, this is inherent in
the RPC library interface - memory is never explicitly allocated by the
GPIB support after initialization is done.
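
To illustrate why the accumulated figure can rise while nothing leaks, here is a minimal sketch in C. It does not use the VxWorks or RPC APIs: `AllocStats`, `counted_alloc`, `counted_free`, and `simulate_rpc_calls` are hypothetical names, and the buffer size passed in is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical counters, loosely mirroring what an allocator might report. */
typedef struct {
    size_t current_bytes;     /* allocated and not yet freed */
    size_t cumulative_bytes;  /* total ever allocated; never decreases */
} AllocStats;

/* Allocate n bytes and update both counters on success. */
void *counted_alloc(AllocStats *st, size_t n)
{
    void *p = malloc(n);
    if (p != NULL) {
        st->current_bytes    += n;
        st->cumulative_bytes += n;
    }
    return p;
}

/* Free a block of n bytes; only the current counter goes down. */
void counted_free(AllocStats *st, void *p, size_t n)
{
    if (p == NULL)
        return;
    assert(st->current_bytes >= n);
    free(p);
    st->current_bytes -= n;
}

/* Model RPC-style calls that allocate a temporary conversion buffer
 * and release it again before returning. */
void simulate_rpc_calls(AllocStats *st, unsigned calls, size_t bytes_per_call)
{
    for (unsigned i = 0; i < calls; i++) {
        void *buf = counted_alloc(st, bytes_per_call);
        counted_free(st, buf, bytes_per_call);
    }
}
```

After any number of simulated calls, `current_bytes` is back at zero while `cumulative_bytes` has grown by the size of every temporary buffer, which is the pattern described above.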

The access fault message above (the first one) points to the location
where the access fault happened, which is inside the vxWorks routine
memPartInfoGet:

iocIOC2X250C> lkAddr 0x00069a16
0x0006997c _memPartInfoGet           text    
0x00069a94 _mmu40LibInit             text    
.....

which is used by devVxStats to compute relative memory consumption.
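
The address lookup can be reproduced by hand: take the last symbol whose address is not greater than the program counter. The following C sketch shows this with the two entries printed above; `Symbol`, `symtab`, and `find_symbol` are hypothetical names and not part of the vxWorks symbol table API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a symbol table sorted by ascending address. */
typedef struct {
    uint32_t    addr;
    const char *name;
} Symbol;

/* The two text symbols reported by lkAddr in the post. */
const Symbol symtab[] = {
    { 0x0006997c, "_memPartInfoGet" },
    { 0x00069a94, "_mmu40LibInit"   },
};

/* Return the last symbol with addr <= pc, or NULL if pc lies below
 * the first entry. On success *offset is pc minus the symbol address. */
const Symbol *find_symbol(const Symbol *tab, size_t n,
                          uint32_t pc, uint32_t *offset)
{
    const Symbol *best = NULL;
    assert(tab != NULL && offset != NULL);
    for (size_t i = 0; i < n && tab[i].addr <= pc; i++)
        best = &tab[i];
    if (best != NULL)
        *offset = pc - best->addr;
    return best;
}
```

For the program counter 0x00069a16 this yields `_memPartInfoGet` with an offset of 0x9a, so the fault lies inside that routine and before `_mmu40LibInit`.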

As a test, I disabled the record that does the memory check. (Note that
even though devVxStats is synchronous, the records are I/O interrupt
scanned and therefore processed by a callback task.)

The IOC has now been running since Jul 31 15:32:42 and has not hung
again so far. The previous runs lasted 25, 22, and then 4 hours.

I have a bad feeling that, even in the improbable and very fortunate
case that the IOC keeps running, we have merely cured the symptom,
not the cause :-(

Any suggestions about what might be at the root of our problem or what
we could do to further analyze it would be warmly appreciated.

Ben

PS: Somehow this reminds me of the curious memory freelist bug that once
appeared in our control system network: an IOC that did exactly nothing
besides loading the kernel (empty startup file) produced an access
fault when memShow was called from the shell. Strangely, this happened
only in our controls network - the identical setup with the *same* CPU
board worked fine in our development network.
