EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: Re: Useless error messages in EPICS
From: Andrew Johnson <[email protected]>
To: [email protected]
Date: Mon, 2 Mar 2009 14:18:31 -0600

Hi Dirk,

On Monday 02 March 2009 06:49:05 Dirk Zimoch wrote:
>
> 1. The error message does not give any hint of the location at all.
>
> 2. The reported location is inside a (EPICS) system function or a macro,
> but the caller who actually caused the error is not reported.
>
> 3. The reported location is inside the source code of a file parser but the
> error location in the faulty file is not reported.

Note that the DBD and DB file parser underwent some major improvements to its 
error reporting between R3.14.9 and R3.14.10 so now it does indicate (in 
excruciating detail in some cases) the location of any errors in the input 
file.  If you find further issues like this in R3.14.10 or later, please 
report them.

> When looking into epicsMutexOsdLock(), I found that the calling thread is
> suspended via a call to cantProceed() when an error happens. In my opinion,
> this is as well not very helpful because the calling function does not get
> any indication of the failure and thus cannot print a proper error message.

In general cantProceed() was intended to be used when there really is no way 
for operations to continue if some particular condition is detected; it is 
usually an indication there's a bug in the code.  For example, if we run out 
of memory when loading a .db file, there's no point returning an error code 
to the parser because it isn't going to be able to do anything to recover 
from that.  By catching such errors early and not returning at all, we save 
the caller from having to check for and pass the error condition upwards, 
thus reducing code size and complexity in the caller.  In C++ we would use an 
exception to flag the problem without adding caller complexity, but this 
isn't C++ code.

We should never be calling cantProceed() from any code path that is visited 
during normal operation, thus the fact that it is called implies that there 
is a bug somewhere which needs to be fixed.
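
As a rough illustration of the "fail early and never return" pattern described above, here is a minimal sketch. It does not use the EPICS sources: must_calloc(), fatal_stub() and load_table() are hypothetical names, and the stand-in calls abort() only to stay portable, whereas cantProceed() suspends the calling thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a "cannot continue" routine.  It never
 * returns.  (Assumption: abort() is used here only for portability;
 * the routine discussed above suspends the thread instead.) */
static void fatal_stub(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    abort();
}

/* Allocate zeroed memory, or do not return at all. */
static void *must_calloc(size_t count, size_t size, const char *what)
{
    void *p = calloc(count, size);
    if (p == NULL)
        fatal_stub(what);
    return p;
}

/* The caller needs no NULL check and no error path of its own. */
static int *load_table(size_t n)
{
    int *table = must_calloc(n, sizeof *table, "load_table: out of memory");
    table[0] = 1;
    return table;
}
```

The point of the sketch is the shape of load_table(): because the allocator either succeeds or never returns, the caller carries no error-handling code.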

In your particular case I believe the posix implementation of the routine 
epicsMutexOsdLock() is being passed a pointer to something that is no longer 
an epicsMutexOSD object, causing the pthread_mutex_lock() call to return 
EINVAL ("Invalid argument").  This indicates there's code somewhere which is 
using an object after it has been destroyed.  I don't know why 
epicsMutexOsdLock() doesn't just return epicsMutexLockError in this case 
though; it already does that if you pass in a NULL for pmutex.  I will change 
that, although it won't fix the underlying problem and may result in some 
even less desirable symptoms appearing if/when it occurs.
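
A minimal sketch of the kind of change mentioned above, assuming a plain pthread mutex rather than the real epicsMutexOSD structure: any non-zero result from pthread_mutex_lock() is reported to the caller as a status instead of stopping the thread. The names lock_status, LOCK_OK, LOCK_ERROR and checked_lock() are hypothetical and are not the EPICS implementation.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical status codes, loosely modelled on an OK/error pair. */
typedef enum { LOCK_OK, LOCK_ERROR } lock_status;

/* Lock the mutex and hand any failure back to the caller. */
static lock_status checked_lock(pthread_mutex_t *m)
{
    int rc;

    if (m == NULL)
        return LOCK_ERROR;          /* no object at all */

    rc = pthread_mutex_lock(m);     /* 0 on success, an errno value otherwise */
    return rc == 0 ? LOCK_OK : LOCK_ERROR;
}
```

Note that a caller which ignores the returned status would carry on without holding the lock, which is one way the less desirable symptoms could show up.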

> In my opinion, cantProceed() does more harm than good as long as it is not
> able to provide useful debug information (at least a stack trace) -- which
> is probably hard to do in a portable way. I think, a call to abort() would
> be more helpful as it produces a process dump (at least on Unix systems)
> which can be analyzed.

Calling abort() on vxWorks destroys all information about the problem location 
since there is no equivalent of a core dump file (at least on vxWorks 5.x).  
The cantProceed() behavior of suspending the calling thread makes debugging 
possible on most (all?) architectures by allowing a human to request a 
backtrace of the relevant thread.  On Unix-like systems you can do that by 
attaching a system debugger (gdb --pid=<pid> on Linux) to the frozen process.

EPICS also provides a fully-documented mechanism (in the AppDevGuide Section 
16.3) called the task watchdog that monitors tasks that register themselves 
with it and can execute callback functions when any such tasks are suspended.  
Most EPICS threads register themselves on start-up and un-register again when 
they shut down.  If a program wants to recover from a cantProceed() error by 
generating an abort(), it can register a task watchdog callback to do just 
that.
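
To make the idea concrete, here is a small self-contained model of a watchdog that notifies registered callbacks when it finds a suspended task. It deliberately does not use the EPICS task watchdog API; watch_register(), watch_notify_suspended() and on_suspend() are hypothetical names, and the AppDevGuide remains the reference for the real interface.

```c
#include <stddef.h>

#define MAX_WATCHERS 4

/* Hypothetical callback type: user pointer plus the suspended task's id. */
typedef void (*suspend_cb)(void *user, int task_id);

static struct { suspend_cb fn; void *user; } watchers[MAX_WATCHERS];
static size_t n_watchers = 0;

/* Register a callback; returns 0 on success, -1 if rejected. */
static int watch_register(suspend_cb fn, void *user)
{
    if (fn == NULL || n_watchers == MAX_WATCHERS)
        return -1;
    watchers[n_watchers].fn = fn;
    watchers[n_watchers].user = user;
    n_watchers++;
    return 0;
}

/* Called by the simulated watchdog when it sees a suspended task. */
static void watch_notify_suspended(int task_id)
{
    size_t i;
    for (i = 0; i < n_watchers; i++)
        watchers[i].fn(watchers[i].user, task_id);
}

/* Example policy: record which task failed.  A program that wants to
 * be restarted by a supervisor might call abort() here instead. */
static void on_suspend(void *user, int task_id)
{
    int *last_failed = user;
    *last_failed = task_id;
}
```

The policy lives entirely in the callback, so the low-level code only reports the suspension and the application decides what to do about it.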

> Generally, error handling should not be part of the low-level functions.
> These functions do not have any application knowledge and thus cannot
> decide if suspending the thread, terminating the program, printing an error
> message or continuing normally is the "correct" behavior. Instead,
> low-level functions should only report errors (by exceptions or return
> values -- I am not religious in this matter) and leave the choice of the
> correct response to the caller.

In general I agree with you, but sometimes the original API was designed 
without the possibility of reporting errors to the caller (say, because no 
errors could ever occur when the API was first introduced), and it's now 
called from many other pieces of code outside of Base.  Just adding an error status 
return value to a function that used to return void doesn't help because the 
unmodified callers will completely ignore the new return value and the 
compiler won't flag that.  We do use exceptions in C++ code, but obviously 
can't in C APIs.
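
The problem with retrofitting a status return can be seen in a short sketch. set_limit() and legacy_caller() are hypothetical names invented for this illustration; the assumption is that version 1 of the function returned void and version 2 returns a status.

```c
#include <stddef.h>

/* Version 1 (hypothetical) was:  void set_limit(int *limit, int value);
 * Version 2 adds a status: 0 on success, -1 if the value is rejected. */
static int set_limit(int *limit, int value)
{
    if (limit == NULL || value < 0)
        return -1;
    *limit = value;
    return 0;
}

/* A caller written against version 1 still compiles unchanged.  The
 * new status is silently discarded, so the rejected value goes
 * unnoticed and the old limit stays in place. */
static int legacy_caller(void)
{
    int limit = 10;
    set_limit(&limit, -5);      /* error status dropped here */
    return limit;
}
```

With default settings a C compiler accepts legacy_caller() as written, which is why the new return value alone does not make old callers any safer.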

> In this case, the gateway (maybe the CAS library) stopped accepting new
> clients which made it effectively useless. On the other hand, it did not
> terminate. Thus any mechanism to restart the gateway on failure does not
> work, too. I do not consider this a "useful" behavior.

The gateway should register a task watchdog callback to make sure that it can 
be restarted in this circumstance; it probably needs to register the main 
thread to be monitored as well since it isn't by default.  However I'm hoping 
that Jeff has fixed or will fix your underlying problem too.

> It may be a goal of the next codathlon to improve the error messages
> provided by EPICS.

The list of people attending is available on the website; feel free to 
encourage someone to work on this issue, which I've added to my list of 
tasks.

Thanks,

- Andrew
-- 
The best FOSS code is written to be read by other humans -- Harald Welte


References:
Useless error messages in EPICS Dirk Zimoch
