EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: EPICS CA problems
From: "Mark Rivers" <[email protected]>
To: "Jeff Hill" <[email protected]>, "tech-talk" <[email protected]>
Cc: Antonio Lanzirotti <[email protected]>
Date: Tue, 26 Oct 2010 10:18:54 -0500
Hi Jeff,
 
Thanks for the reply, that is very useful.  The IOC is running a detector (XIA xMAP) that I wrote the driver for, so there is a good chance it is a bug in my driver that is hanging up a CA thread as you described.
 
The system just hung up again, and when it did we ran epicsMutexShowAll to look for a deadlock, etc.
 
Here is the output:
 
epics> epicsMutexShowAll 1
ellCount(&mutexList) 59546 ellCount(&freeList) 23
epicsMutexId 0x12ec13e0 source ../caservertask.c line 732

So there is only 1 epicsMutex that is currently locked, and it is a caservertask that has the lock.  Is this consistent with your hypothesis?
 
Thanks,
Mark
 
________________________________

From: Jeff Hill [mailto:[email protected]]
Sent: Mon 10/25/2010 4:22 PM
To: Mark Rivers; 'tech-talk'
Cc: Antonio Lanzirotti
Subject: RE: EPICS CA problems



> TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
>         Task Id=0x12ec19d0, Socket FD=16
>         Secs since last send 131528.55, Secs since last receive 140881.60
>         Unprocessed request bytes=8296, Undelivered response bytes=0
>         State=up
>         377424 bytes allocated
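
The stalled circuit in the casr block quoted above can also be picked out mechanically. What follows is only a minimal Python sketch, assuming the report has been captured as text and keeps the layout shown in this thread. The function name `find_stalled_circuits`, the 60-second threshold, and the regular expressions are illustrative assumptions, not part of EPICS.

```python
import re

# Patterns follow the casr layout quoted in this thread (assumed, not an EPICS API).
HEADER = re.compile(r"^(TCP|UDP) (\S+?):(\d+)\((.*?)\):")
TIMES = re.compile(
    r"Secs since last send\s+([\d.]+), Secs since last receive\s+([\d.]+)")
BYTES = re.compile(
    r"Unprocessed request bytes=(\d+), Undelivered response bytes=(\d+)")


def find_stalled_circuits(report, min_idle_secs=60.0):
    """Return TCP circuits with pending request bytes and no recent receive."""
    stalled = []
    current = None
    for raw in report.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail quoting
        m = HEADER.match(line)
        if m:
            current = {"proto": m.group(1), "host": m.group(2),
                       "port": int(m.group(3))}
            continue
        if current is None:
            continue
        m = TIMES.search(line)
        if m:
            current["since_send"] = float(m.group(1))
            current["since_recv"] = float(m.group(2))
            continue
        m = BYTES.search(line)
        if m:
            current["unprocessed"] = int(m.group(1))
            if (current["proto"] == "TCP" and current["unprocessed"] > 0
                    and current.get("since_recv", 0.0) >= min_idle_secs):
                stalled.append(current)
            current = None  # block finished; wait for the next header
    return stalled


# Values copied from the casr output posted in this thread.
SAMPLE = """\
TCP 172.16.1.21:1726(X26A-Data): User="X26A User", V4.11, 1755 Channels,
Priority=0
        Task Id=0x12a022b0, Socket FD=15
        Secs since last send   0.02, Secs since last receive   0.02
        Unprocessed request bytes=0, Undelivered response bytes=0
        State=up
TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
Priority=0
        Task Id=0x12ec19d0, Socket FD=16
        Secs since last send 131528.55, Secs since last receive 140881.60
        Unprocessed request bytes=8296, Undelivered response bytes=0
        State=up
UDP 172.16.1.21:2733(): User="", V4.11, 0 Channels, Priority=0
        Task Id=0x1293c4e0, Socket FD=11
        Secs since last send 131525.68, Secs since last receive   3.06
        Unprocessed request bytes=16, Undelivered response bytes=0
        State=up
"""

for circuit in find_stalled_circuits(SAMPLE):
    print(circuit["host"], circuit["port"], circuit["unprocessed"])
```

The check deliberately ignores the UDP entry and any circuit whose unprocessed byte count is zero, so only a TCP circuit that has both a backlog and a long idle time is reported.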

There are no response bytes pending, but request bytes _are_ pending (for a
long time). I am going to go out on a limb and guess that the server's
per-client receive thread is trapped in db_put_field waiting for the db scan
lock, or waiting in the device support's signal write function - probably
due to a device driver issue. The server's per-client lock is held by the
receive thread in this situation, and that would also shut down subscription
updates to this client. The situation can be diagnosed in the debugger.
Typically the server's per-client receive thread is parked in socket
receive, and the server's per-client send thread is parked in event flag
wait. A thread wedged in db_put_field would point to a device driver issue;
a thread always waiting for the same lock in the same place would point to a
deadlock.
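
The blocking pattern described above can be illustrated with a toy model. This is a Python sketch under stated assumptions, not the CA server's C code: `client_lock` stands in for the per-client lock, `driver_done.wait()` stands in for a driver write that does not return, and all names are hypothetical.

```python
import threading

client_lock = threading.Lock()    # stands in for the server's per-client lock
driver_done = threading.Event()   # set when the simulated driver call returns
in_driver = threading.Event()     # set once the receive thread is inside the call


def receive_thread():
    # Processes a put request while holding the per-client lock.
    with client_lock:
        in_driver.set()
        driver_done.wait()        # simulated driver write that does not return


def try_send_update(timeout):
    # The send path needs the same lock to deliver a subscription update.
    if client_lock.acquire(timeout=timeout):
        client_lock.release()
        return True
    return False


rx = threading.Thread(target=receive_thread, daemon=True)
rx.start()
in_driver.wait()                  # make sure the lock is held before testing

blocked = try_send_update(timeout=0.2)    # lock is held by the stuck receiver
driver_done.set()                         # let the simulated driver call return
rx.join()
recovered = try_send_update(timeout=0.2)  # lock is free again

print("update delivered while driver stuck:", blocked)
print("update delivered after driver returns:", recovered)
```

In this model only one lock is ever held and no thread is waiting on a second lock, so the stall comes from the blocked call rather than from a lock-ordering deadlock.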

Jeff
______________________________________________________
Jeffrey O. Hill           Email        [email protected]
LANL MS H820              Voice        505 665 1831
Los Alamos NM 87545 USA   FAX          505 665 5107

Message content: TSPA


> -----Original Message-----
> From: Mark Rivers [mailto:[email protected]]
> Sent: Monday, October 25, 2010 10:55 AM
> To: Jeff Hill; tech-talk
> Cc: Antonio Lanzirotti
> Subject: RE: EPICS CA problems
>
> Hi Jeff,
>
> I have some more information on this.  The problem does NOT appear to be a
> problem with caRepeater crashing.  When the client loses connection to the
> IOC the Windows Task Manager shows that caRepeater is still running on the
> IOC PC.  Normally we had been seeing the problem when the client and the
> IOC were running on the same computer.
>
> However, last night we managed to reproduce the problem with the client
> running on a separate PC.
>
> I have attached the output of casr(100) on the IOC when the client has
> lost communication.  The IOC server is 172.16.1.20 (X26A-Control) and the
> client is running on 172.16.1.21 (X26A-Data).
>
> It appears that when this happens the client loses connection to all PVs
> on the server.  But we know for sure that it lost connection to
> X26A:med:Acquiring.
>
> I think I see something suspicious in the output.  Here is the start of
> one block of output from casr for the client machine that has lost
> connection:
>
> TCP 172.16.1.21:1726(X26A-Data): User="X26A User", V4.11, 1755 Channels,
> Priority=0
>         Task Id=0x12a022b0, Socket FD=15
>         Secs since last send   0.02, Secs since last receive   0.02
>         Unprocessed request bytes=0, Undelivered response bytes=0
>         State=up
>         360696 bytes allocated
>         X26A:med:PresetMode(0rw)        X26A:med:ElapsedReal(1rw)
> X26A:med:PresetReal(0rw)
>
> Here is the start of another block for the same client:
>
> TCP 172.16.1.21:1752(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
>         Task Id=0x12ec19d0, Socket FD=16
>         Secs since last send 131528.55, Secs since last receive 140881.60
>         Unprocessed request bytes=8296, Undelivered response bytes=0
>         State=up
>         377424 bytes allocated
>
> Note that there are unprocessed request bytes there.
>
> There is then another block for the same client machine:
>
> TCP 172.16.1.21:2735(X26A-Data): User="X26A User", V4.11, 1837 Channels,
> Priority=0
>         Task Id=0x12ec0a10, Socket FD=19
>         Secs since last send 119.81, Secs since last receive 119.82
>         Unprocessed request bytes=0, Undelivered response bytes=0
>         State=up
>         377424 bytes allocated
>
> There is also a UDP entry for that client machine:
>
> UDP Server:
> UDP 172.16.1.21:2733(): User="", V4.11, 0 Channels, Priority=0
>         Task Id=0x1293c4e0, Socket FD=11
>         Secs since last send 131525.68, Secs since last receive   3.06
>         Unprocessed request bytes=16, Undelivered response bytes=0
>         State=up
>         180 bytes allocated
>
>         Send Lock
>
> I am not sure how to interpret this.
>
> Thanks,
> Mark
>
>
>
>
> -----Original Message-----
> From: Jeff Hill [mailto:[email protected]]
> Sent: Tuesday, October 19, 2010 10:33 AM
> To: Mark Rivers; 'tech-talk'
> Cc: Antonio Lanzirotti
> Subject: RE: EPICS CA problems
>
> Hi Mark,
>
> This is the first I have heard of any issues with the CA repeater crashing.
>
> Is this running under cygwin or mingw? Compiled by ms visual c or gnu?
>
> The stack trace has no symbols so it's hard to determine a cause. If you
> could fire up the relevant debugger and get a stack trace with symbols that
> would help. You might need to build base for debugging. Set HOST_OPT=NO in
> CONFIG_SITE. Also, if you save the debugging session in visual c++ and email
> it to me I might be able to identify the issue.
>
> Jeff
> ______________________________________________________
> Jeffrey O. Hill           Email        [email protected]
> LANL MS H820              Voice        505 665 1831
> Los Alamos NM 87545 USA   FAX          505 665 5107
>
> Message content: TSPA
>
>
> > -----Original Message-----
> > From: Mark Rivers [mailto:[email protected]]
> > Sent: Monday, October 18, 2010 7:58 PM
> > To: tech-talk; Jeff Hill
> > Cc: Antonio Lanzirotti
> > Subject: RE: EPICS CA problems
> >
> > Folks,
> >
> > I learned today that it appears that caRepeater has been crashing on
> > this system.  I don't know for sure that this problem happens when
> > caRepeater has died, but that seems likely.  The next time it happens
> > we will look to see if caRepeater is still running.
> >
> > Meanwhile, we have found that there are caRepeater stackdump files,
> > containing the following:
> >
> > Exception: STATUS_ACCESS_VIOLATION at eip=610B9F69
> > eax=00000000 ebx=00000001 ecx=00000000 edx=0014C6F0 esi=00000000
> > edi=011DCCD8
> > ebp=011DCB14 esp=011DCAEC program=C:\Program Files\EPICS WIN32
> > Extensions\caRepeater.exe, pid 2152, thread unknown (0xC44)
> > cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
> > Stack trace:
> > Frame     Function  Args
> > 011DCB14  610B9F69  (00000000, 00000000, 00000000, 00000000)
> > 011DCC24  610BA905  (00000000, 00000000, 00000000, 00000000)
> > 011DCCE4  610BB67A  (FFFFFFFF, FFFFFFFF, 00000000, 00000000)
> > 011DCD34  61027DE2  (00000002, 011DCE64, 00000002, 011DCE00)
> > 011DCDC8  7C87655C  (00000002, 011DCE00, 7C8763C0, 00000002)
> > End of stack trace
> >
> > Has anyone else seen such stackdumps from caRepeater?  This is the
> > version of caRepeater.exe that is included in the most recent (Nov. 2,
> > 2007) APS "EPICS Win32 Extensions" package.
> >
> > Thanks,
> > Mark
> >
> >
> > ________________________________
> >
> > From: Mark Rivers
> > Sent: Wed 10/13/2010 11:14 AM
> > To: tech-talk; 'Jeff Hill'
> > Cc: Antonio Lanzirotti
> > Subject: EPICS CA problems
> >
> >
> >
> > Folks,
> >
> > We are having trouble with a Windows IOC at NSLS.  Here are the
> > symptoms:
> >
> > - The IOC is running fine
> >
> > - The PC running the IOC has 2 local CA clients connected to the IOC,
> > medm and IDL.  Occasionally (1-2 times per day) one of these clients
> > loses its connection to the IOC.  Medm screens go white, IDL says it
> > cannot find a PV, etc.  This happens when the client was running fine.
> > It typically only happens to one or the other client, not to both.
> >
> > - Restarting the client fixes the problem.
> >
> > - The same 2 clients are running on another PC connected to the same
> > IOC.  Those clients are always fine, they do not lose connection when a
> > client on the PC with the IOC does.
> >
> > - Looking at the resources on the Windows machine (CPU, virtual and
> > physical memory usage) does not indicate any problems.
> >
> > How do we go about figuring out what is wrong?
> >
> > Thanks,
> > Mark
> >
>






ANJ, 13 Nov 2010