Argonne National Laboratory

Experimental Physics and
Industrial Control System

2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  <2017 Index 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  <2017
<== Date ==> <== Thread ==>

Subject: Re: IOC error message on camonitor disconnect
From: Michael Davidsaver <mdavidsaver@gmail.com>
To: freddie.akeroyd@stfc.ac.uk
Cc: EPICS core-talk <core-talk@aps.anl.gov>
Date: Mon, 31 Jul 2017 11:58:19 +0200
Freddie,

Can you try adding

> CASDEBUG = 5;

somewhere in rsrv_init() in caservertask.c and report what you see?  (I
wouldn't be surprised if the assert() no longer fails)


Thanks for these packet captures.  I had thought/hoped there might be a
corrupt (truncated) message from the client, but this can be ruled out.
All of these messages look normal.  It seems like this must be entirely
a software issue on the server/IOC side.

There is an interesting hint involving one of the error messages you see:

>>> CAS: Request from 127.0.0.1:20272 => cmmd=26209 cid=0x150008
>>> type=14592
>>> count=0 postsize=24885

The first 6 bytes of this CA message are cmmd, postsize, and type.

> 6661 6135 3900

While hinting of ASCII, I didn't understand the significance of this
until I noticed in your packet captures that your CA client reports a
username of 'faa59'.

cid=0x150008 is then the beginning of the next message, hostname (id
0x15) with payload length 8.


These issues smell like RSRV incorrectly handling send and receive
buffers.  The assertion failure is detecting an inconsistency on the
send buffer.  This error message, coupled with correct messages in your
packet captures, looks like a mis-handling of the receive buffer.


The failing assertion is checking consistency between of the buffer
(stack) in which new messages are constructed before being sent.
Specifically, that the message size allocated by a call to
cas_copy_in_header() hasn't changed before a later call to cas_commit_msg().


>>> *pMsg
>>>
>>>    m_cmmd=256
>>>
>>>    m_postsize=6144
>>>
>>>    m_dataType=5120
>>>
>>>    m_count = 256
>>>
>>>    m_cid = 16777216
>>>
>>>   m_available = 16777216

0001 0018 0014 0001
0000 0001 0000 0001

This is an event message (id 0x01) count=1 dtype=DBR_FLOAT, which in
your packet captures is the last valid message being sent by the IOC
prior to disconnect (about 2 seconds prior).

The fact that you see this in vsend_err() is a clear error.


This smells like a race where the send/recv buffer "stack" pointers are
being changed unexpectedly, probably zeroed.  However, I don't yet see
where.


Michael


On 07/29/2017 12:42 AM, freddie.akeroyd@stfc.ac.uk wrote:
> Hi Michael,
> 
> I had a few issues with wireshark and my windows network interfaces not seeing local tcp packets, so in the end used RawCap to capture on loopback and then tshark to filter out junk
> 
> RawCap.exe -f 127.0.0.1 test1.pcap
> "c:\Program Files\Wireshark\tshark.exe" -Y "ip.src == 127.0.0.1 and (tcp.port == 5064 or tcp.port == 5065 or udp.port == 5064 or udp.port == 5065)" -r test1.pcap -w test1c.pcap
> 
> I was running an Ioc binding to all interfaces in one window, in another started camonitor with EPICS_CA_ADDR_LIST=127.0.0.1 and EPICS_CA_AUTO_ADDR_LIST=NO
> 
> The enclosed test1 covers IOC startup, attaching camonitor, ctrl-c camonitor
> 
> Test2 covers multiple connect and kills of camonitor with the ioc left running inbetween, on the first connect here it just printed "protocol mismatch" and gave no traceback but subsequent connect/kills gave tracebacks
> 
> Regards,
> 
> Freddie
> 
> -----Original Message-----
> From: Michael Davidsaver [mailto:mdavidsaver@gmail.com] 
> Sent: 28 July 2017 18:22
> To: Akeroyd, Freddie (STFC,RAL,ISIS)
> Subject: Re: IOC error message on camonitor disconnect
> 
> Since you can trigger this so easily, the best information you could provide would be a packet capture with wireshark covering at least one complete run of 'camonitor'.
> 
> FYI, I would take a look at this capture with:
> 
> https://github.com/mdavidsaver/cashark
> 
> 
> On 07/28/2017 07:07 PM, freddie.akeroyd@stfc.ac.uk wrote:
>> Hi Michael,
>>
>> I put a breakpoint in camessage where "misssaligned protocol" is 
>> initially detected and if I step a few lines to delay that thread 
>> before continuing, I always just get the "force disconnect message" in 
>> the IOC and no assert; if I immediately continue from the breakpoint I 
>> always get an assert failure instead, though I've noticed that 
>> sometimes the assert backtrace is to read_reply() ... event_read() 
>> rather than to send_err() ... camessage() as shown below. Let me know 
>> if there is anything that you would like me to try/examine
>>
>> Regards,
>>
>> Freddie
>>
>> -----Original Message-----
>> From: Michael Davidsaver [mailto:mdavidsaver@gmail.com]
>> Sent: 27 July 2017 11:01
>> To: Akeroyd, Freddie (STFC,RAL,ISIS); tech-talk@aps.anl.gov
>> Subject: Re: IOC error message on camonitor disconnect
>>
>> This looks like it is related to:
>>
>> https://bugs.launchpad.net/epics-base/+bug/541330
>>
>> Although I suspect there is more than one bug involved.
>>
>>
>> On 07/26/2017 07:58 PM, freddie.akeroyd@stfc.ac.uk wrote:
>>> Apologies, forgot to include the stack trace:
>>>
>>>  
>>>
>>>                 dbCore.dll!cas_commit_msg(client * pClient, unsigned int
>>> size) Line 357               C
>>>
>>>                dbCore.dll!vsend_err(const caHdrLargeArray * curp, int 
>>> status, client * client, const char * pformat, char * args) Line
>>> 245                C
>>>
>>>                dbCore.dll!send_err(const caHdrLargeArray * curp, int 
>>> status, client * client, const char * pformat, ...) Line 264  C
>>>
>>>                dbCore.dll!camessage(client * client) Line 2571 C
>>>
>>>                 dbCore.dll!camsgtask(void * pParm) Line 129     C
>>>
>>>                Com.dll!epicsWin32ThreadEntry(void * lpParameter) Line
>>> 501    C
>>>
>>>  
>>>
>>> Regards,
>>>
>>>  
>>>
>>> Freddie
>>>
>>>  
>>>
>>> *From:*tech-talk-bounces@aps.anl.gov
>>> [mailto:tech-talk-bounces@aps.anl.gov] *On Behalf Of 
>>> *freddie.akeroyd@stfc.ac.uk
>>> *Sent:* 26 July 2017 18:25
>>> *To:* tech-talk@aps.anl.gov
>>> *Subject:* IOC error message on camonitor disconnect
>>>
>>>  
>>>
>>> Hi,
>>>
>>>  
>>>
>>> I am seeing this using EPICS base 3.15.5 running on windows x64. I 
>>> start softIOC.exe and then repeatedly connect to a PV with camonitor 
>>> and then CTRL-C the camonitor, each time I see either:
>>>
>>>  
>>>
>>> epics> CAS: request from 127.0.0.1:20272 => CAS: Missaligned protocol
>>> rejected
>>>
>>> CAS: Request from 127.0.0.1:20272 => cmmd=26209 cid=0x150008
>>> type=14592
>>> count=0 postsize=24885
>>>
>>> CAS: Request from 127.0.0.1:20272 =>   available=0x0    N=1
>>> paddr=0000000000000000
>>>
>>> CAS: forcing disconnect from 127.0.0.1:20272
>>>
>>> epics>
>>>
>>>  
>>>
>>> or
>>>
>>>  
>>>
>>> A call to 'assert(size <= ntohs ( pMsg->m_postsize ))'
>>>
>>>     by thread 'CAS-client' failed in 
>>> ../../../src/ioc/rsrv/caserverio.c
>>> line 357.
>>>
>>> Dumping a stack trace of thread 'CAS-client':
>>>
>>> [  000007FEEE71D565]
>>>
>>> [  000007FEEE71D206]
>>>
>>> [  000007FEEE714EF8]
>>>
>>> [  000007FEED1A6AD5]
>>>
>>> [  000007FEED1AC21B]
>>>
>>> [  000007FEED1AC29B]
>>>
>>> [  000007FEED1AAF1C]
>>>
>>> [  000007FEED1AAAE5]
>>>
>>> [  000007FEEE7177F1]
>>>
>>> [  000007FECEB3B0A8]
>>>
>>> [  000007FECEB3AD11]
>>>
>>> [  00000000771459CD]
>>>
>>> [  000000007727A561]
>>>
>>> EPICS Release EPICS R3.15.5.
>>>
>>> Local time is 2017-07-26 16:38:50.636232257 GMT Daylight Time
>>>
>>> Please E-mail this message to the author or to tech-talk@aps.anl.gov 
>>> <mailto:tech-talk@aps.anl.gov>
>>>
>>> Calling epicsThreadSuspendSelf()
>>>
>>> Thread CAS-client (000000000281A040) suspended
>>>
>>>  
>>>
>>> It looks similar to the message reported in 
>>> http://www.aps.anl.gov/epics/tech-talk/2016/msg01828.php
>>>
>>>  
>>>
>>> If I break into the program at the point of the assert I see the 
>>> following details:
>>>
>>>  
>>>
>>> size = 56
>>>
>>>  
>>>
>>> *pMsg
>>>
>>>    m_cmmd=256
>>>
>>>    m_postsize=6144
>>>
>>>    m_dataType=5120
>>>
>>>    m_count = 256
>>>
>>>    m_cid = 16777216
>>>
>>>   m_available = 16777216
>>>
>>>  
>>>
>>> We only recently updated to 3.15.5 and I didn't remember seeing such 
>>> messages before, so I connected to a machine still running 3.14 and 
>>> confirmed that I don't get any errors reported for the same test on 
>>> 3.14.12.6.
>>>
>>>  
>>>
>>> Regards,
>>>
>>>  
>>>
>>> Freddie
>>>
>>>  
>>>
>>
> 


Replies:
Re: IOC error message on camonitor disconnect Michael Davidsaver

Navigate by Date:
Prev: Re: Q: Simulation mode issues Andrew Johnson
Next: Re: IOC error message on camonitor disconnect Michael Davidsaver
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  <2017
Navigate by Thread:
Prev: Re: Q: Simulation mode issues Andrew Johnson
Next: Re: IOC error message on camonitor disconnect Michael Davidsaver
Index: 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  <2017
ANJ, 31 Jul 2017 Valid HTML 4.01! · Home · News · About · Base · Modules · Extensions · Distributions · Download ·
· EPICS V4 · IRMIS · Talk · Bugs · Documents · Links · Licensing ·