Folks,
We are having a problem with sudden loss of mbufs on vxWorks 5.5.2 on an MVME5100. We have not seen the problem on MVME2700 or MVME2100 systems.
In the past when mbufs got low I could always explain the problem by finding a client that had stopped processing the channel access monitors to which it had subscribed. That could be found by doing "inetstatShow" on the vxWorks system and looking for connections with large numbers of bytes in the Send-Q. However, that is not the case now; inetstatShow does not indicate any problem.
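For reference, the check I used to do by eye can be scripted. Here is a minimal host-side sketch (Python) that flags connections with a large Send-Q; it assumes the inetstatShow output has been captured to a text file, and the filename and 1000-byte threshold are just illustrative choices:

#!/usr/bin/env python
# Sketch: flag inetstatShow connections with a large Send-Q.
# Assumes the shell output was captured to a file; the filename and the
# 1000-byte threshold are arbitrary choices for illustration.

THRESHOLD = 1000  # bytes queued in Send-Q before we consider a client "stuck"

def stuck_connections(path="inetstat.txt"):
    stuck = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Data lines look like:
            # PCB Proto Recv-Q Send-Q Local-Address Foreign-Address (state)
            if len(fields) >= 6 and fields[1] in ("TCP", "UDP"):
                try:
                    send_q = int(fields[3])
                except ValueError:
                    continue
                if send_q > THRESHOLD:
                    stuck.append((fields[5], send_q))
    return stuck

if __name__ == "__main__":
    for peer, send_q in stuck_connections():
        print("possible stuck client %s, Send-Q=%d bytes" % (peer, send_q))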
The symptom is that a large number (>150) of 128-byte mbufs are consumed "instantly", i.e. within the 10-second time resolution of our archiver. The IOC will typically run for about 24 hours before this happens. Before it happens there are fewer than 10 128-byte mbufs in use. We can find no correlation with any IOC activity (new CA connections, CPU load, etc.) when it happens.
Here is the archive data for the vxStats minDataMBuf PV when an incident occurs.
# 13IDA::minDataMBuf.VAL (double)
# requested time_span: [20121125 190000 , 20121125 200000]
# actual time_span: [20120827 102758 , 20121125 200013]
# n_points: 176
#-------------------------------
# date time value
20120827 102758 1346081278.156 99.0
20121125 190333 1353891813.104 96.0
20121125 190343 1353891823.104 99.0
20121125 191143 1353892303.108 94.0
20121125 191153 1353892313.108 99.0
20121125 192403 1353893043.114 98.0
20121125 192413 1353893053.114 99.0
20121125 192633 1353893193.115 53.0
20121125 192653 1353893213.115 54.0
20121125 192713 1353893233.120 53.0
20121125 192723 1353893243.116 54.0
20121125 192733 1353893253.116 51.0
20121125 192743 1353893263.116 55.0
20121125 192753 1353893273.116 53.0
20121125 192803 1353893283.116 55.0
20121125 192813 1353893293.116 54.0
20121125 192833 1353893313.116 53.0
20121125 192843 1353893323.117 54.0
20121125 192853 1353893333.117 56.0
20121125 192903 1353893343.117 53.0
20121125 192913 1353893353.117 55.0
20121125 192923 1353893363.117 28.0
20121125 192933 1353893373.117 56.0
20121125 192943 1353893383.117 54.0
20121125 192953 1353893393.117 53.0
20121125 193023 1353893423.117 58.0
20121125 193033 1353893433.117 53.0
20121125 193043 1353893443.117 55.0
20121125 193053 1353893453.118 60.0
20121125 193103 1353893463.117 58.0
20121125 193113 1353893473.118 53.0
20121125 193123 1353893483.118 36.0
20121125 193133 1353893493.118 57.0
20121125 193143 1353893503.118 55.0
20121125 193153 1353893513.118 44.0
20121125 193203 1353893523.118 54.0
20121125 193223 1353893543.122 0.0
20121125 193233 1353893553.119 55.0
20121125 193243 1353893563.119 54.0
Note that minDataMBuf went from 99% to 53% within the 10-second time resolution of the archiver.
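Just to show how the incident was spotted in the export above: a short sketch that scans the archiver dump for a large drop between consecutive samples (the filename and the 20-point threshold are only examples):

#!/usr/bin/env python
# Sketch: scan an archiver export of minDataMBuf for a sudden drop between
# consecutive samples. Filename and the 20-point drop threshold are examples.

def find_drops(path="minDataMBuf.txt", threshold=20.0):
    prev = None
    drops = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            # Each data line is: date time epoch-seconds value
            date, time_, _secs, value = line.split()
            value = float(value)
            if prev is not None and prev - value > threshold:
                drops.append((date, time_, prev, value))
            prev = value
    return drops

if __name__ == "__main__":
    for date, time_, before, after in find_drops():
        print("%s %s: minDataMBuf dropped from %.1f to %.1f" %
              (date, time_, before, after))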
After the incident happens the mbufs look like this:
ioc13ida> mbufShow
type number
--------- ------
FREE : 1110
DATA : 112
HEADER : 128
SOCKET : 0
PCB : 0
RTABLE : 0
HTABLE : 0
ATABLE : 0
SONAME : 0
ZOMBIE : 0
SOOPTS : 0
FTABLE : 0
RIGHTS : 0
IFADDR : 0
CONTROL : 0
OOBDATA : 0
IPMOPTS : 0
IPMADDR : 0
IFMADDR : 0
MRTABLE : 0
TOTAL : 1350
number of mbufs: 1350
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0
__________________
CLUSTER POOL TABLE
_______________________________________________________________________________
size clusters free usage
-------------------------------------------------------------------------------
64 250 222 14342266
128 400 213 123390266
256 50 41 7517122
512 50 50 2204546
1024 25 25 29529
2048 125 125 3085861
-------------------------------------------------------------------------------
Occasionally the 256-byte mbufs will get temporarily depleted, which is what caused the value of 0 in the archive output. All of the mbuf clusters except the 128-byte ones go back to zero usage as we close CA clients. However, ~150 of the 128-byte mbufs appear to be permanently "missing". I think many of these may be accounted for in the HEADER count (128), which does not decrease as CA connections are closed.
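To see which pools the "missing" mbufs end up in, one can compare mbufShow output before and after an incident. A rough sketch of that comparison, assuming the two outputs were captured to files (the filenames are hypothetical):

#!/usr/bin/env python
# Sketch: diff two captured mbufShow outputs (e.g. before/after an incident)
# and report which mbuf type counts changed. Filenames are hypothetical.

def parse_mbuf_show(path):
    counts = {}
    with open(path) as f:
        for line in f:
            # Type lines look like "HEADER    :    128"
            name, sep, rest = line.partition(":")
            name = name.strip()
            fields = rest.split()
            if sep and name.isupper() and fields and fields[0].isdigit():
                counts[name] = int(fields[0])
    return counts

def diff(before_path="mbufshow_before.txt", after_path="mbufshow_after.txt"):
    before = parse_mbuf_show(before_path)
    after = parse_mbuf_show(after_path)
    for name in sorted(set(before) | set(after)):
        delta = after.get(name, 0) - before.get(name, 0)
        if delta:
            print("%-8s %+d (from %d to %d)" %
                  (name, delta, before.get(name, 0), after.get(name, 0)))

if __name__ == "__main__":
    diff()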
The system continues to run when this happens. We have not seen a second "incident" with further mbuf depletion before the system is rebooted, which we typically do within a week.
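One way to pin the next incident down more precisely than the archiver's 10-second resolution would be to monitor the vxStats PV directly over Channel Access and log the timestamp of any sudden drop. A minimal sketch using pyepics (the PV name is taken from the archive export header above, and the 20-point drop threshold is an assumption):

#!/usr/bin/env python
# Sketch: monitor the vxStats free-mbuf PV over Channel Access and log the
# timestamp of any sudden drop. Requires pyepics; the drop threshold is an
# illustrative assumption.
import time
from epics import PV

PVNAME = "13IDA::minDataMBuf"  # name as it appears in the archive export; adjust if needed
DROP = 20.0                    # how large a fall between updates to report

last = {"value": None}

def on_change(pvname=None, value=None, **kw):
    prev = last["value"]
    if prev is not None and value is not None and prev - value > DROP:
        print("%s  %s fell from %.1f to %.1f" %
              (time.strftime("%Y-%m-%d %H:%M:%S"), pvname, prev, value))
    last["value"] = value

pv = PV(PVNAME, callback=on_change)

while True:          # keep the process alive so CA monitor updates keep arriving
    time.sleep(1.0)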
This is the output of inetstatShow under the above conditions. It does not seem to indicate any problem.
ioc13ida> inetstatShow
Active Internet connections (including servers)
PCB Proto Recv-Q Send-Q Local Address Foreign Address (state)
-------- ----- ------ ------ ------------------ ------------------ -------
1ef19ae4 TCP 0 40 164.54.160.75.5064 164.54.160.61.53360 ESTABLISHED
1ef1874c TCP 0 0 164.54.160.75.5064 164.54.160.116.55939 ESTABLISHED
1ef19a60 TCP 0 0 164.54.160.75.5064 164.54.160.111.44241 ESTABLISHED
1ef19b68 TCP 0 0 164.54.160.75.5064 164.54.160.111.44071 ESTABLISHED
1ef185c0 TCP 0 0 164.54.160.75.5064 164.54.160.111.43901 ESTABLISHED
1ef19bec TCP 0 0 164.54.160.75.5064 164.54.160.111.42449 ESTABLISHED
1ef19118 TCP 0 0 164.54.160.75.5064 164.54.160.111.42182 ESTABLISHED
1ef18854 TCP 0 0 164.54.160.75.5064 164.54.160.46.3743 ESTABLISHED
1ef186c8 TCP 0 0 164.54.160.75.5064 164.54.160.82.51091 ESTABLISHED
1ef188d8 TCP 0 0 164.54.160.75.5064 164.54.160.24.49949 ESTABLISHED
1ef194b4 TCP 0 0 164.54.160.75.1041 164.54.160.94.5064 ESTABLISHED
1ef19748 TCP 0 0 164.54.160.75.5064 164.54.160.94.1028 ESTABLISHED
1ef1853c TCP 0 0 164.54.160.75.1040 164.54.160.94.5064 ESTABLISHED
1ef18224 TCP 0 0 164.54.160.75.1039 164.54.160.94.5064 ESTABLISHED
1ef19538 TCP 0 0 164.54.160.75.5064 164.54.160.94.1026 ESTABLISHED
1ef18f08 TCP 0 0 164.54.160.75.5064 164.54.160.116.64274 ESTABLISHED
1ef1895c TCP 0 0 164.54.160.75.5064 164.54.160.86.51123 ESTABLISHED
1ef199dc TCP 0 0 164.54.160.75.5064 164.54.160.42.49774 ESTABLISHED
1ef18e84 TCP 0 0 164.54.160.75.1038 164.54.160.103.5064 ESTABLISHED
1ef184b8 TCP 0 0 164.54.160.75.1037 164.54.160.103.5064 ESTABLISHED
1ef1832c TCP 0 0 164.54.160.75.5064 164.54.160.103.1027 ESTABLISHED
1ef19220 TCP 0 0 164.54.160.75.5064 164.54.160.93.60439 ESTABLISHED
1ef189e0 TCP 0 0 164.54.160.75.5064 164.54.160.93.60049 ESTABLISHED
1ef19850 TCP 0 0 164.54.160.75.5064 164.54.160.74.1121 ESTABLISHED
1ef197cc TCP 0 0 164.54.160.75.5064 164.54.160.82.56917 ESTABLISHED
1ef192a4 TCP 0 0 164.54.160.75.5064 164.54.160.82.56912 ESTABLISHED
1ef18e00 TCP 0 0 164.54.160.75.5064 164.54.160.100.4756 ESTABLISHED
1ef18cf8 TCP 0 0 164.54.160.75.1034 164.54.160.12.5064 ESTABLISHED
1ef18c74 TCP 0 0 164.54.160.75.1033 164.54.160.82.44628 ESTABLISHED
1ef18ae8 TCP 0 0 164.54.160.75.5064 164.54.160.82.56909 ESTABLISHED
1ef18644 TCP 0 0 164.54.160.75.5064 164.54.160.111.57929 ESTABLISHED
1ef182a8 TCP 0 0 164.54.160.75.1029 164.54.160.99.5064 ESTABLISHED
1ef181a0 TCP 0 0 164.54.160.75.1027 164.54.160.82.44628 ESTABLISHED
1ef1811c TCP 0 0 164.54.160.75.1026 164.54.160.12.5064 ESTABLISHED
1ef17f0c TCP 0 0 0.0.0.0.5064 0.0.0.0.0 LISTEN
1ef17c78 TCP 0 0 164.54.160.75.1025 164.54.160.11.10001 ESTABLISHED
1ef17a68 TCP 0 0 0.0.0.0.23 0.0.0.0.0 LISTEN
1ef196c4 UDP 0 0 127.0.0.1.1045 127.0.0.1.1038
1ef19010 UDP 0 0 0.0.0.0.976 0.0.0.0.0
1ef17d80 UDP 0 0 0.0.0.0.1044 0.0.0.0.0
1ef19640 UDP 0 0 127.0.0.1.1043 127.0.0.1.1036
1ef195bc UDP 0 0 127.0.0.1.1042 127.0.0.1.1035
1ef19430 UDP 0 0 164.54.160.75.1041 0.0.0.0.0
1ef193ac UDP 0 0 127.0.0.1.1040 127.0.0.1.1029
1ef19328 UDP 0 0 127.0.0.1.1039 127.0.0.1.1028
1ef19094 UDP 0 0 0.0.0.0.1038 0.0.0.0.0
1ef18f8c UDP 0 0 0.0.0.0.1037 0.0.0.0.0
1ef18d7c UDP 0 0 0.0.0.0.984 0.0.0.0.0
1ef18bf0 UDP 0 0 0.0.0.0.1036 0.0.0.0.0
1ef18b6c UDP 0 0 0.0.0.0.1035 0.0.0.0.0
1ef18a64 UDP 0 0 0.0.0.0.1034 0.0.0.0.0
1ef187d0 UDP 0 0 0.0.0.0.5065 0.0.0.0.0
1ef18434 UDP 0 0 0.0.0.0.1030 0.0.0.0.0
1ef183b0 UDP 0 0 0.0.0.0.1029 0.0.0.0.0
1ef18098 UDP 0 0 0.0.0.0.1028 0.0.0.0.0
1ef18014 UDP 0 0 0.0.0.0.5064 0.0.0.0.0
1ef17f90 UDP 0 0 164.54.160.75.1031 164.54.160.255.5065
1ef17e88 UDP 0 0 0.0.0.0.1027 0.0.0.0.0
1ef17e04 UDP 0 0 0.0.0.0.1026 0.0.0.0.0
1ef17cfc UDP 0 0 0.0.0.0.1025 0.0.0.0.0
value = 1 = 0x1
This is the output of casr:
ioc13ida> casr
Channel Access Server V4.13
Connected circuits:
TCP 164.54.160.111:57929(millenia.cars.aps.anl.gov): User="root", V4.13, 972 Channels, Priority=0
TCP 164.54.160.82:56909(corvette.cars.aps.anl.gov): User="epics", V4.13, 333 Channels, Priority=0
TCP 164.54.160.100:4756(miata): User="lvp_user", V4.11, 288 Channels, Priority=0
TCP 164.54.160.82:56912(corvette.cars.aps.anl.gov): User="epics", V4.13, 2 Channels, Priority=0
TCP 164.54.160.82:56917(corvette.cars.aps.anl.gov): User="epics", V4.13, 12 Channels, Priority=80
TCP 164.54.160.74:1121(ioc13bma): User="iocboot", V4.13, 1 Channels, Priority=80
TCP 164.54.160.93:60049(Ferrari): User="XAS_User", V4.13, 53 Channels, Priority=0
TCP 164.54.160.93:60439(Ferrari): User="XAS_User", V4.13, 260 Channels, Priority=0
TCP 164.54.160.103:1027(ioc13idd): User="iocboot", V4.13, 3 Channels, Priority=80
TCP 164.54.160.42:49774(Volt): User="xas_user", V4.13, 1 Channels, Priority=0
TCP 164.54.160.86:51123(vincent): User="epics", V4.13, 7 Channels, Priority=0
TCP 164.54.160.116:64274(Charger): User="gpd_user", V4.13, 17 Channels, Priority=0
TCP 164.54.160.94:1026(ioc13idc): User="iocboot", V4.13, 3 Channels, Priority=10
TCP 164.54.160.94:1028(ioc13idc): User="iocboot", V4.13, 4 Channels, Priority=0
TCP 164.54.160.24:49949(GT500): User="gpd_user", V4.13, 9 Channels, Priority=0
TCP 164.54.160.82:51091(corvette.cars.aps.anl.gov): User="gpd_user", V4.13, 3 Channels, Priority=10
TCP 164.54.160.46:3743(ringoTABLETPC): User="epics", V4.8, 9 Channels, Priority=0
TCP 164.54.160.111:42182(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:42449(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:43901(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:44071(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:44241(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.116:55939(Charger): User="gpd_user", V4.13, 46 Channels, Priority=0
TCP 164.54.160.61:53360(Patriot): User="gpd_user", V4.13, 9 Channels, Priority=0
Has anyone else seen a problem like this, or have any suggestions on how to track it down?
Thanks,
Mark