Folks,
We are having a problem with sudden loss of mbufs on vxWorks 5.5.2 on an MVME5100. We have not seen the problem on MVME2700 or MVME2100 systems.
In the past when mbufs got low I could always explain the problem by finding a client that had stopped processing the channel access monitors to which it had subscribed. That could be found by doing "inetstatShow" on the vxWorks system and looking for connections with large numbers of bytes in the Send-Q. However, that is not the case now; inetstatShow does not indicate any problem.
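For reference, the check I used to do by eye can be scripted. Here is a minimal host-side sketch (Python) that flags connections with a large Send-Q; it assumes the inetstatShow output has been captured to a text file, and the filename and 1000-byte threshold are just illustrative choices:

#!/usr/bin/env python
# Sketch: flag inetstatShow connections with a large Send-Q.
# Assumes the shell output was captured to a file; the filename and the
# 1000-byte threshold are arbitrary choices for illustration.

THRESHOLD = 1000  # bytes queued in Send-Q before we consider a client "stuck"

def stuck_connections(path="inetstat.txt"):
    stuck = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Data lines look like:
            # PCB Proto Recv-Q Send-Q Local-Address Foreign-Address (state)
            if len(fields) >= 6 and fields[1] in ("TCP", "UDP"):
                try:
                    send_q = int(fields[3])
                except ValueError:
                    continue
                if send_q > THRESHOLD:
                    stuck.append((fields[5], send_q))
    return stuck

if __name__ == "__main__":
    for peer, send_q in stuck_connections():
        print("possible stuck client %s, Send-Q=%d bytes" % (peer, send_q))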
The symptom is that a large number (>150) of 128-byte mbufs are consumed "instantly", i.e. within the 10-second time resolution of our archiver. The IOC will typically run for about 24 hours before this happens. Before it happens there are fewer than 10 128-byte mbufs in use. We can find no correlation with any IOC activity (new CA connections, CPU load, etc.) when it happens.
Here is the archive data for the vxStats minDataMBuf PV when an incident occurs.
# 13IDA::minDataMBuf.VAL (double)
# requested time_span: [20121125 190000 , 20121125 200000]
# actual time_span: [20120827 102758 , 20121125 200013]
# n_points: 176
#-------------------------------
# date time value
20120827 102758 1346081278.156 99.0
20121125 190333 1353891813.104 96.0
20121125 190343 1353891823.104 99.0
20121125 191143 1353892303.108 94.0
20121125 191153 1353892313.108 99.0
20121125 192403 1353893043.114 98.0
20121125 192413 1353893053.114 99.0
20121125 192633 1353893193.115 53.0
20121125 192653 1353893213.115 54.0
20121125 192713 1353893233.120 53.0
20121125 192723 1353893243.116 54.0
20121125 192733 1353893253.116 51.0
20121125 192743 1353893263.116 55.0
20121125 192753 1353893273.116 53.0
20121125 192803 1353893283.116 55.0
20121125 192813 1353893293.116 54.0
20121125 192833 1353893313.116 53.0
20121125 192843 1353893323.117 54.0
20121125 192853 1353893333.117 56.0
20121125 192903 1353893343.117 53.0
20121125 192913 1353893353.117 55.0
20121125 192923 1353893363.117 28.0
20121125 192933 1353893373.117 56.0
20121125 192943 1353893383.117 54.0
20121125 192953 1353893393.117 53.0
20121125 193023 1353893423.117 58.0
20121125 193033 1353893433.117 53.0
20121125 193043 1353893443.117 55.0
20121125 193053 1353893453.118 60.0
20121125 193103 1353893463.117 58.0
20121125 193113 1353893473.118 53.0
20121125 193123 1353893483.118 36.0
20121125 193133 1353893493.118 57.0
20121125 193143 1353893503.118 55.0
20121125 193153 1353893513.118 44.0
20121125 193203 1353893523.118 54.0
20121125 193223 1353893543.122 0.0
20121125 193233 1353893553.119 55.0
20121125 193243 1353893563.119 54.0
Note that minDataMBuf went from 99% to 53% within the 10-second time resolution of the archiver.
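Just to show how the incident was spotted in the export above: a short sketch that scans the archiver dump for a large drop between consecutive samples (the filename and the 20-point threshold are only examples):

#!/usr/bin/env python
# Sketch: scan an archiver export of minDataMBuf for a sudden drop between
# consecutive samples. Filename and the 20-point drop threshold are examples.

def find_drops(path="minDataMBuf.txt", threshold=20.0):
    prev = None
    drops = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            # Each data line is: date time epoch-seconds value
            date, time_, _secs, value = line.split()
            value = float(value)
            if prev is not None and prev - value > threshold:
                drops.append((date, time_, prev, value))
            prev = value
    return drops

if __name__ == "__main__":
    for date, time_, before, after in find_drops():
        print("%s %s: minDataMBuf dropped from %.1f to %.1f" %
              (date, time_, before, after))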
After the incident happens the mbufs look like this:
ioc13ida> mbufShow
type number
--------- ------
FREE : 1110
DATA : 112
HEADER : 128
SOCKET : 0
PCB : 0
RTABLE : 0
HTABLE : 0
ATABLE : 0
SONAME : 0
ZOMBIE : 0
SOOPTS : 0
FTABLE : 0
RIGHTS : 0
IFADDR : 0
CONTROL : 0
OOBDATA : 0
IPMOPTS : 0
IPMADDR : 0
IFMADDR : 0
MRTABLE : 0
TOTAL : 1350
number of mbufs: 1350
number of times failed to find space: 0
number of times waited for space: 0
number of times drained protocols for space: 0
__________________
CLUSTER POOL TABLE
_______________________________________________________________________________
size clusters free usage
-------------------------------------------------------------------------------
64 250 222 14342266
128 400 213 123390266
256 50 41 7517122
512 50 50 2204546
1024 25 25 29529
2048 125 125 3085861
-------------------------------------------------------------------------------
Occasionally the 256-byte mbufs will get temporarily depleted, which is what caused the value of 0 in the archive output. All of the mbuf clusters except the 128-byte ones go back to zero usage as we close CA clients. However, ~150 of the 128-byte mbufs appear to be permanently "missing". I think many of these may be accounted for in the HEADER count (128), which does not decrease as CA connections are closed.
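To see which pools the "missing" mbufs end up in, one can compare mbufShow output before and after an incident. A rough sketch of that comparison, assuming the two outputs were captured to files (the filenames are hypothetical):

#!/usr/bin/env python
# Sketch: diff two captured mbufShow outputs (e.g. before/after an incident)
# and report which mbuf type counts changed. Filenames are hypothetical.

def parse_mbuf_show(path):
    counts = {}
    with open(path) as f:
        for line in f:
            # Type lines look like "HEADER    :    128"
            name, sep, rest = line.partition(":")
            name = name.strip()
            fields = rest.split()
            if sep and name.isupper() and fields and fields[0].isdigit():
                counts[name] = int(fields[0])
    return counts

def diff(before_path="mbufshow_before.txt", after_path="mbufshow_after.txt"):
    before = parse_mbuf_show(before_path)
    after = parse_mbuf_show(after_path)
    for name in sorted(set(before) | set(after)):
        delta = after.get(name, 0) - before.get(name, 0)
        if delta:
            print("%-8s %+d (from %d to %d)" %
                  (name, delta, before.get(name, 0), after.get(name, 0)))

if __name__ == "__main__":
    diff()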
The system continues to run when this happens. We have not seen a second "incident" with further mbuf depletion before the system is rebooted, which we typically do within a week.
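One way to pin the next incident down more precisely than the archiver's 10-second resolution would be to monitor the vxStats PV directly over Channel Access and log the timestamp of any sudden drop. A minimal sketch using pyepics (the PV name is taken from the archive export header above, and the 20-point drop threshold is an assumption):

#!/usr/bin/env python
# Sketch: monitor the vxStats free-mbuf PV over Channel Access and log the
# timestamp of any sudden drop. Requires pyepics; the drop threshold is an
# illustrative assumption.
import time
from epics import PV

PVNAME = "13IDA::minDataMBuf"  # name as it appears in the archive export; adjust if needed
DROP = 20.0                    # how large a fall between updates to report

last = {"value": None}

def on_change(pvname=None, value=None, **kw):
    prev = last["value"]
    if prev is not None and value is not None and prev - value > DROP:
        print("%s  %s fell from %.1f to %.1f" %
              (time.strftime("%Y-%m-%d %H:%M:%S"), pvname, prev, value))
    last["value"] = value

pv = PV(PVNAME, callback=on_change)

while True:          # keep the process alive so CA monitor updates keep arriving
    time.sleep(1.0)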
This is the output of inetstatShow under the above conditions. It does not seem to indicate any problem.
ioc13ida> inetstatShow
Active Internet connections (including servers)
PCB Proto Recv-Q Send-Q Local Address Foreign Address (state)
-------- ----- ------ ------ ------------------ ------------------ -------
1ef19ae4 TCP 0 40 164.54.160.75.5064 164.54.160.61.53360 ESTABLISHED
1ef1874c TCP 0 0 164.54.160.75.5064 164.54.160.116.55939 ESTABLISHED
1ef19a60 TCP 0 0 164.54.160.75.5064 164.54.160.111.44241 ESTABLISHED
1ef19b68 TCP 0 0 164.54.160.75.5064 164.54.160.111.44071 ESTABLISHED
1ef185c0 TCP 0 0 164.54.160.75.5064 164.54.160.111.43901 ESTABLISHED
1ef19bec TCP 0 0 164.54.160.75.5064 164.54.160.111.42449 ESTABLISHED
1ef19118 TCP 0 0 164.54.160.75.5064 164.54.160.111.42182 ESTABLISHED
1ef18854 TCP 0 0 164.54.160.75.5064 164.54.160.46.3743 ESTABLISHED
1ef186c8 TCP 0 0 164.54.160.75.5064 164.54.160.82.51091 ESTABLISHED
1ef188d8 TCP 0 0 164.54.160.75.5064 164.54.160.24.49949 ESTABLISHED
1ef194b4 TCP 0 0 164.54.160.75.1041 164.54.160.94.5064 ESTABLISHED
1ef19748 TCP 0 0 164.54.160.75.5064 164.54.160.94.1028 ESTABLISHED
1ef1853c TCP 0 0 164.54.160.75.1040 164.54.160.94.5064 ESTABLISHED
1ef18224 TCP 0 0 164.54.160.75.1039 164.54.160.94.5064 ESTABLISHED
1ef19538 TCP 0 0 164.54.160.75.5064 164.54.160.94.1026 ESTABLISHED
1ef18f08 TCP 0 0 164.54.160.75.5064 164.54.160.116.64274 ESTABLISHED
1ef1895c TCP 0 0 164.54.160.75.5064 164.54.160.86.51123 ESTABLISHED
1ef199dc TCP 0 0 164.54.160.75.5064 164.54.160.42.49774 ESTABLISHED
1ef18e84 TCP 0 0 164.54.160.75.1038 164.54.160.103.5064 ESTABLISHED
1ef184b8 TCP 0 0 164.54.160.75.1037 164.54.160.103.5064 ESTABLISHED
1ef1832c TCP 0 0 164.54.160.75.5064 164.54.160.103.1027 ESTABLISHED
1ef19220 TCP 0 0 164.54.160.75.5064 164.54.160.93.60439 ESTABLISHED
1ef189e0 TCP 0 0 164.54.160.75.5064 164.54.160.93.60049 ESTABLISHED
1ef19850 TCP 0 0 164.54.160.75.5064 164.54.160.74.1121 ESTABLISHED
1ef197cc TCP 0 0 164.54.160.75.5064 164.54.160.82.56917 ESTABLISHED
1ef192a4 TCP 0 0 164.54.160.75.5064 164.54.160.82.56912 ESTABLISHED
1ef18e00 TCP 0 0 164.54.160.75.5064 164.54.160.100.4756 ESTABLISHED
1ef18cf8 TCP 0 0 164.54.160.75.1034 164.54.160.12.5064 ESTABLISHED
1ef18c74 TCP 0 0 164.54.160.75.1033 164.54.160.82.44628 ESTABLISHED
1ef18ae8 TCP 0 0 164.54.160.75.5064 164.54.160.82.56909 ESTABLISHED
1ef18644 TCP 0 0 164.54.160.75.5064 164.54.160.111.57929 ESTABLISHED
1ef182a8 TCP 0 0 164.54.160.75.1029 164.54.160.99.5064 ESTABLISHED
1ef181a0 TCP 0 0 164.54.160.75.1027 164.54.160.82.44628 ESTABLISHED
1ef1811c TCP 0 0 164.54.160.75.1026 164.54.160.12.5064 ESTABLISHED
1ef17f0c TCP 0 0 0.0.0.0.5064 0.0.0.0.0 LISTEN
1ef17c78 TCP 0 0 164.54.160.75.1025 164.54.160.11.10001 ESTABLISHED
1ef17a68 TCP 0 0 0.0.0.0.23 0.0.0.0.0 LISTEN
1ef196c4 UDP 0 0 127.0.0.1.1045 127.0.0.1.1038
1ef19010 UDP 0 0 0.0.0.0.976 0.0.0.0.0
1ef17d80 UDP 0 0 0.0.0.0.1044 0.0.0.0.0
1ef19640 UDP 0 0 127.0.0.1.1043 127.0.0.1.1036
1ef195bc UDP 0 0 127.0.0.1.1042 127.0.0.1.1035
1ef19430 UDP 0 0 164.54.160.75.1041 0.0.0.0.0
1ef193ac UDP 0 0 127.0.0.1.1040 127.0.0.1.1029
1ef19328 UDP 0 0 127.0.0.1.1039 127.0.0.1.1028
1ef19094 UDP 0 0 0.0.0.0.1038 0.0.0.0.0
1ef18f8c UDP 0 0 0.0.0.0.1037 0.0.0.0.0
1ef18d7c UDP 0 0 0.0.0.0.984 0.0.0.0.0
1ef18bf0 UDP 0 0 0.0.0.0.1036 0.0.0.0.0
1ef18b6c UDP 0 0 0.0.0.0.1035 0.0.0.0.0
1ef18a64 UDP 0 0 0.0.0.0.1034 0.0.0.0.0
1ef187d0 UDP 0 0 0.0.0.0.5065 0.0.0.0.0
1ef18434 UDP 0 0 0.0.0.0.1030 0.0.0.0.0
1ef183b0 UDP 0 0 0.0.0.0.1029 0.0.0.0.0
1ef18098 UDP 0 0 0.0.0.0.1028 0.0.0.0.0
1ef18014 UDP 0 0 0.0.0.0.5064 0.0.0.0.0
1ef17f90 UDP 0 0 164.54.160.75.1031 164.54.160.255.5065
1ef17e88 UDP 0 0 0.0.0.0.1027 0.0.0.0.0
1ef17e04 UDP 0 0 0.0.0.0.1026 0.0.0.0.0
1ef17cfc UDP 0 0 0.0.0.0.1025 0.0.0.0.0
value = 1 = 0x1
This is the output of casr:
ioc13ida> casr
Channel Access Server V4.13
Connected circuits:
TCP 164.54.160.111:57929(millenia.cars.aps.anl.gov): User="root", V4.13, 972 Channels, Priority=0
TCP 164.54.160.82:56909(corvette.cars.aps.anl.gov): User="epics", V4.13, 333 Channels, Priority=0
TCP 164.54.160.100:4756(miata): User="lvp_user", V4.11, 288 Channels, Priority=0
TCP 164.54.160.82:56912(corvette.cars.aps.anl.gov): User="epics", V4.13, 2 Channels, Priority=0
TCP 164.54.160.82:56917(corvette.cars.aps.anl.gov): User="epics", V4.13, 12 Channels, Priority=80
TCP 164.54.160.74:1121(ioc13bma): User="iocboot", V4.13, 1 Channels, Priority=80
TCP 164.54.160.93:60049(Ferrari): User="XAS_User", V4.13, 53 Channels, Priority=0
TCP 164.54.160.93:60439(Ferrari): User="XAS_User", V4.13, 260 Channels, Priority=0
TCP 164.54.160.103:1027(ioc13idd): User="iocboot", V4.13, 3 Channels, Priority=80
TCP 164.54.160.42:49774(Volt): User="xas_user", V4.13, 1 Channels, Priority=0
TCP 164.54.160.86:51123(vincent): User="epics", V4.13, 7 Channels, Priority=0
TCP 164.54.160.116:64274(Charger): User="gpd_user", V4.13, 17 Channels, Priority=0
TCP 164.54.160.94:1026(ioc13idc): User="iocboot", V4.13, 3 Channels, Priority=10
TCP 164.54.160.94:1028(ioc13idc): User="iocboot", V4.13, 4 Channels, Priority=0
TCP 164.54.160.24:49949(GT500): User="gpd_user", V4.13, 9 Channels, Priority=0
TCP 164.54.160.82:51091(corvette.cars.aps.anl.gov): User="gpd_user", V4.13, 3 Channels, Priority=10
TCP 164.54.160.46:3743(ringoTABLETPC): User="epics", V4.8, 9 Channels, Priority=0
TCP 164.54.160.111:42182(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:42449(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:43901(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:44071(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.111:44241(millenia.cars.aps.anl.gov): User="nobody", V4.13, 1 Channels, Priority=0
TCP 164.54.160.116:55939(Charger): User="gpd_user", V4.13, 46 Channels, Priority=0
TCP 164.54.160.61:53360(Patriot): User="gpd_user", V4.13, 9 Channels, Priority=0
Has anyone else seen a problem like this, or have any suggestions on how to track it down?
Thanks,
Mark