Channel Access servers send beacons, which are UDP broadcast packets, to indicate their health. When a server is healthy, it sends a beacon every EPICS_CA_BEACON_PERIOD (15 sec. by default). This is like a heartbeat. When it is coming up, each server broadcasts a startup sequence of beacons. This sequence starts with a small interval (25 ms, 75 ms for VxWorks) that doubles each time until it reaches EPICS_CA_BEACON_PERIOD and then stays there, becoming the heartbeat. It takes about 10 beacons and 40 s to get to steady state with the defaults. Note that UDP packets are one-way and unreliable. There is no guarantee that they will be received.
The beacons actually get sent to CaRepeater, and CaRepeater sends them to the clients. This is to overcome the problem in early implementations of UDP that all recipients on a given host (i.e. workstation) were not guaranteed to receive all the broadcasts.
A beacon anomaly is any change from the normal beacon interval. This could be no beacons (indicating a disconnect), the startup sequence (indicating a server has come online), or an abnormal interval (perhaps owing to an unhealthy server or to network problems).
Clients monitor the beacons to determine their connection status. If a Channel Access client has not received any beacons for EPICS_CA_CONN_TMO (30 sec. by default) from a server and has not received any other communication from the server for that time, then it sends an "Are you there?" message over its TCP connection to the server. If there is no reply for 5 sec, then the client decides that it is disconnected. MEDM screens turn white at that time. Before EPICS Base 3.14.5 the clients then reissued searches for any unresolved process variables. Starting with Base 3.14.5, that is not done. A client can also decide it is disconnected if it receives a hangup message over the TCP connection and in that case does reissue searches.
Clients also monitor the beacons to see if a server has come online. If they detect the startup sequence, then they reissue searches for any unresolved process variables. Thus, beacon anomalies are associated with searches, which can impact the servers and possibly the network.
When debugging Channel Access network problems, it may be useful to keep track of beacon anomalies. There is a routine, CASW (standing for Channel Access Server Watcher) that comes with EPICS Base and which prints a line with the server name and a timestamp whenever it detects a beacon anomaly. The anomaly logic used should be the same as that used by the client for that version of EPICS Base. This is useful information, but it can be hard to interpret by just looking at the output lines. The output from multiple servers is mixed together and the intervals, which are important, are hard to easily determine from the timestamps.
ParseCASW was written to help interpret the CASW data. It can be used to interpret a file of lines output by CASW, or CASW can be piped into ParseCASW so that the CASW lines are interpreted in real time as they occur.
Beacon anomalies come in groups. ParseCASW defines a group to have ended when there are no more anomalies from that server for 60 sec. (This default interval can be changed via a command line option.) ParseCASW puts these groups into categories as follows:
Single Anomaly | Only one anomaly in the group. Could be owing to a delay in the server or a delay caused by network traffic. |
Short Sequence | 5 or fewer anomalies. |
Regular Beacons | Network coming back. The sequence is not short and the difference between the maximum and minimum intervals is < 0.25 sec. Note that the server is sending out regular beacons in this case, but the average beacon rate estimate is initially very slow because none have been received recently. It typically takes on the order of ten beacons for the average rate to become the same as the received rate. |
Server Coming Up | A sequence that is monotonic and not short with the ratio of the maximum interval to the minimum interval being > 25. |
Probably Server Coming Up | Same as a server coming up except that 2 intervals are allowed to be non-increasing. Owing to network delays the short intervals, especially, may not be received at the same intervals at which they were sent. |
Medium Long Sequence | None of the above and 15 or fewer anomalies. This case may be caused by a server coming up or may not. |
Long Sequence | None of the above and 50 or fewer anomalies. |
Very Long Sequence | More than 50 anomalies. |
Anomalies Out of Order | The time stamps of the anomalies are out of order. (Should not happen unless there is an error in the data.) |
Note that since ParseCASW has to wait until 60 sec. after the last anomaly in a sequence to determine its category, the output when CASW is piped into ParseCASW happens some time after the sequence started. In other words, if you start up a server, do not expect any output to happen until 60 sec. after the average interval gets close to the beacon period.
The image below shows the output from CASW and the output from ParseCASW for the same data as saved and viewed in an editor (Emacs).
(Click here for full-size image)
First note that the CASW output file is much larger, as indicated by the size of the handles on the scroll bars to the right of the windows. The red line shows how the data from iocid04b gets converted to a line in ParseCASW. This is a medium long sequence. Note that there are three medium long sequences from this same IOC in the lower window. This is most likely an unhealthy IOC. The green line shows how the data from ioclid5 gets converted. The line shown in the upper window is part of a short sequence, the rest of which is not visible. This IOC also has three other medium or short sequences and is also most likely unhealthy. (Healthy servers do not have beacon anomalies unless the network is causing delays.) The blue line shows how the lines for iocid1 are interpreted as a server coming up.
The output in the lower window is the default output. The terse output for these three sequences would be:
The verbose output would be:
For information on obtaining ParseCASW, consult the EPICS Documentation. There is another diagnostic, CaSnooper, that monitors search requests instead of beacons. It can be obtained in the same place. It is often useful to use these two routines together, along with CASW.
ParseCASW is started by typing the following on a command line:
parsecasw [options] [filename]
or to pipe CASW into ParseCASW:
casw | parsecasw [options]
The options are:
-help | Help. |
-echo | Echo the input lines as they are read. |
-int <integer> | Do checking and output every <integer> seconds. |
-oag | Use OAG data format. This is the format for CASW archival data at the Advanced Photon Source. |
-server | Sort output by server. The default is by group. |
-terse | Terse output. The default is between terse and verbose. |
-verbose | Verbose output. |
The first character of the options is sufficient. (That is, -h is the same as -help.)
A typical command line is:
% parsecasw -v casw.txt > parsecasw.txt
ParseCASW writes its output to standard output (stdout). You have a choice of terse, default, or verbose output. The following is an example of CASW output (part of the file shown in the editor above) and how it is converted with each of these three options:
iocid03:5064 2004-05-18 14:40:53.126354159 iocid03:5064 2004-05-18 14:41:49.537067355 iocid03:5064 2004-05-18 14:42:04.537256674 iocid03:5064 2004-05-18 14:42:19.538077826 iocid03:5064 2004-05-18 14:42:34.536262728 iocid03:5064 2004-05-18 14:42:49.535585714 iocid03:5064 2004-05-18 14:45:36.613654888 iocid03:5064 2004-05-18 14:45:51.927365873 iocid03:5064 2004-05-18 14:46:07.344241109 iocid03:5064 2004-05-18 14:46:22.328719845 iocid03:5064 2004-05-18 14:46:37.328110497 ioclid5:5064 2004-05-18 15:00:50.452589267 ioclid5:5064 2004-05-18 15:01:05.453590670 ioclid5:5064 2004-05-18 15:01:20.455975488 ioclid5:5064 2004-05-18 15:04:17.460584986 ioclid5:5064 2004-05-18 15:04:35.429871969 ioclid5:5064 2004-05-18 15:04:53.013039452 ioclid5:5064 2004-05-18 15:05:08.042793438 ioclid5:5064 2004-05-18 15:05:42.011020322 ioclid5:5064 2004-05-18 15:06:35.505764438 ioclid5:5064 2004-05-18 15:06:50.512081340 iocid04b:5064 2004-05-18 15:07:02.382328579 iocid04b:5064 2004-05-18 15:07:02.699825995 iocid04b:5064 2004-05-18 15:07:02.820376078 iocid04b:5064 2004-05-18 15:07:02.949850578 iocid04b:5064 2004-05-18 15:07:03.227257161 iocid04b:5064 2004-05-18 15:07:03.773111244 iocid04b:5064 2004-05-18 15:07:04.860120243 iocid04b:5064 2004-05-18 15:07:06.983292074 iocid04b:5064 2004-05-18 15:07:11.248919487 ioclid5:5064 2004-05-18 15:08:02.752896604 iocid01:5064 2004-05-18 15:08:06.104330352 iocid01:5064 2004-05-18 15:08:06.106132852 iocid01:5064 2004-05-18 15:08:06.138858851 iocid01:5064 2004-05-18 15:08:06.208655601 iocid01:5064 2004-05-18 15:08:06.338916851 iocid01:5064 2004-05-18 15:08:06.605788351 iocid01:5064 2004-05-18 15:08:07.145271684 iocid01:5064 2004-05-18 15:08:08.219170849 iocid01:5064 2004-05-18 15:08:10.339235597 iocid01:5064 2004-05-18 15:08:14.606061676 ioclid5:5064 2004-05-18 15:08:50.501317309
iocid03:5064 May 18 14:40:53 Medium long sequence iocid03:5064 May 18 14:45:36 Short sequence ioclid5:5064 May 18 15:00:50 Short sequence ioclid5:5064 May 18 15:04:17 Medium long sequence iocid04b:5064 May 18 15:07:02 Probably server coming up ioclid5:5064 May 18 15:08:02 Short sequence iocid01:5064 May 18 15:08:06 Server coming up
iocid03:5064 Medium long sequence May 18 14:40:53 6 event(s) for 116.41 sec = 1.94 min = 0.03 hours iocid03:5064 Short sequence May 18 14:45:36 5 event(s) for 60.71 sec = 1.01 min = 0.02 hours ioclid5:5064 Short sequence May 18 15:00:50 3 event(s) for 30.00 sec = 0.50 min = 0.01 hours ioclid5:5064 Medium long sequence May 18 15:04:17 7 event(s) for 153.05 sec = 2.55 min = 0.04 hours iocid04b:5064 Probably server coming up May 18 15:07:02 9 event(s) for 8.87 sec = 0.15 min = 0.00 hours ioclid5:5064 Short sequence May 18 15:08:02 2 event(s) for 47.75 sec = 0.80 min = 0.01 hours iocid01:5064 Server coming up May 18 15:08:06 10 event(s) for 8.50 sec = 0.14 min = 0.00 hours
iocid03:5064 Medium long sequence 6 event(s) May 18 14:40:53 to May 18 14:42:49 (116.41 sec = 1.94 min = 0.03 hours) Mean=23.28 Sigma=16.56 Min=15.00 Max=56.41 Increasing=3 iocid03:5064 Short sequence 5 event(s) May 18 14:45:36 to May 18 14:46:37 (60.71 sec = 1.01 min = 0.02 hours) Mean=15.18 Sigma=0.19 Min=14.98 Max=15.42 Increasing=3 ioclid5:5064 Short sequence 3 event(s) May 18 15:00:50 to May 18 15:01:20 (30.00 sec = 0.50 min = 0.01 hours) Mean=15.00 Sigma=0.00 Min=15.00 Max=15.00 Increasing=2 Monotonically increasing ioclid5:5064 Medium long sequence 7 event(s) May 18 15:04:17 to May 18 15:06:50 (153.05 sec = 2.55 min = 0.04 hours) Mean=25.51 Sigma=14.11 Min=15.01 Max=53.49 Increasing=3 iocid04b:5064 Probably server coming up 9 event(s) May 18 15:07:02 to May 18 15:07:11 (8.87 sec = 0.15 min = 0.00 hours) Mean=1.11 Sigma=1.35 Min=0.12 Max=4.27 Increasing=7 ioclid5:5064 Short sequence 2 event(s) May 18 15:08:02 to May 18 15:08:50 (47.75 sec = 0.80 min = 0.01 hours) Mean=47.75 Sigma=0.00 Min=47.75 Max=47.75 Increasing=1 Monotonically increasing iocid01:5064 Server coming up 10 event(s) May 18 15:08:06 to May 18 15:08:14 (8.50 sec = 0.14 min = 0.00 hours) Mean=0.94 Sigma=1.34 Min=0.00 Max=4.27 Increasing=9 Monotonically increasing
Channel Access is the part of EPICS that governs communication between servers and clients. You can find out more information by looking at the EPICS Channel Access Reference Manual. There is a version included with each EPICS release. They can be found by starting at IOC Software in the EPICS Documentation and following links to the release and point release of the desired EPICS base.
Jeff Hill of Los Alamos National Laboratory, the person responsible for Channel Access, was of great help in understanding the operation of Channel Access.
ParseCASW is governed by the EPICS Open License.