EPICS Controls Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: Using all the cores available on modern processors
From: Mark Rivers <[email protected]>
To: "'[email protected]'" <[email protected]>, "[email protected]" <[email protected]>
Date: Fri, 21 Jun 2013 21:31:11 +0000
Hi Nick,

I've now studied the effect of enabling file writing plugins.

I ran the simDetector at 2048x2048 frames at 50 frames/s (Linux) and 65 frames/s (Windows).  

I then studied what happened when the following file plugins were enabled:

TIFF: This plugin always writes a single frame per file.
netCDF: Tested both a single frame per file and streaming 2000 frames to a single file.
HDF5: Tested both a single frame per file and streaming 2000 frames to a single file.

The details of the tests are reported in an updated version of the document I pointed to earlier:

https://subversion.xor.aps.anl.gov/synApps/areaDetector/trunk/documentation/PluginPerformance.pdf

Here are the major conclusions:

Conclusion 1: The file writing plugins do not slow down the simDetector thread if they are running in Stream mode, where there is a single file creation for 2000 frames. This is true on Linux and Windows.

Conclusion 2: There appears to be a lock problem when files are created: if individual files are being written, the simDetector thread slows down. The slowdown is only about 50% on Linux, but is a factor of 5-6 on Windows. Because the simDetector thread is running more slowly, the other plugins do not drop as many frames. This problem needs to be investigated and fixed.
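
One way to separate the raw cost of file creation from any locking inside areaDetector is to time the two write patterns outside the IOC. The following is a minimal Python sketch under stated assumptions: the function names, the 64 KB stand-in frame, and the 200-frame count are all hypothetical, and it does not use or reproduce the areaDetector file plugins. It only compares creating one file per frame against appending every frame to a single file.

```python
import os
import tempfile
import time

# Small stand-in frame (assumption); the tests above used 2048x2048 frames.
FRAME = bytes(64 * 1024)


def write_single_files(directory, n_frames):
    """Create one file per frame, analogous to single frame/file mode."""
    start = time.perf_counter()
    for i in range(n_frames):
        with open(os.path.join(directory, f"frame_{i:04d}.bin"), "wb") as f:
            f.write(FRAME)
    return time.perf_counter() - start


def write_stream_file(directory, n_frames):
    """Create one file and append every frame to it, analogous to Stream mode."""
    start = time.perf_counter()
    with open(os.path.join(directory, "stream.bin"), "wb") as f:
        for _ in range(n_frames):
            f.write(FRAME)
    return time.perf_counter() - start


def compare(n_frames=200):
    """Time both patterns in fresh temporary directories."""
    with tempfile.TemporaryDirectory() as single_dir, \
            tempfile.TemporaryDirectory() as stream_dir:
        return (write_single_files(single_dir, n_frames),
                write_stream_file(stream_dir, n_frames))


single_s, stream_s = compare()
print(f"one file per frame: {single_s:.3f} s, single stream file: {stream_s:.3f} s")
```

If the per-file pattern is only modestly slower in a standalone test like this, that would point toward locking in the plugin chain rather than the file system itself, but the timings depend heavily on the disk and OS cache.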

Conclusion 3: The throughput of the netCDF and HDF5 file writers in Stream mode was as follows:
netCDF, Windows: (2000-1041)/31.3*4MB = 122 MB/s
HDF5, Windows: (2000-884)/31.4*4MB = 142 MB/s
netCDF, Linux: (2000-238)/40*4MB = 176 MB/s
HDF5, Linux: (2000-208)/41*4MB = 175 MB/s
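
The arithmetic above can be redone with other numbers using a short Python sketch. It assumes that the subtracted figure is the number of frames the plugin dropped out of 2000, that the divisor is the elapsed time in seconds, and that each 2048x2048 8-bit frame is 4 MB. The function name and parameters are illustrative only and are not part of areaDetector.

```python
def stream_throughput_mb_s(total_frames, dropped_frames, elapsed_s, frame_mb=4.0):
    """Return MB/s written: (total - dropped) / elapsed * frame size."""
    written = total_frames - dropped_frames
    return written / elapsed_s * frame_mb


# (total frames, frames assumed dropped, elapsed seconds) from the figures above
runs = {
    "netCDF, Windows": (2000, 1041, 31.3),
    "HDF5, Windows": (2000, 884, 31.4),
    "netCDF, Linux": (2000, 238, 40.0),
    "HDF5, Linux": (2000, 208, 41.0),
}

for name, (total, dropped, elapsed) in runs.items():
    print(f"{name}: {stream_throughput_mb_s(total, dropped, elapsed):.1f} MB/s")
```

The computed values agree with the quoted figures to within 1 MB/s.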

The Linux machine is a server with fast disks; the Windows machine is a laptop with a relatively slow disk.

On Windows, writing individual TIFF, netCDF, or HDF5 files reduced the simDetector frame rate from 64 frames/s to 11-12 frames/s.

I have carefully documented how these tests were all done, so you should be able to try to reproduce them.

Mark


-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Friday, June 21, 2013 4:04 PM
To: Mark Rivers; [email protected]
Subject: RE: Using all the cores available on modern processors

Hi Mark,

That sounds interesting. If I understand correctly, the Windows system was most interesting because it had 8 threads and 11 processes and wasn't getting into the same problem that Matt and I had seen. Maybe it is the file writing that is killing it, because it probably is yielding the CPU and not getting it back when it needs it. However, your results are different to ours so we will have to look at this in detail. I think all our problems have been seen when we had real detectors (and/or were listening on sockets) and were writing data to a file.

Cheers,

Nick Rees
Principal Software Engineer           Phone: +44 (0)1235-778430
Diamond Light Source                  Fax:   +44 (0)1235-446713

-----Original Message-----
From: Mark Rivers [mailto:[email protected]] 
Sent: 21 June 2013 18:34
To: Rees, Nick (DLSLtd,RAL,DIA); [email protected]
Subject: RE: Using all the cores available on modern processors

Hi Nick,

I just did a quick study of the performance on a multi-core Linux and Windows system using the simDetector and 11 plugins all running in their own threads.  The simDetector was producing 1024x1024 8-bit images.  The details are in the attached PDF file.

Here are the main conclusions on a Linux system with dual quad-core CPUs (Intel(R) Xeon(R) CPU E5630 @ 2.53GHz). The system thus has 8 physical cores, but has hyper-threading enabled, and so has 16 virtual cores.

As the frame rate increased from 20 frames/s to 440 frames/s the percentage CPU increased from 300% to 1050%. Plugins began saturating their threads and dropping frames at different frame rates depending on the computation required by that plugin.

A simple view of the system is that there are 11 plugins, each running in its own thread, plus the simDetector thread that computes the images. When the system is running at its maximum possible rate there should thus be 12 cores running at 100% CPU, or 1200% CPU time in "top". In fact we reached 1050% CPU, or 10.5 cores, and at that point the ROI threads were not saturated, since they were dropping only a few frames.
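
The scaling estimate above can be written out as a small Python sketch. The function name and return values are illustrative only; the inputs are the 11 plugin threads plus the simDetector thread and the 1050% reading from "top".

```python
def cpu_scaling(n_threads, observed_percent):
    """Compare an observed "top" CPU percentage with one saturated core per thread."""
    ideal_percent = n_threads * 100.0        # every thread at 100% of a core
    busy_cores = observed_percent / 100.0    # equivalent number of fully busy cores
    fraction_of_ideal = observed_percent / ideal_percent
    return ideal_percent, busy_cores, fraction_of_ideal


# 11 plugin threads + 1 simDetector thread, 1050% observed
ideal, cores, fraction = cpu_scaling(n_threads=11 + 1, observed_percent=1050)
print(f"ideal {ideal:.0f}%, busy cores {cores:.1f}, fraction of ideal {fraction:.3f}")
```

With these inputs the observed load is 87.5% of the simple 1200% ceiling, which is consistent with some threads not yet being saturated.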

Conclusion: the areaDetector plugin architecture and the Linux scheduler are not getting in the way of nearly ideal scaling as the frame rate increases.

I then repeated the tests on a Windows 7 64-bit computer system with dual quad-core CPUs (Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz). The system thus has 8 physical cores, and does not have hyper-threading, so has 8 cores total.

On Windows I increased the frame size to 2048 x 2048 because the minimum time for epicsThreadSleep() did not permit as high a frame rate with the simDetector as on Linux.

As the frame rate increased from 20 frames/s to 150 frames/s the percentage CPU (using the same scale as Linux) increased from 240% to 800% (fully saturated). Plugins began saturating their threads and dropping frames at different frame rates depending on the computation required by that plugin.

Under the maximum frame rate conditions the system is 100% CPU busy, but the simDetector thread was not being held up by the plugins. As on Linux, the processing and some statistics plugins are dropping over 90% of the frames, but other plugins and the simDetector main thread are not being held back by these plugins.
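
The behaviour described above, where a saturated plugin drops frames rather than stalling the producer, follows a general bounded-queue pattern. The following Python sketch illustrates that pattern only; the class and method names are hypothetical and this is not areaDetector's implementation. The producer offers each frame without blocking, and a full queue means the frame is counted as dropped for that plugin.

```python
import queue


class PluginQueue:
    """Bounded input queue for one plugin thread; counts frames it had to drop."""

    def __init__(self, max_frames):
        self._q = queue.Queue(maxsize=max_frames)
        self.dropped = 0

    def offer(self, frame):
        """Called by the producer. Never blocks: if the queue is full, drop the frame."""
        try:
            self._q.put_nowait(frame)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def take(self, timeout=None):
        """Called by the plugin's own thread to fetch the next frame."""
        return self._q.get(timeout=timeout)


# A plugin that is not consuming at all: only the first two frames fit.
slow_plugin = PluginQueue(max_frames=2)
accepted = [slow_plugin.offer(n) for n in range(5)]
print(accepted, slow_plugin.dropped)  # [True, True, False, False, False] 3
```

Because `offer` returns immediately in either case, a slow consumer costs the producer only the failed enqueue, which is the property the measurements above are testing for.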

Conclusion: the areaDetector plugin architecture and the Windows scheduler are not getting in the way of nearly ideal scaling as the frame rate increases.

This study was performed without any file-writing plugins active.

I will now extend the test to include file-writing plugins.

Mark


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
Sent: Thursday, June 20, 2013 5:55 AM
To: [email protected]
Subject: Using all the cores available on modern processors

This is probably aimed at Mark Rivers (sorry Mark!) but I am hoping other people may have seen the same symptoms. As I mentioned at the last EPICS meeting, on a number of development projects over the past year (mostly with area detector systems, but with some others) we have run into what superficially looks like a common problem.

When we configure a system simply (e.g. an areaDetector configuration which just reads out data and writes it to disk and does nothing else apart from scalar status callbacks) then the main CPU load is on a single core and it can use up to 90% of that core quite happily. 

When we make things more complicated and add processing plugins, for example, the data throughput to disk drops off dramatically and frames start buffering up (and some may be dropped). The CPU load is distributed across all cores, but they are only loaded at the 20-30% level, so the system seems largely idle with plenty of processing power available, but it can't be utilised.

Typically we have seen these problems on dual socket NUMA architecture Intel systems, with a non-preemptively scheduled Windows or Linux OS.

In theorising about this we have conjectured a number of scenarios (in some sort of priority order):

1. AreaDetector has a locking issue and processing is held up by tasks holding locks that they shouldn't have.
2. Because this has happened on non-preemptively scheduled systems, the high priority tasks aren't getting CPU for some reason (despite there being CPU available). This may be related to (1) so maybe the system isn't overcoming priority inversions properly.
3. There is some other resource bottleneck - such as a bottleneck in the QPI because of the NUMA architecture.
4. There is a problem with top or the Windows performance monitor which doesn't account for overheads properly in a busy multi-core system.

So, at this stage we have not looked at this problem in any detail; we have just noticed the symptoms. Has anyone else come across it, and do you have any pointers before we invest a serious amount of effort looking at it? I suspect that if we are going to understand this it may take some time, and may have some design implications for EPICS on modern architectures.

Cheers,

Nick Rees
Principal Software Engineer           Phone: +44 (0)1235-778430
Diamond Light Source                  Fax:   +44 (0)1235-446713


-- 
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. 
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
 




