Argonne National Laboratory

Experimental Physics and
Industrial Control System


Subject: RE: Area Detector and high performance NVME devices
From: <ulrik.pedersen@diamond.ac.uk>
To: <Engbretson@anl.gov>
Cc: tech-talk@aps.anl.gov
Date: Thu, 29 Jun 2017 12:10:57 +0000
Hi Mark,

Apologies, I'll just use this opportunity for a bit of shameless promotion: using HDF5 with high-performance detectors is a very wide and complex topic that can be discussed at great length. So come along and do just that at the HDF5 workshop at ICALEPCS 2017 in sunny Barcelona: http://www.icalepcs2017.org/index.php/program/workshops#HDF5


> I know that I could probably create multiple HDF file plugins - just no idea what would actually happen if they all looked at the same image buffer stream. Would they lock so that each one got a unique NDArray, or might the same image appear in multiple output files?

There is no point in creating multiple instances of the HDF5 file writer plugin: the HDF5 library implements thread safety with a crude global lock, so you would not see a performance increase.

> In theory, under Cygwin or Linux, one can build the Open MPI libraries that pHDF5 requires, but I suspect that the HDF file plugin would still not automagically become multithreaded and vastly faster running the same code.

I did some work a few years ago using parallel HDF5 to try to increase performance. MPI scales across processes, not threads: each process runs single-threaded HDF5. You cannot build the parallel HDF5 library with MPI into areaDetector because areaDetector runs in a single process. If you receive the data stream in an areaDetector driver, you would need to split and fan out the stream over some IPC mechanism to the MPI/pHDF5 file-writer processes.

In my experience this does not scale as well as one would expect, and I would not advise going this way, at least not unless you scale to 100s or 1000s of writer nodes. Writing from multiple independent (i.e. non-MPI) processes to individual files is much faster. You can then tie these datasets together using the new HDF5 Virtual Dataset (VDS) feature to provide a single, coherent dataset 'view' for reading/processing. VDS is available from HDF5 version 1.10. This is what we are doing at Diamond now for new fast/parallel detectors.
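As a sketch of the VDS approach (assuming h5py >= 2.9 on top of HDF5 >= 1.10; the file and dataset names, counts, and frame sizes here are made up for illustration):

```python
import h5py
import numpy as np

# Hypothetical layout: four independent writer processes, each producing
# 10 frames into its own file under a dataset named "data".
n_files, frames, h, w = 4, 10, 8, 6

# Create the per-writer files so the sketch is self-contained; in practice
# these come from the parallel file-writer processes.
for i in range(n_files):
    with h5py.File(f"writer_{i}.h5", "w") as f:
        f.create_dataset("data", data=np.full((frames, h, w), i, dtype=np.uint16))

# Map each source file onto one contiguous slab of a virtual dataset.
layout = h5py.VirtualLayout(shape=(n_files * frames, h, w), dtype=np.uint16)
for i in range(n_files):
    src = h5py.VirtualSource(f"writer_{i}.h5", "data", shape=(frames, h, w))
    layout[i * frames:(i + 1) * frames] = src

# Readers see one coherent (40, 8, 6) dataset spanning all four files.
with h5py.File("full_view.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```

The source files are never copied: the virtual dataset only stores the mapping, so the 'view' stays cheap no matter how large the per-writer files get.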

> "Direct Chunk". Hmmm, don't suppose this is already in the HDF5 libs and one could tweak some code to make use of it? Would be interested to see a speed test on the NVMe hardware.

Yes, Direct Chunk Write is available from HDF5 1.8.11 onwards. Even h5py now has support for it, so you can script some benchmarking in Python!
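For example, a minimal h5py benchmark of Direct Chunk Write could look like this (frame sizes are scaled down here; the dataset is chunked as one chunk per frame so that each call lands exactly one image):

```python
import time

import h5py
import numpy as np

frames, h, w = 50, 256, 256  # small stand-in sizes; real frames are 4096x3078
frame = np.random.randint(0, 2**16, size=(h, w), dtype=np.uint16)
raw = frame.tobytes()

with h5py.File("direct_chunk.h5", "w") as f:
    # One chunk per frame, so each direct write stores exactly one image.
    dset = f.create_dataset("data", shape=(frames, h, w),
                            chunks=(1, h, w), dtype=np.uint16)
    t0 = time.perf_counter()
    for i in range(frames):
        # Bypasses the HDF5 filter pipeline: the bytes are written as-is.
        dset.id.write_direct_chunk((i, 0, 0), raw, filter_mask=0)
    elapsed = time.perf_counter() - t0

print(f"{frames * frame.nbytes / elapsed / 1e6:.0f} MB/s")
```

On an NVMe device the loop should be limited mostly by the raw write speed rather than by the HDF5 pipeline, which is the point of the feature.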


Cheers,
Ulrik

-----Original Message-----
From: Mark S. Engbretson [mailto:Engbretson@anl.gov] 
Sent: 28 June 2017 23:05
To: Pedersen, Ulrik (DLSLtd,RAL,TEC)
Cc: rivers@cars.uchicago.edu; tech-talk@aps.anl.gov
Subject: RE: Area Detector and high performance NVME devices

The impressive hardware is probably the Euresys Coaxlink frame grabber with a 2.5 GB/s readout rate and the current NVMe disk technology, which claims write benchmarks of 8 GB/s (in a RAID-0 configuration).  But I have heard people talking about cameras with 17 GB/s acquire rates, and even some in the 100's, that are/will be available Real Soon Now.

Yes, everyone wants HDF5-formatted data, but as you pointed out, the single-threaded HDF5 pipeline appears to choke before it hits write limits on modern hardware. Raw binary file writing on this hardware is actually easy, with similar caveats to Lustre's (i.e., writes must be a multiple of the NVMe sector size, and buffers might need to be memory-page aligned). But when the hardware runs at twice the write speed of the camera, lots of issues go away.
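The sector-size caveat can be sketched like this (assuming a 4096-byte logical sector and a hypothetical 4000x3000 8-bit frame; real devices should be queried for their actual sector size):

```python
import mmap

SECTOR = 4096  # assumed NVMe logical sector size; query the device in practice

def padded_size(nbytes: int, sector: int = SECTOR) -> int:
    """Round a write size up to the next sector multiple."""
    return -(-nbytes // sector) * sector

# A hypothetical 4000x3000 8-bit frame is not sector-aligned, so the
# final write of each frame must be padded out to a full sector.
frame_bytes = 4000 * 3000
buf_len = padded_size(frame_bytes)
print(frame_bytes, buf_len)  # 12000000 12001280

# An anonymous mmap is page-aligned, which direct (unbuffered) I/O
# typically requires of its buffers; O_DIRECT itself is Linux-specific.
buf = mmap.mmap(-1, buf_len)
```

The padding bytes are discarded when the file is post-processed, which is one reason raw dumps need a sidecar description of the true frame geometry.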

I know that I could probably create multiple HDF file plugins - just no idea of what would actually happen if they all look at the same image buffer stream. Would they lock so that each one would get a unique NDArray, or might the same image appear in multiple output files? 

In theory, under Cygwin or Linux, one can build the Open-MPI Libs that PHDF5 requires, but I would suspect that the HDF file Plugin would still not automagically become multithreaded and vastly faster running the same code. 

"Direct Chunk". Hmmm, don't supposed that This is already in the HDF5 libs and one could tweak some code to make use out of it, would be interested to see a spped test on the nvme hardware.

Me


-----Original Message-----
From: ulrik.pedersen@diamond.ac.uk [mailto:ulrik.pedersen@diamond.ac.uk]
Sent: Wednesday, June 28, 2017 3:40 PM
To: Engbretson@anl.gov
Cc: rivers@cars.uchicago.edu; tech-talk@aps.anl.gov
Subject: Re: Area Detector and high performance NVME devices

Hi Mark,

So this is quite an impressive detector system! We have some fast detectors here at Diamond, but we have not used the areaDetector HDF5 file writer beyond what can fit through a 10 Gbps Ethernet pipe. Writing faster than that presents a number of challenges, as you have noticed.

Writing ‘raw’ binary files will probably always be the most performant options (if you do it right and tune the I/O pattern for the file system). However, you lose a lot of goodness by not having a container like HDF5 around it…

There are a lot of tuning parameters available in the HDF5 library that I assume you have played around with: chunking and flushing parameters, boundary alignment, etc. The first thing you have to figure out about your file system is what size of ‘chunks’ (i.e. individual IO writes) it likes in order to perform best - and does it require write operations to start on specified boundaries? Our Lustre and GPFS file systems like 1MB and 4MB boundaries for example.
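For reference, the boundary alignment can be requested through `H5Pset_alignment`; a minimal sketch using h5py's low-level API (the 1 MiB threshold and 4 MiB boundary mirror the file-system numbers mentioned above, and the file/dataset names are made up):

```python
import h5py
import numpy as np

# H5Pset_alignment via h5py's low-level API: every file allocation of at
# least 1 MiB (the threshold) is placed on a 4 MiB boundary.
fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
fapl.set_alignment(1024 * 1024, 4 * 1024 * 1024)  # threshold, alignment

fid = h5py.h5f.create(b"aligned.h5", flags=h5py.h5f.ACC_TRUNC, fapl=fapl)
with h5py.File(fid) as f:
    # 1 MiB chunks (1 x 1024 x 512 uint16) meet the threshold, so each
    # chunk allocation starts on a 4 MiB boundary in the file.
    dset = f.create_dataset("data", shape=(10, 1024, 512),
                            chunks=(1, 1024, 512), dtype=np.uint16)
    dset[0] = np.arange(1024 * 512, dtype=np.uint16).reshape(1024, 512)
```

Note the trade-off: alignment wastes the space between the end of one chunk and the next boundary, which is why the threshold exists.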

The HDF5 library operates a pipeline when doing file I/O: by default, all read/write operations pass the data through this pipeline in order to do things like compression or datatype conversion. However, even when you’re not using these features (e.g. when you’re “just” streaming a lot of uint16 pixels to an image dataset), the data passes through the pipeline, and that has a certain performance overhead - i.e. a CPU or perhaps even memory-I/O bottleneck.

Fortunately there is a way to circumvent this internal HDF5 pipeline using a feature that was developed for Dectris because they wanted to be able to write out HDF5 datasets consisting of pre-compressed images. The single-threaded HDF5 pipeline was too slow for their compression requirements. The Direct Chunk Write [1] feature can be used with or without compressed datasets and because it circumvents the pipeline it basically just does a simple write under the hood.

From my tests (a good while ago now), Direct Chunk Write from multiple processes performed much better than parallel HDF5 (which, by the way, is not supported in the areaDetector HDF5 file writer).

The Direct Chunk Write functionality should be added to the areaDetector HDF5 file writer. I will raise a ticket on github to discuss how best to do that. That should enable nearly the same performance as your ‘raw binary’ write.

If that turns out to not be enough, the next step would be to parallelise the problem by splitting the stream into multiple file writer processes. This is also something we are working on at Diamond for our parallel high-performance detector systems.

Cheers,
Ulrik

[1]: Direct Chunk Write https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite/ 


> On 27 Jun 2017, at 21:07, Mark S. Engbretson <Engbretson@anl.gov> wrote:
> 
> Pete suggested something like that: have the HDF file hold a pointer
> to each raw image file, but we cannot sustain the write rate with single files.
> That is why I was originally trying to tweak this file writer into
> something that would acquire all the data, but would then generate
> reasonable output after the fact.
> 
> 2BM seems to be willing to start with perhaps one 15-minute acquire
> per hour, spending the other 45 minutes either processing the data or
> getting it off the computer to prepare for another. Which may just be
> enough if the raw data is buffered and the HDF write goes out at
> whatever rate it can keep up at.
> 
> I hadn't thought of this raw plugin abstracting a much smaller image
> that could be a placeholder in an HDF file. Mostly since someone would
> still have to post-process both of these files after the fact. No file
> plugins have implemented the read functions yet.
> 
> -----Original Message-----
> From: Mark Rivers [mailto:rivers@cars.uchicago.edu]
> Sent: Tuesday, June 27, 2017 2:48 PM
> To: Mark S. Engbretson <Engbretson@anl.gov>; tech-talk@aps.anl.gov
> Subject: RE: Area Detector and high performance NVME devices
> 
> Hi Mark,
> 
> I just tried with simDetector on a Windows 7 system with 8 cores, 15K 
> RPM SAS Raid-0 disk, 96 GB RAM.
> 
> 4096 x 3078 Int8 images = 12 MB/image.
> 
> The simDetector is generating about 150 frames/s = 1.8 GB/s.
> 
> This is the output of camonitor on the ArrayRate_RBV and WriteFile_RBV 
> PVs in the HDF5 plugin:
> 
> corvette:simDetectorIOC/iocBoot/iocSimDetector>camonitor -tc 
> 13SIM1:HDF1:ArrayRate_RBV 13SIM1:HDF1:WriteFile_RBV
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:21.809029) 0
> 13SIM1:HDF1:WriteFile_RBV      (2017-06-27 14:15:21.809162) Done
> 13SIM1:HDF1:WriteFile_RBV      (2017-06-27 14:15:31.451841) Writing STATE MINOR
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:33.410624) 34
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:34.411782) 54
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:35.410801) 55
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:37.408966) 52
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:38.408085) 54
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:39.407156) 48
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:40.408254) 53
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:41.409449) 55
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:42.410515) 45
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:43.411498) 50
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:44.411601) 47
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:45.410666) 53
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:46.409790) 50
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:47.408919) 54
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:48.407977) 53
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:49.407138) 52
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:50.408088) 60
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:51.409265) 62
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:52.410457) 52
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:53.411547) 59
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:54.411524) 62
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:55.411724) 55
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:56.410814) 60
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:57.409946) 57
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:58.408960) 59
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:15:59.407069) 60
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:00.408095) 42
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:01.409358) 17
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:02.410576) 27
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:03.411677) 25
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:04.413487) 27
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:05.411605) 25
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:06.411612) 24
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:07.409803) 23
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:11.410220) 19
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:12.411320) 18
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:13.412261) 21
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:15.412525) 20
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:16.411553) 22
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:17.410808) 24
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:20.409031) 28
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:21.411129) 25
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:22.411199) 24
> 13SIM1:HDF1:WriteFile_RBV      (2017-06-27 14:16:22.818068) Done
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:23.412388) 1
> 13SIM1:HDF1:ArrayRate_RBV      (2017-06-27 14:16:24.413361) 0
> 
> So for the first 25 seconds or so it is writing at about 55 frames/s =
> 660 MB/s.  This is probably filling the Windows file cache.  It then 
> slows down to about 23 frames/s = 280 MB/s, which is probably the 
> steady state write speed of the disks.
> 
> So I agree that it will probably be difficult to write HDF5 files at 
> the full rate of your camera, which is 190 frames/s = 2.3 GB/s.
> 
> One possible solution would be to write your RAW files that can keep 
> up, and also write HDF5 files of "thumbnail" data that is either 
> cropped or binned to 512x384 for example.  You can then store all the 
> metadata in the HDF file and the images in the RAW file.  Later on you 
> can either merge these 2 files into a large HDF5 file, or just keep them separate.
> 
> Mark
> 
> ________________________________
> From: Mark S. Engbretson [Engbretson@anl.gov]
> Sent: Tuesday, June 27, 2017 1:51 PM
> To: Mark Rivers; tech-talk@aps.anl.gov
> Subject: RE: Area Detector and high performance NVME devices
> 
> I do not have enough memory to create a queue large enough to buffer 
> all the images.  2BM wants to acquire the camera stream for at least
> 15 minutes, and ideally as long as possible, so we are talking about 2-4 TB.
> Or larger if they buy huge NVMe chips or multiple Turbo Z units - the 
> computer supports having 3 of them.
> 
> I have allocated very large buffers, but for a 4096 by 3078 image 
> being created at 190 FPS, the file plugins would have to be able to keep up . . .
> and they don't.
> 
> Using the simDetector, I can create such images at ~260 FPS. HDF5 is 
> only writing the file out at about 60 FPS, using the default settings.
> netCDF writes about 30 FPS.  A buffer of 4000 in both cases only 
> lasted for about a minute. The raw file plugin slows the simDetector 
> acquire rate to about 170-180 FPS, which the plugin can keep up with.
> The standard arrays plugin by itself also slows simDetector to about the same rate.
> 
> 
> From: Mark Rivers [mailto:rivers@cars.uchicago.edu]
> Sent: Tuesday, June 27, 2017 1:06 PM
> To: 'Mark S. Engbretson' <Engbretson@anl.gov>; tech-talk@aps.anl.gov
> Subject: RE: Area Detector and high performance NVME devices
> 
> Hi Mark,
> 
> I am surprised that a raw file plugin is significantly faster than 
> netCDF or HDF5.  I would like to see the tests, and figure out what is 
> actually slowing them down, i.e. is it CPU bound, waiting for a 
> semaphore, etc.?  Can you post actual benchmark results for the 
> different plugins, i.e. frames/s and MB/s?
> 
> You should not need to do anything special to create a FIFO to buffer 
> images while the disk is busy.  Every areaDetector plugin comes with 
> such a FIFO, i.e. its input queue.  Just increase the QueueSize to be 
> large enough to buffer all the images you need to store in one 
> "burst".  You can also use the CircularBuffer plugin to do this, but 
> it should really not be necessary, that is intended more for 
> "triggered" applications where the buffer is emptied when a trigger condition is satisfied.
> 
> Mark
> 
> 
> From: Mark S. Engbretson [mailto:Engbretson@anl.gov]
> Sent: Tuesday, June 27, 2017 12:24 PM
> To: Mark Rivers; tech-talk@aps.anl.gov<mailto:tech-talk@aps.anl.gov>
> Subject: Area Detector and high performance NVME devices
> 
> Mark -
> 
> I have the Adimec camera, which generates data at ~2.5 GB/s. I 
> recently got my hands on a newer HP 840 with an HP Turbo Z NVMe drive 
> which claims a sustained write speed of 6 GB/s. None of the existing 
> file plugins see any performance increase when writing to this device - 
> I do not think that any are actually write-limited.  I have modified a 
> raw binary file plugin that I obtained from Keenan Lang that easily 
> sustains the camera's write rate until the device is full.
> 
> Problem is - raw data really doesn't do anyone much good. I was 
> thinking that perhaps a quick solution to my problem might be to 
> change this raw file plugin to look/act like a disk-based FIFO or 
> circular buffer. This could collect to the limit of the hardware at 
> full speed, and if someone wanted HDF output, they would just drain 
> this queue at the speed that HDF5 files are generated.  Or is there an 
> easier/better solution? I.e. any way that file plugins can use the new multi-thread model of AD 3.0?
> 
> I know that HDF5 files can be generated at very high speeds on Lustre 
> file systems, but this seems to use parallel HDF5. Is this 
> something that areaDetector supports?
> 

----------------------------------------------
Ulrik Kofoed Pedersen
Head of Beamline Controls
Diamond Light Source Ltd
Phone: +44 1235 77 8580



