|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0002784||OpenFOAM||[All Projects] Bug||public||2017-12-05 13:03||2017-12-08 15:57|
|Platform||GNU/Linux||OS||Cray Linux Environment (CLE)||OS Version||5|
|Target Version||Fixed in Version|
|Summary||0002784: Collated parallel I/O performance significantly worse than uncollated|
|Description||After some recent testing, it has been found that using the collated file format on the UK national supercomputing service "Archer" is approximately 100% slower (i.e. roughly double the run time) compared with the traditional uncollated format.|
This was tested using the simpleFoam solver and the motorBike tutorial case, tweaked to increase the problem size so that decomposition to 1032 ranks was viable. Compilation was performed using the default Cray GNU toolchain available on Archer, which includes Cray MPI (MPICH-based). All collated runs were made with threading enabled in the MPI library. Tests were performed at 528 and 1032 MPI ranks with a one-to-one rank-to-core mapping.
It was expected that the collated format might not always be faster, given variables such as Lustre file system performance and problem-specific factors, but seeing performance diverge negatively for the collated format as the number of ranks increased was unexpected behaviour.
|Steps To Reproduce||1) Compile the latest OpenFOAM 5.0 package on Archer (or equivalent machine).|
2) Create 4 instances of the simpleFoam motorBike tutorial case.
3) Modify the blockMeshDict file to read as:
hex (0 1 2 3 4 5 6 7) (60 24 24) simpleGrading (1 1 1)
4) Modify the snappyHexMeshDict file so the following are set: "maxLocalCells 200000;" and "maxGlobalCells 3000000;"
5) Modify the controlDict file so endTime = 100, deltaT = 0.25 and writeInterval = 5.
6) Modify decomposeParDict so the method is "scotch"
7) Post-process each of the 4 cases so that 2 use the uncollated file format with 528 and 1032 ranks respectively, and the other 2 use the same numbers of ranks with the collated file format.
8) Submit each job in the usual manner, but ensure the following environment variable is set in the job script before anything is executed:
export FOAM_FILEHANDLER=uncollated/collated (delete as appropriate)
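The steps above might be wrapped in a batch script along these lines (a sketch only: the PBS directives, node count, and the MPICH_MAX_THREAD_SAFETY setting are assumptions about the Archer Cray environment, not values taken from this report):

```shell
#!/bin/bash
#PBS -N motorBike-io-test
#PBS -l select=43            # ~1032 cores at 24 cores/node (assumed)
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

# Cray MPI needs full thread support for the collated write thread (assumed)
export MPICH_MAX_THREAD_SAFETY=multiple

# Select the file handler under test: uncollated or collated
export FOAM_FILEHANDLER=collated

aprun -n 1032 simpleFoam -parallel > log.simpleFoam 2>&1
```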
|Additional Information||As an example of the timings, using 1032 ranks with the uncollated file format at Time = 50 (200 iterations): ExecutionTime = 184.82 s, ClockTime = 258 s; using the collated format at exactly the same point: ExecutionTime = 447.54 s, ClockTime = 456 s (based on two runs of each case).|
Timings for the case with 528 ranks are as expected, though not quite linear; whatever causes the slowdown with the collated format appears to get worse as the number of ranks increases.
Correction: whilst probably not important, the blockMeshDict line was entered incorrectly in the original bug report; the actual values used were:
hex (0 1 2 3 4 5 6 7) (80 32 32) simpleGrading (1 1 1)
Additional note: performance of the collated cases does not appear to change when the Lustre stripe size is modified. This is to be expected, as all data is written by the master process (i.e. this is not a true parallel I/O implementation).
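For reference, the striping experiment could be run along these lines (`lfs setstripe`/`lfs getstripe` are the standard Lustre tools, but the stripe count and the `processors` directory name are illustrative assumptions, not taken from the report):

```shell
# Stripe the collated output directory across 8 OSTs before writing (illustrative)
lfs setstripe -c 8 processors/
lfs getstripe processors/
```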
It can therefore be suggested that this particular performance degradation comes down to the fact that a single node on Archer (and on most current HPC systems) can sustain only around 50 MB/s: with the uncollated file format this load is spread over the 40+ nodes of the 1032-rank case, whereas the collated case relies on a single node to write all data.
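A back-of-envelope model makes the scaling argument concrete. The bandwidth and data-volume figures below are the approximate numbers quoted in this report; the 24 cores-per-node figure for Archer is an assumption used only to estimate the node count:

```python
# Toy bandwidth model of collated (master-only) vs uncollated (per-rank) writes.
data_mb = 500.0     # data written per dump, approx. (from the report)
node_bw = 50.0      # sustained MB/s per node, approx. (from the report)
nodes = 1032 // 24  # 1032 ranks at an assumed 24 cores/node -> 43 nodes

collated = data_mb / node_bw               # one node writes everything
uncollated = data_mb / (node_bw * nodes)   # load spread over all nodes

print(f"collated:   {collated:.1f} s per dump")
print(f"uncollated: {uncollated:.2f} s per dump")
```

The model also shows why the gap widens with rank count: the collated write time is constant while the uncollated time shrinks with the number of nodes.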
I will leave this open as a bug that requires a tweak: while the collated implementation significantly reduces the number of files (and therefore makes large parallel jobs on HPC systems feasible), the fact that the master process does all I/O means scalability is very low, and in fact significantly worse than the old uncollated method.
The suggestion is that the collated system needs further improvement to use per-rank parallel I/O rather than relying on the master node.
I don't know the details of parallel I/O, only an overview of it. Nevertheless, a couple of questions spring to mind:
1) The parallel I/O is threaded, so if the memory buffer of the thread is large enough, the master process writes in the background while the solver continues to run in parallel, and the write time should not affect the overall simulation time. Why, then, does the write time matter?
2) The written file is a concatenation of the data on each processor, from processor 0, 1, etc. If you wrote using all processes, how would you speed up the overall write time, given that the data from each processor is written sequentially?
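The overlap argument in question 1 can be sketched with a toy producer/consumer model (illustrative only, not OpenFOAM's actual OFstreamCollator code): the solver thread hands each snapshot to a background writer via a bounded buffer, so disk writes overlap computation as long as the buffer has room.

```python
import threading
import time
import queue

def writer(q):
    """Background write thread: drains snapshots from the buffer."""
    while True:
        data = q.get()
        if data is None:       # sentinel: no more snapshots
            break
        time.sleep(0.01)       # stand-in for a slow disk write

buf = queue.Queue(maxsize=2)   # bounded buffer ~ the thread buffer size
t = threading.Thread(target=writer, args=(buf,))
t.start()

start = time.monotonic()
for step in range(5):
    time.sleep(0.01)           # stand-in for one solver iteration
    buf.put(b"snapshot")       # blocks only when the buffer is full
buf.put(None)
t.join()
elapsed = time.monotonic() - start
# With a large enough buffer, elapsed approaches the compute time alone;
# with maxsize too small, puts block and I/O serializes with compute.
```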
If there is a plan to improve this, who will fund the work? Bear in mind HPC users would be the beneficiaries of the work and the OpenFOAM Foundation has never received funding or code contributions from operators or manufacturers of HPC systems. More information about funding here: https://openfoam.org/funding/
1) Is the slow-down due to startup or when writing? The implementation in OpenFOAM 5 is not very good for reading the collated format, and the number of file operations still scales with the number of processors. We're working on this. (The number of file operations for writing and file checking should be independent of the number of processors.)
2) The threading should make sure that the writing is done whilst the simulation is still running. Could you run with the debug switch for OFstreamCollator set to 1 and, e.g., do some timestamping (mpirun --timestamp-output)?
This should give you feedback on the starting/exiting of the write thread. If you don't have enough memory allocated for the thread buffer, it will ultimately block until there is space.
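The debug switch and thread buffer mentioned here are set in OpenFOAM's global etc/controlDict (the values below are illustrative; check the etc/controlDict shipped with your installation for the exact switch names in your version):

```
DebugSwitches
{
    OFstreamCollator        1;      // log start/exit of the write thread
}

OptimisationSwitches
{
    // Memory available to the collated write thread; a large value
    // lets writes proceed in the background without blocking
    maxThreadFileBufferSize 2e9;
}
```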
3) If your ultimate file size is < 2 GB, the current dev line will do the communication using a non-blocking gather in the simulation thread and only do the writing in the write thread.
4) Does your actual write speed reach the 50 MB/s? I.e. is this the limiting factor? In that case you can forget about 1-3, and the only way out is indeed running with the original, non-collated format with distributed files (= per-rank I/O), or an in-between form, e.g. per-node or per-rack I/O.
5) What kind of sustained throughput does a modern SSD have?
Thank you both for your comments. I will do a bit more research over the coming weeks when I have some free time and see if I can pinpoint the reason for the slowdown.
The test case writes less than 500 MB per time step, but it does so every 5th iteration. This was deliberately designed as a stress test.
What I didn't expect, however, was this: given that Archer has a parallel file system (Lustre), multiple processes writing to a single file should in theory be faster (indeed, the number of ranks writing per compute node can be tuned to suit each system). The uncollated file format, besides creating huge numbers of files at even moderate MPI rank counts (something parallel file systems frown upon), should also be hugely taxing because of its structure. So even if the collated format uses an MPI collective to the master process, which then writes to a single file asynchronously using threading, intuition says it should be faster.
I'll properly instrument the code and collect some runs with performance data to see if that helps pin things down.
|2017-12-05 13:03||StephenL||New Issue|
|2017-12-05 13:03||StephenL||Tag Attached: Parallel I/O|
|2017-12-05 13:08||StephenL||Note Added: 0009131|
|2017-12-05 14:40||StephenL||Note Added: 0009132|
|2017-12-06 11:30||chris||Note Added: 0009135|
|2017-12-07 19:48||MattijsJ||Note Added: 0009138|
|2017-12-08 15:57||StephenL||Note Added: 0009141|