View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0002784||OpenFOAM||[All Projects] Bug||public||2017-12-05 13:03||2018-07-17 11:27|
|Platform||GNU/Linux||OS||Cray Linux Environment (CLE)||OS Version||5|
|Fixed in Version|
|Summary||0002784: Collated parallel I/O performance significantly worse than uncollated|
|Description||Recent testing has found that using the collated file format on the UK national supercomputing service "Archer" is approximately 100% slower (roughly twice the run time) than using the traditional uncollated format.|
This was tested using the simpleFoam solver and the motorBike tutorial case, tweaked to increase the problem size so that decomposition to 1032 ranks was viable. Compilation used the default Cray GNU toolchain available on Archer, which includes Cray MPI (MPICH-based). All collated runs were made with threading enabled in the MPI library. Tests were performed at 528 and 1032 MPI ranks with a one-to-one rank-to-core mapping.
It was expected that the collated format might not always be faster, given variables such as Lustre file system performance and problem-specific factors, but it was unexpected to see its performance diverge further from the uncollated format as the number of ranks increased.
|Steps To Reproduce||1) Compile the latest OpenFOAM 5.0 package on Archer (or equivalent machine).|
2) Create 4 instances of the simpleFoam motorBike tutorial case.
3) Modify the blockMeshDict file to read as:
hex (0 1 2 3 4 5 6 7) (60 24 24) simpleGrading (1 1 1)
4) Modify the snappyHexMeshDict file so the following are set: "maxLocalCells 200000;" and "maxGlobalCells 3000000;"
5) Modify the controlDict file so endTime = 100, deltaT = 0.25 and writeInterval = 5.
6) Modify decomposeParDict so the method is "scotch"
7) Post-process each of the 4 cases so 2 use the uncollated file format with 528 ranks and 1032 ranks respectively and the other 2 the same number of ranks but using the collated file format.
8) Submit each job in the usual manner but ensure the following system variables are set in the job script before anything is executed:
export FOAM_FILEHANDLER=uncollated/collated (delete as appropriate)
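A minimal job-script fragment for step 8 might look like the following sketch. Only the FOAM_FILEHANDLER variable comes from the report; the launcher, rank count, and paths are illustrative assumptions for a Cray system like Archer.

```shell
# Job-script sketch: select the file handler before any OpenFOAM binary runs.
# FOAM_FILEHANDLER is the variable named in step 8; everything else here
# (launcher, rank count, log name) is an illustrative assumption.
export FOAM_FILEHANDLER=collated     # or: uncollated

# The solver then picks the handler up from the environment, e.g. on a Cray:
#   aprun -n 1032 simpleFoam -parallel > log.simpleFoam 2>&1
echo "Using file handler: ${FOAM_FILEHANDLER}"
```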
|Additional Information||As an example of timings, using 1032 ranks with the uncollated file format at Time = 50 (200 iterations): ExecutionTime = 184.82 s, ClockTime = 258 s; using the collated format at exactly the same point: ExecutionTime = 447.54 s, ClockTime = 456 s (each figure averaged over two runs).|
Timings for the case with 528 ranks are as expected, though not quite linear; whatever causes the slowdown with the collated format appears to get worse as the number of ranks increases.
Correction: whilst probably not important, the blockMeshDict line was entered incorrectly in the original bug report; the actual values used were:
hex (0 1 2 3 4 5 6 7) (80 32 32) simpleGrading (1 1 1)
Additional note: performance of the collated cases does not appear to change when the Lustre stripe size is modified. This is to be expected, as all data is written by the master process (i.e. this is not a true parallel I/O implementation).
It can therefore be suggested that this particular performance degradation comes down to the fact that a single node on Archer (and most current HPC systems) can sustain around 50 MB/s: with the uncollated file format this load is spread over the 40+ nodes of the 1032-rank case, whereas the collated case relies on a single node to write all data.
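A back-of-envelope check of this argument, using figures from this report (50 MB/s per node; roughly 43 nodes for 1032 ranks at 24 cores per node; and the roughly 400 MB per time output measured later in the thread):

```shell
# Rough estimate: time to write one output, master-only vs spread over nodes.
# All numbers are taken or derived from this report, not measured here.
WRITE_MB=400     # approximate data per time output
NODE_MBPS=50     # sustained per-node write bandwidth claimed for Archer
NODES=43         # 1032 ranks / 24 cores per node

MASTER_S=$(( WRITE_MB / NODE_MBPS ))
echo "master-only (collated): ~${MASTER_S} s per output"
awk -v mb="$WRITE_MB" -v bw="$NODE_MBPS" -v n="$NODES" \
    'BEGIN { printf "spread over nodes (uncollated): ~%.2f s per output\n", mb/(bw*n) }'
```

On these assumptions the master-only write is an order of magnitude slower per output, which is consistent with the divergence seen as rank counts grow.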
I will leave this open as a bug that requires a tweak: while the collated implementation reduces the number of files significantly (and therefore makes large parallel jobs on HPC systems feasible), the fact that the master process does all I/O means scalability is very poor, and in fact significantly worse than the old uncollated method.
The suggestion would be that the collated system needs to be further improved to use per-rank parallel I/O rather than relying on the master node.
I don't know details of parallel I/O but only have an overview of it. Nevertheless, a couple of questions spring to mind:
1) The parallel I/O is threaded, so if the memory buffer of the thread is large enough, the master process writes in the background while the solver continues to run in parallel; the write time should then not affect the overall simulation time. Why, then, does the write time matter?
2) The written file is a concatenation of the data on each processor, from processor 0, 1, etc. If you wrote using all processes, how would you speed up the overall write time if the data from each processor is written sequentially?
If there is a plan to improve this, who will fund the work? Bear in mind HPC users would be the beneficiaries of the work and the OpenFOAM Foundation has never received funding or code contributions from operators or manufacturers of HPC systems. More information about funding here: https://openfoam.org/funding/
1) Is the slow-down due to startup or to writing? The implementation in OpenFOAM 5 is not very good for reading the collated format and the number of file operations still scales with the number of processors. We're working on this. (The number of file operations for writing and file checking should be independent of the number of processors.)
2) The threading should make sure that the writing is done whilst the simulation is still running. Could you run with the debug switch for OFstreamCollator set to 1 and, e.g., do some timestamping (mpirun --timestamp-output)?
This should give you feedback on starting/exiting the write thread. If you don't have enough memory allocated for the thread buffer, it will ultimately block until there is time.
3) If your ultimate file size is < 2 GB, the current dev line will do the communication using a non-blocking gather in the simulation thread, and only do the writing in the separate thread.
4) Does your actual write speed reach the 50 MB/s? I.e. is this the limiting factor? In that case you can forget about 1-3 and the only way out is indeed running with the original, non-collated format with distributed files (= per-rank I/O), or an in-between form, e.g. per-node or per-rack I/O.
5) What kind of sustained throughput does a modern SSD have?
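The debug run suggested in point 2 could be set up roughly as follows. The DebugSwitches mechanism and the OFstreamCollator switch name come from the note above; the --timestamp-output flag is Open MPI's, and a Cray launcher may need a different option.

```shell
# Sketch: write a DebugSwitches block enabling OFstreamCollator tracing.
# In practice this block belongs in the case's system/controlDict (or the
# global etc/controlDict); here it is written to a scratch file.
cat > debugSwitches.txt <<'EOF'
DebugSwitches
{
    OFstreamCollator    1;   // report write-thread start/exit
}
EOF

# Then run with per-line timestamps to correlate solver and write activity:
#   mpirun --timestamp-output -np 528 simpleFoam -parallel
cat debugSwitches.txt
```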
Thank you both for your comments. I will do a bit more research over the coming weeks when I have some free time and see if I can pinpoint the reason for the slowdown.
The test case is writing less than 500 MB per time-step, but it is doing so every 5th iteration. This was deliberately designed as a stress test.
What I didn't expect, however, was the following. Given that Archer has a parallel file system (Lustre), multiple processes writing to a single file should in theory be faster (in fact it has been found that the number of writing ranks per compute node can be balanced to fit each system). The uncollated file format, apart from creating huge numbers of files with only moderate numbers of MPI ranks (something parallel file systems frown upon), should also be heavily taxing because of its structure. Even if the collated format uses an MPI collective to the master process, which then writes to a single file asynchronously using threading, intuitively this should still be faster.
I'll properly instrument the code and collect some runs with performance data to see if that helps pin things down.
Just a comment:
I also see a slowdown using Cray MPI on a Cray cluster with Lustre. The same test case calculated on our local cluster using the collated format and OpenMPI runs much faster. Although we have BeeGFS and not Lustre, I think the slowdown may be caused by the handling of MPI threading in the Cray MPI version.
According to the Cray MPI man pages, the threading implementation is not the fastest one.
On the back of matthias' post, I have done a little more analysis on Archer. Firstly, I tried to compile with OpenMPI but really struggled to get things working (Archer only provides the Cray MPI and Intel MPI libraries as modules). As I have limited time and resources on Archer, I therefore concentrated on testing with their recommended toolchain for OpenFOAM: GNU GCC with Cray MPI.
Running identical cases (as described earlier) using 1032 ranks on 1032 cores twice, the average timings were:
Uncollated: Execution time: 296.8s, Clock time: 376s
Collated (with threading disabled using global controlDict option but enabled in MPI library): Execution time: 207.43s, Clock time: 237s
Collated (with threading enabled both in global controlDict and MPI library): Execution time: 756.885s, Clock time: 774.5s
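The "threading disabled using global controlDict option" runs above were presumably driven by entries along these lines; maxThreadFileBufferSize is the documented switch (0 makes the master block instead of spawning a write thread), but treat the exact values as assumptions, not the reporter's actual settings:

```shell
# Sketch of the global controlDict OptimisationSwitches entries used to
# toggle collated threading; values here are illustrative assumptions.
cat > collatedSwitches.txt <<'EOF'
OptimisationSwitches
{
    fileHandler             collated;
    maxThreadFileBufferSize 0;      // 0 = no write thread (blocking writes)
    // maxThreadFileBufferSize 2e9; // large buffer = threaded writes
}
EOF
cat collatedSwitches.txt
```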
Clearly, using the threaded collated format, at least on Archer with the Cray MPI library, has serious performance implications. That said, the performance seen using the collated format without threading is impressive.
For reference, each time output equates to around 400 MB of data and 71 outputs are written, meaning this case writes just under 30 GB in total. Whether larger cases would see better performance using the threaded mode is a question yet to be answered; however, the above figures suggest there is a real issue with the threading model as things scale up.
I personally do not have enough time at the moment to look into this further, but I welcome any feedback or thoughts from the community, as getting the collated I/O working on large, practical HPC hardware is paramount to the future of OpenFOAM in an HPC context. I would love to spend time doing some big scaling studies with this code, but I simply cannot do that if limited by the classic uncollated format: it produces too many files (I have already been told off by Archer for running the uncollated case with 1032 ranks a number of times!).
||It appears that this issue is in the Cray MPI implementation and should be reported to Cray. Have you tested OpenFOAM with the Intel MPI implementation available on Archer?|
I would tend to agree with you, Henry. When I next get a chance I will re-run using Intel MPI; I started this yesterday but ran into a few issues getting things going with the Intel compilers, so will need to pick it up again when I have time.
For the moment, a colleague is going to re-run on our IBM system using OpenMPI and see if he notices anything similar. That system uses GPFS, but I don't expect that to make much difference compared to Lustre.
It is possibly worth noting for now that when using OpenFOAM on Archer, the collated format is recommended (Archer is very strict with users about file counts) but the threading model should be disabled. As matthias mentioned, the Cray MPI man pages note this limitation themselves (though the performance difference is fairly astounding).
I should also have mentioned that I have experimented with the Lustre file-striping options and found no difference in performance, though this is to be expected if the actual data writes aren't done in parallel so as to exploit Lustre's parallel performance.
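For completeness, the striping experiments would have used the standard Lustre tools along these lines (standard lfs usage, not taken from the report):

```shell
# Inspect (and optionally change) Lustre striping for a case directory.
# Requires a Lustre client; on other file systems lfs is simply absent.
CASE_DIR=${CASE_DIR:-.}
if command -v lfs >/dev/null 2>&1; then
    lfs getstripe "$CASE_DIR"          # current stripe count / size
    # lfs setstripe -c 8 "$CASE_DIR"   # e.g. stripe new files over 8 OSTs
else
    echo "lfs not found: ${CASE_DIR} is not on a Lustre client"
fi
```

Since a single master node does all collated writes, no stripe layout can help; striping only pays off when many clients write concurrently.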
||You shouldn't need to use the Intel compilers to link to the Intel MPI libraries, they should be gcc and clang compatible.|
This issue relates to limitations of the thread support in the Cray MPI implementation.
In a4de83a425dcf17f975143fe6b9d068139012cb4 we've added the option of having multiple I/O ranks. These can either be on the same node or on different nodes. With the collated format, each I/O rank automatically puts its portion of files into a separate processors directory.
See e.g. the IO/fileHandler tutorial.
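A hedged sketch of selecting dedicated I/O ranks with this option; the FOAM_IORANKS variable name and the rank list are assumptions on my part, so check the IO/fileHandler tutorial for the actual mechanism:

```shell
# Sketch (names assumed): collated handler with one designated I/O rank per
# 24-core node; ranks 0-23 would write via rank 0, 24-47 via rank 24, etc.,
# each I/O rank using its own processors directory.
export FOAM_FILEHANDLER=collated
export FOAM_IORANKS='(0 24 48 72)'
echo "I/O ranks: ${FOAM_IORANKS}"
```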
I'd be very interested in seeing this tested (more than me on my two nodes). Feel free to contact me off-forum (gmail.com, mattijs.janssens)
The outcome of issues reported on this platform should be reported openly, not off-forum. Let's summarise as a matter of public record:
- The following commit was made:
- This included native C++11 threading, rather than using the thread support in MPI implementations.
- Performance on ARCHER improved significantly, confirming the thread support in Cray MPI as the problem on ARCHER.
Out of the £43m spent on ARCHER, £0.00 was invested in paying the people who were willing and able to develop the collated parallel I/O functionality in OpenFOAM and subsequently fix this and other issues.
Future HPC procurements must include funding for maintenance of the software applications they use, in particular open source software. Details of OpenFOAM Maintenance here:
I agree, Chris. I have been working with Mattijs offline until there was a significant, single response to add, rather than turning this bug report into a discussion thread.
However, as you state, the latest commit appears to operate nicely on Archer using the standard collated format; the new multi-rank collated I/O format is still under testing.
As a further note, I would just like to publicly thank the Archer HPC service for providing many AUs of compute time via the STFC resource pool to allow this testing to take place.
||@StephenL How many AUs (allocation units) were used for testing?|
Mattijs and I have been doing some more work on the multi-node I/O he mentions in note 0009439. We can confirm that the threaded, collated I/O format now appears to work very nicely on the Cray XC30 platform with the GCC toolchain and Cray MPI implementation. We haven't tested scalability beyond 43 nodes, though it all looks very promising. It is reasonable to assume that similar performance would be seen on other HPC platforms, though whether other I/O systems such as GPFS (Archer is Lustre) would make a difference is hard to say without further testing.
At the moment the multi-node I/O works correctly; however, it appears to introduce a level of overhead that reduces overall performance even for larger cases (tested up to cases where the 500 MB/s per node of Archer is being stretched): although computational time reduces significantly, overall wall-time increases. Mattijs is currently investigating this, as it suggests excessive time may be being spent at an MPI barrier.
I haven't been keeping exact track of kAU usage, and some of this testing has been rolled into other OpenFOAM-related work I have been doing, but I would hazard a guess at around 500 kAU (though, as mentioned, not all dedicated specifically to this task). I only mention this because I like to make sure any resource provider is properly acknowledged in any public outlet.
500 kAU = £260
1 node hr = 24 core hr = 0.36 kAU
500 kAU = 33,333 core hr
I am interested in comparing this cost to:
- the total cost of the collated file format for parallel I/O
- the cost of performance improvement of the collated format (this issue)
- cost of future maintenance of collated format
||I'm afraid I'm missing the point here. Are these reports not meant to improve OpenFOAM for its communities? What does the cost of the compute time have to do with anything?|
Yes, the Issue Tracking System is meant to improve OpenFOAM for its users. The infrastructure of the Issue Tracking System, and the work people do to resolve issues requires funding. The OpenFOAM Foundation promotes the need for funding: https://openfoam.org/funding/
In order to get funding, we have to demonstrate value, and this issue could provide a useful case study for that. It is evident here that the cost of compute time is practically insignificant compared to the cost of software development and maintenance. Rather than settling for an acknowledgement of ARCHER for a small amount of compute time, I would encourage ARCHER to become a sponsor of OpenFOAM through a Maintenance Plan:
This would be good for everyone involved and ARCHER users who benefit from the improvements to OpenFOAM.
Hi Chris, I understand now. I'm afraid that, as just a user of Archer, that isn't something I can comment on; however, what you say makes sense, probably for other significant open-source codes used on the national system as well.
I'll report back once we know more about the multi node format.
||Waiting for feedback from reporter.|
|2017-12-05 13:03||StephenL||New Issue|
|2017-12-05 13:03||StephenL||Tag Attached: Parallel I/O|
|2017-12-05 13:08||StephenL||Note Added: 0009131|
|2017-12-05 14:40||StephenL||Note Added: 0009132|
|2017-12-06 11:30||chris||Note Added: 0009135|
|2017-12-07 19:48||MattijsJ||Note Added: 0009138|
|2017-12-08 15:57||StephenL||Note Added: 0009141|
|2018-01-16 18:40||matthias||Note Added: 0009205|
|2018-01-29 12:28||StephenL||Note Added: 0009231|
|2018-01-29 14:55||henry||Note Added: 0009232|
|2018-01-29 16:16||StephenL||Note Added: 0009233|
|2018-01-29 16:28||henry||Note Added: 0009234|
|2018-02-02 16:25||henry||Assigned To||=> henry|
|2018-02-02 16:25||henry||Status||new => closed|
|2018-02-02 16:25||henry||Resolution||open => no change required|
|2018-02-02 16:25||henry||Note Added: 0009247|
|2018-03-23 16:07||henry||Status||closed => feedback|
|2018-03-23 16:07||henry||Resolution||no change required => reopened|
|2018-03-23 16:12||MattijsJ||Note Added: 0009439|
|2018-04-07 12:13||chris||Note Added: 0009471|
|2018-04-07 12:35||StephenL||Note Added: 0009472|
|2018-04-07 12:35||StephenL||Status||feedback => assigned|
|2018-04-09 07:34||chris||Note Added: 0009473|
|2018-04-10 09:56||StephenL||Note Added: 0009475|
|2018-04-10 18:15||chris||Note Added: 0009479|
|2018-04-10 18:40||StephenL||Note Added: 0009480|
|2018-04-12 18:30||chris||Note Added: 0009489|
|2018-04-12 19:30||StephenL||Note Added: 0009490|
|2018-07-17 11:27||henry||Status||assigned => closed|
|2018-07-17 11:27||henry||Resolution||reopened => suspended|
|2018-07-17 11:27||henry||Note Added: 0009856|