View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0001330 | OpenFOAM | Bug | public | 2014-06-22 16:50 | 2016-04-07 17:44 |

| Field | Value |
|---|---|
| Reporter | |
| Assigned To | henry |
| Priority | normal |
| Severity | minor |
| Reproducibility | have not tried |
| Status | resolved |
| Resolution | fixed |
| Fixed in Version | dev |

Summary: 0001330: Scalability issue: Exchange of message sizes
Description: In the framework of the EU project PRACE (http://www.prace-ri.eu/), we have been working with OpenFOAM 2.2.x to carry out FSI simulations for aircraft designs. In the course of this work we identified a scalability issue caused by the exchange of buffer sizes, which is done in Pstream::exchange() (more precisely, in combineReduce()). Our findings are described in detail in Section 4.4 of the following whitepaper: http://www.prace-ri.eu/IMG/pdf/wp172.pdf

As a workaround, we call MPI_Alltoall() directly instead of combineReduce() for the buffer-size communication, resulting in a significant scalability improvement. Attached is the unified diff for this workaround (based on OpenFOAM 2.2.x, commit 6c399a448855e18d335202434b4304203de65).
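To make the pattern concrete, here is a minimal standalone sketch of the size exchange the workaround performs. This is plain MPI/C++, not taken from OpenFOAM or the attached patch, and the dummy count pattern is purely illustrative:

```cpp
// Each rank tells every other rank how many items it is about to send.
// MPI_Alltoall does this in a single collective, moving O(P) ints per
// rank, instead of replicating the full nProcs x nProcs size matrix on
// every rank as combineReduce() effectively does.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int nProcs, myRank;
    MPI_Comm_size(MPI_COMM_WORLD, &nProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    // sendCounts[p]: how many items this rank will send to rank p
    // (dummy pattern for illustration only)
    std::vector<int> sendCounts(nProcs), recvCounts(nProcs);
    for (int p = 0; p < nProcs; ++p)
    {
        sendCounts[p] = (p == myRank) ? 0 : myRank + 1;
    }

    // One int per rank pair, exchanged in a single collective
    MPI_Alltoall(sendCounts.data(), 1, MPI_INT,
                 recvCounts.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // recvCounts[p] now holds the number of items rank p will send us,
    // which is all that is needed to post matching non-blocking receives.
    for (int p = 0; p < nProcs; ++p)
    {
        if (p != myRank)
        {
            std::printf("rank %d expects %d items from rank %d\n",
                        myRank, recvCounts[p], p);
        }
    }

    MPI_Finalize();
    return 0;
}
```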
Additional Information: Note that the workaround applies to only one of the two overloads of PstreamBuffers::finishedSends() (the one without the sizes output parameter). We have not investigated in which cases the second overload, which replicates the complete matrix of message sizes on all MPI ranks, is really needed, nor how far it could be optimized by a similar approach using appropriate collective MPI routines directly.
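For reference, the two overloads in question look roughly as follows; this is a sketch based on the 2.2.x sources, and the exact signatures may differ slightly:

```cpp
// Overload 1: each rank only learns the sizes of the messages addressed
// to it -- the case the attached workaround optimises with MPI_Alltoall.
void finishedSends(const bool block = true);

// Overload 2: additionally returns the complete nProcs x nProcs matrix
// of message sizes on every rank, which is inherently O(P^2) data and
// is not touched by the workaround.
void finishedSends(labelListList& sizes, const bool block = true);
```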
Tags: No tags attached.
2014-06-22 16:50
changes.diff (7,320 bytes)
```diff
diff --git a/etc/bashrc b/etc/bashrc
index 0f0ff50..9ce7bc3 100644
--- a/etc/bashrc
+++ b/etc/bashrc
@@ -63,7 +63,7 @@ foamCompiler=system
 
 #- Compiler:
 #    WM_COMPILER = Gcc | Gcc45 | Gcc46 | Gcc47 | Clang | Icc (Intel icc)
-export WM_COMPILER=Gcc
+export WM_COMPILER=Icc
 unset WM_COMPILER_ARCH WM_COMPILER_LIB_ARCH
 
 #- Architecture:
@@ -76,12 +76,12 @@ export WM_PRECISION_OPTION=DP
 
 #- Optimised, debug, profiling:
 #    WM_COMPILE_OPTION = Opt | Debug | Prof
-export WM_COMPILE_OPTION=Opt
+export WM_COMPILE_OPTION=Debug
 
 #- MPI implementation:
 #    WM_MPLIB = SYSTEMOPENMPI | OPENMPI | MPICH | MPICH-GM | HPMPI
 #               | GAMMA | MPI | QSMPI | SGIMPI
-export WM_MPLIB=OPENMPI
+export WM_MPLIB=SGIMPI
 
 #- Operating System:
 #    WM_OSTYPE = POSIX | ???
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H b/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H
index c26e0d7..3d2f3d9 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H
@@ -294,6 +294,14 @@ public:
             const bool block = true
         );
 
+        template<class Container, class T>
+        static void exchange
+        (
+            const List<Container >&,
+            List<Container >&,
+            const int tag = UPstream::msgType(),
+            const bool block = true
+        );
 };
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C b/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C
index 24eec63..35b4170 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C
@@ -88,7 +88,6 @@ void Foam::PstreamBuffers::finishedSends(const bool block)
     (
         sendBuf_,
         recvBuf_,
-        sizes,
         tag_,
         block
     );
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H b/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H
index 1ad4aa9..d8d46bc 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H
@@ -376,7 +376,7 @@ public:
 
         //- Abort program
         static void abort();
-
+        static void alltoall(int* sendCounts, int* recvCounts);
 };
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C b/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C
index 8c0cb4b..62a9419 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C
@@ -38,7 +38,6 @@ namespace Foam
 
 // * * * * * * * * * * * * * * * Member Functions * * * * * * * * * * * * * //
 
-//template<template<class> class ListType, class T>
 template<class Container, class T>
 void Pstream::exchange
 (
@@ -149,6 +148,128 @@ void Pstream::exchange
     recvBufs[Pstream::myProcNo()] = sendBufs[Pstream::myProcNo()];
 }
 
+// My version without sizes argument and alltoall
+//template<template<class> class ListType, class T>
+template<class Container, class T>
+void Pstream::exchange
+(
+    const List<Container>& sendBufs,
+    List<Container>& recvBufs,
+    const int tag,
+    const bool block
+)
+{
+    if (!contiguous<T>())
+    {
+        FatalErrorIn
+        (
+            "Pstream::exchange(..)"
+        )   << "Continuous data only." << Foam::abort(FatalError);
+    }
+
+    if (sendBufs.size() != UPstream::nProcs())
+    {
+        FatalErrorIn
+        (
+            "Pstream::exchange(..)"
+        )   << "Size of list:" << sendBufs.size()
+            << " does not equal the number of processors:"
+            << UPstream::nProcs()
+            << Foam::abort(FatalError);
+    }
+
+    /*sizes.setSize(UPstream::nProcs());
+    labelList& nsTransPs = sizes[UPstream::myProcNo()];
+    nsTransPs.setSize(UPstream::nProcs());*/
+    int nProcs = UPstream::nProcs();
+    int *sendCounts = new int[nProcs];
+    int *recvCounts = new int[nProcs];
+
+    forAll(sendBufs, procI)
+    {
+        //nsTransPs[procI] = sendBufs[procI].size();
+        sendCounts[procI] = sendBufs[procI].size();
+    }
+
+    // Send sizes across. Note: blocks.
+    //combineReduce(sizes, UPstream::listEq(), tag);
+    UPstream::alltoall(sendCounts, recvCounts);
+
+    if (Pstream::parRun())
+    {
+        label startOfRequests = Pstream::nRequests();
+
+        // Set up receives
+        // ~~~~~~~~~~~~~~~
+
+        recvBufs.setSize(sendBufs.size());
+
+        //forAll(sizes, procI)
+        for(int procI = 0; procI < nProcs; procI++)
+        {
+            //label nRecv = sizes[procI][UPstream::myProcNo()];
+            label nRecv = recvCounts[procI];
+
+            if (procI != Pstream::myProcNo() && nRecv > 0)
+            {
+                recvBufs[procI].setSize(nRecv);
+                UIPstream::read
+                (
+                    UPstream::nonBlocking,
+                    procI,
+                    reinterpret_cast<char*>(recvBufs[procI].begin()),
+                    nRecv*sizeof(T),
+                    tag
+                );
+            }
+        }
+
+        // Set up sends
+        // ~~~~~~~~~~~~
+
+        forAll(sendBufs, procI)
+        {
+            if (procI != Pstream::myProcNo() && sendBufs[procI].size() > 0)
+            {
+                if
+                (
+                    !UOPstream::write
+                    (
+                        UPstream::nonBlocking,
+                        procI,
+                        reinterpret_cast<const char*>(sendBufs[procI].begin()),
+                        sendBufs[procI].size()*sizeof(T),
+                        tag
+                    )
+                )
+                {
+                    FatalErrorIn("Pstream::exchange(..)")
+                        << "Cannot send outgoing message. "
+                        << "to:" << procI << " nBytes:"
+                        << label(sendBufs[procI].size()*sizeof(T))
+                        << Foam::abort(FatalError);
+                }
+            }
+        }
+
+        // Wait for all to finish
+        // ~~~~~~~~~~~~~~~~~~~~~~
+
+        if (block)
+        {
+            Pstream::waitRequests(startOfRequests);
+        }
+    }
+
+    delete[] sendCounts;
+    delete[] recvCounts;
+
+    // Do myself
+    recvBufs[Pstream::myProcNo()] = sendBufs[Pstream::myProcNo()];
+}
+
 // * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
diff --git a/src/Pstream/dummy/UPstream.C b/src/Pstream/dummy/UPstream.C
index 183a86c..5a95ac6 100644
--- a/src/Pstream/dummy/UPstream.C
+++ b/src/Pstream/dummy/UPstream.C
@@ -104,5 +104,7 @@ bool Foam::UPstream::finishedRequest(const label i)
     return false;
 }
 
+void Foam::UPstream::alltoall(int* sendCounts, int* recvCounts) {
+}
 
 // ************************************************************************* //
diff --git a/src/Pstream/mpi/UPstream.C b/src/Pstream/mpi/UPstream.C
index bf7af5d..b95d789 100644
--- a/src/Pstream/mpi/UPstream.C
+++ b/src/Pstream/mpi/UPstream.C
@@ -378,5 +378,9 @@ bool Foam::UPstream::finishedRequest(const label i)
     return flag != 0;
 }
 
+void Foam::UPstream::alltoall(int* sendCounts, int* recvCounts) {
+    MPI_Alltoall(sendCounts, 1, MPI_INT, recvCounts, 1, MPI_INT, MPI_COMM_WORLD);
+}
+
 // ************************************************************************* //
```
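For context, the patched code path is reached through the usual PstreamBuffers idiom; here is a caller-side sketch, where the neighbour and data lists are hypothetical names and only the streaming pattern is the standard OpenFOAM one:

```cpp
// Hypothetical caller-side sketch: 'neighbours', 'localData' and
// 'remoteData' are illustrative names, not part of the patch.
PstreamBuffers pBufs(Pstream::nonBlocking);

// Stream outgoing data to each neighbouring processor
forAll(neighbours, i)
{
    UOPstream toNbr(neighbours[i], pBufs);
    toNbr << localData[i];
}

// Size exchange plus the actual transfers; with the patch applied,
// the size exchange inside finishedSends() uses MPI_Alltoall instead
// of combineReduce()
pBufs.finishedSends();

// Stream the received data back in
forAll(neighbours, i)
{
    UIPstream fromNbr(neighbours[i], pBufs);
    fromNbr >> remoteData[i];
}
```

One design point worth noting: UPstream::alltoall() in the patch hard-codes MPI_COMM_WORLD and does not use the tag argument of exchange(). This is harmless for the size exchange, since MPI collectives are untagged, but it ties the new overload to the world communicator.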
(0003137) 2014-06-24 12:12

Thanks for the detailed analysis. Have you tried any other MPI implementations? With the use of MPI_Alltoall, what is the next bottleneck? Mattijs
(0003138) 2014-06-24 13:01 (edited 2014-06-24 13:34)

Dear Mattijs! No, we haven't tried other MPI implementations; we only used bullxmpi 1.1.16 (a derivative of Open MPI). Consider our specific test case, the simple beam example with uniform motion diffusion (corresponding to Figure 5, right, in the whitepaper mentioned above): with MPI_Alltoall used in one of the two overloads of PstreamBuffers::finishedSends(), 40% of the remaining scaling overhead from 2048 to 4096 processes (i.e. the difference between the actual runtime and the theoretical runtime under optimal scaling) is caused by the second (unmodified) overload of PstreamBuffers::finishedSends(), and more precisely again by combineReduce(). I could also send you the corresponding HPCToolkit profiles (~7 MB) via email in case you are interested. Thomas
(0003139) 2014-06-24 14:14

Hi Thomas, that might be interesting. I'm m.janssens at the company email.
(0006098) MattijsJ 2016-04-07 16:48

Resolved in commit 56668b24066c32665026b0613109f8627a7c7789. Many thanks for the analysis. Please re-open if you come across other bottlenecks.
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2014-06-22 16:50 | | New Issue | |
| 2014-06-22 16:50 | | File Added: changes.diff | |
| 2014-06-24 12:12 | | Note Added: 0003137 | |
| 2014-06-24 13:01 | | Note Added: 0003138 | |
| 2014-06-24 13:34 | | Note Edited: 0003138 | |
| 2014-06-24 14:14 | | Note Added: 0003139 | |
| 2016-04-07 16:48 | MattijsJ | Note Added: 0006098 | |
| 2016-04-07 17:44 | henry | Status | new => resolved |
| 2016-04-07 17:44 | henry | Fixed in Version | => dev |
| 2016-04-07 17:44 | henry | Resolution | open => fixed |
| 2016-04-07 17:44 | henry | Assigned To | => henry |