View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0001330 | OpenFOAM | Bug | public | 2014-06-22 16:50 | 2016-04-07 17:44 |

| Field | Value |
|---|---|
| Reporter | |
| Assigned To | henry |
| Priority | normal |
| Severity | minor |
| Reproducibility | have not tried |
| Status | resolved |
| Resolution | fixed |
| Fixed in Version | dev |

Summary: 0001330: Scalability issue: Exchange of message sizes
Description: In the framework of the EU project PRACE (http://www.prace-ri.eu/), we have been working with OpenFOAM 2.2.x to carry out FSI simulations for aircraft designs. In the course of this work we identified a scalability issue caused by the exchange of buffer sizes, which is done in Pstream::exchange() (more precisely, in combineReduce()). Our findings are described in detail in Section 4.4 of the following whitepaper: http://www.prace-ri.eu/IMG/pdf/wp172.pdf

As a workaround, we call MPI_Alltoall() directly instead of combineReduce() for the buffer-size communication, resulting in a significant scalability improvement. Attached is the unified diff for this workaround (based on OpenFOAM 2.2.x, commit 6c399a448855e18d335202434b4304203de65).
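To make the pattern concrete, here is a minimal standalone sketch of the size exchange the workaround performs. This is plain MPI/C++, not taken from OpenFOAM or the attached patch, and the dummy count pattern is purely illustrative:

```cpp
// Each rank tells every other rank how many items it is about to send.
// MPI_Alltoall does this in a single collective, moving O(P) ints per
// rank, instead of replicating the full nProcs x nProcs size matrix on
// every rank as combineReduce() effectively does.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int nProcs, myRank;
    MPI_Comm_size(MPI_COMM_WORLD, &nProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    // sendCounts[p]: how many items this rank will send to rank p
    // (dummy pattern for illustration only)
    std::vector<int> sendCounts(nProcs), recvCounts(nProcs);
    for (int p = 0; p < nProcs; ++p)
    {
        sendCounts[p] = (p == myRank) ? 0 : myRank + 1;
    }

    // One int per rank pair, exchanged in a single collective
    MPI_Alltoall(sendCounts.data(), 1, MPI_INT,
                 recvCounts.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // recvCounts[p] now holds the number of items rank p will send us,
    // which is all that is needed to post matching non-blocking receives.
    for (int p = 0; p < nProcs; ++p)
    {
        if (p != myRank)
        {
            std::printf("rank %d expects %d items from rank %d\n",
                        myRank, recvCounts[p], p);
        }
    }

    MPI_Finalize();
    return 0;
}
```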
Additional Information: Note that the workaround applies to only one of the two overloads of PstreamBuffers::finishedSends() (the one without the sizes output parameter). We have not investigated in which cases the second overload, which replicates the complete matrix of message sizes on all MPI ranks, is really needed, nor how far it could be optimized by a similar approach using appropriate collective MPI routines directly.
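For reference, the two overloads in question look roughly as follows; this is a sketch based on the 2.2.x sources, and the exact signatures may differ slightly:

```cpp
// Overload 1: each rank only learns the sizes of the messages addressed
// to it -- the case the attached workaround optimises with MPI_Alltoall.
void finishedSends(const bool block = true);

// Overload 2: additionally returns the complete nProcs x nProcs matrix
// of message sizes on every rank, which is inherently O(P^2) data and
// is not touched by the workaround.
void finishedSends(labelListList& sizes, const bool block = true);
```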
Tags: No tags attached.
2014-06-22 16:50
changes.diff (7,320 bytes)
```diff
diff --git a/etc/bashrc b/etc/bashrc
index 0f0ff50..9ce7bc3 100644
--- a/etc/bashrc
+++ b/etc/bashrc
@@ -63,7 +63,7 @@ foamCompiler=system
 
 #- Compiler:
 #    WM_COMPILER = Gcc | Gcc45 | Gcc46 | Gcc47 | Clang | Icc (Intel icc)
-export WM_COMPILER=Gcc
+export WM_COMPILER=Icc
 unset WM_COMPILER_ARCH WM_COMPILER_LIB_ARCH
 
 #- Architecture:
@@ -76,12 +76,12 @@ export WM_PRECISION_OPTION=DP
 
 #- Optimised, debug, profiling:
 #    WM_COMPILE_OPTION = Opt | Debug | Prof
-export WM_COMPILE_OPTION=Opt
+export WM_COMPILE_OPTION=Debug
 
 #- MPI implementation:
 #    WM_MPLIB = SYSTEMOPENMPI | OPENMPI | MPICH | MPICH-GM | HPMPI
 #               | GAMMA | MPI | QSMPI | SGIMPI
-export WM_MPLIB=OPENMPI
+export WM_MPLIB=SGIMPI
 
 #- Operating System:
 #    WM_OSTYPE = POSIX | ???
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H b/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H
index c26e0d7..3d2f3d9 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/Pstream.H
@@ -294,6 +294,14 @@ public:
             const bool block = true
         );
 
+        template<class Container, class T>
+        static void exchange
+        (
+            const List<Container >&,
+            List<Container >&,
+            const int tag = UPstream::msgType(),
+            const bool block = true
+        );
 };
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C b/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C
index 24eec63..35b4170 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/PstreamBuffers.C
@@ -88,7 +88,6 @@ void Foam::PstreamBuffers::finishedSends(const bool block)
     (
         sendBuf_,
         recvBuf_,
-        sizes,
         tag_,
         block
     );
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H b/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H
index 1ad4aa9..d8d46bc 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/UPstream.H
@@ -376,7 +376,7 @@ public:
 
         //- Abort program
         static void abort();
-
+        static void alltoall(int* sendCounts, int* recvCounts);
 };
diff --git a/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C b/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C
index 8c0cb4b..62a9419 100644
--- a/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C
+++ b/src/OpenFOAM/db/IOstreams/Pstreams/exchange.C
@@ -38,7 +38,6 @@ namespace Foam
 
 // * * * * * * * * * * * * * * * Member Functions * * * * * * * * * * * * * //
 
-//template<template<class> class ListType, class T>
 template<class Container, class T>
 void Pstream::exchange
 (
@@ -149,6 +148,128 @@ void Pstream::exchange
     recvBufs[Pstream::myProcNo()] = sendBufs[Pstream::myProcNo()];
 }
 
+// My version without sizes argument and alltoall
+//template<template<class> class ListType, class T>
+template<class Container, class T>
+void Pstream::exchange
+(
+    const List<Container>& sendBufs,
+    List<Container>& recvBufs,
+    const int tag,
+    const bool block
+)
+{
+    if (!contiguous<T>())
+    {
+        FatalErrorIn
+        (
+            "Pstream::exchange(..)"
+        )   << "Continuous data only." << Foam::abort(FatalError);
+    }
+
+    if (sendBufs.size() != UPstream::nProcs())
+    {
+        FatalErrorIn
+        (
+            "Pstream::exchange(..)"
+        )   << "Size of list:" << sendBufs.size()
+            << " does not equal the number of processors:"
+            << UPstream::nProcs()
+            << Foam::abort(FatalError);
+    }
+
+    /*sizes.setSize(UPstream::nProcs());
+    labelList& nsTransPs = sizes[UPstream::myProcNo()];
+    nsTransPs.setSize(UPstream::nProcs());*/
+    int nProcs = UPstream::nProcs();
+    int *sendCounts = new int[nProcs];
+    int *recvCounts = new int[nProcs];
+
+    forAll(sendBufs, procI)
+    {
+        //nsTransPs[procI] = sendBufs[procI].size();
+        sendCounts[procI] = sendBufs[procI].size();
+    }
+
+    // Send sizes across. Note: blocks.
+    //combineReduce(sizes, UPstream::listEq(), tag);
+    UPstream::alltoall(sendCounts, recvCounts);
+
+    if (Pstream::parRun())
+    {
+        label startOfRequests = Pstream::nRequests();
+
+        // Set up receives
+        // ~~~~~~~~~~~~~~~
+
+        recvBufs.setSize(sendBufs.size());
+
+        //forAll(sizes, procI)
+        for(int procI = 0; procI < nProcs; procI++)
+        {
+            //label nRecv = sizes[procI][UPstream::myProcNo()];
+            label nRecv = recvCounts[procI];
+
+            if (procI != Pstream::myProcNo() && nRecv > 0)
+            {
+                recvBufs[procI].setSize(nRecv);
+                UIPstream::read
+                (
+                    UPstream::nonBlocking,
+                    procI,
+                    reinterpret_cast<char*>(recvBufs[procI].begin()),
+                    nRecv*sizeof(T),
+                    tag
+                );
+            }
+        }
+
+        // Set up sends
+        // ~~~~~~~~~~~~
+
+        forAll(sendBufs, procI)
+        {
+            if (procI != Pstream::myProcNo() && sendBufs[procI].size() > 0)
+            {
+                if
+                (
+                    !UOPstream::write
+                    (
+                        UPstream::nonBlocking,
+                        procI,
+                        reinterpret_cast<const char*>(sendBufs[procI].begin()),
+                        sendBufs[procI].size()*sizeof(T),
+                        tag
+                    )
+                )
+                {
+                    FatalErrorIn("Pstream::exchange(..)")
+                        << "Cannot send outgoing message. "
+                        << "to:" << procI << " nBytes:"
+                        << label(sendBufs[procI].size()*sizeof(T))
+                        << Foam::abort(FatalError);
+                }
+            }
+        }
+
+        // Wait for all to finish
+        // ~~~~~~~~~~~~~~~~~~~~~~
+
+        if (block)
+        {
+            Pstream::waitRequests(startOfRequests);
+        }
+    }
+
+    delete[] sendCounts;
+    delete[] recvCounts;
+
+    // Do myself
+    recvBufs[Pstream::myProcNo()] = sendBufs[Pstream::myProcNo()];
+}
+
 // * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
diff --git a/src/Pstream/dummy/UPstream.C b/src/Pstream/dummy/UPstream.C
index 183a86c..5a95ac6 100644
--- a/src/Pstream/dummy/UPstream.C
+++ b/src/Pstream/dummy/UPstream.C
@@ -104,5 +104,7 @@ bool Foam::UPstream::finishedRequest(const label i)
     return false;
 }
 
+void Foam::UPstream::alltoall(int* sendCounts, int* recvCounts) {
+}
 
 // ************************************************************************* //
diff --git a/src/Pstream/mpi/UPstream.C b/src/Pstream/mpi/UPstream.C
index bf7af5d..b95d789 100644
--- a/src/Pstream/mpi/UPstream.C
+++ b/src/Pstream/mpi/UPstream.C
@@ -378,5 +378,9 @@ bool Foam::UPstream::finishedRequest(const label i)
     return flag != 0;
 }
 
+void Foam::UPstream::alltoall(int* sendCounts, int* recvCounts) {
+    MPI_Alltoall(sendCounts, 1, MPI_INT, recvCounts, 1, MPI_INT, MPI_COMM_WORLD);
+}
+
 // ************************************************************************* //
```
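For context, the patched code path is reached through the usual PstreamBuffers idiom; here is a caller-side sketch, where the neighbour and data lists are hypothetical names and only the streaming pattern is the standard OpenFOAM one:

```cpp
// Hypothetical caller-side sketch: 'neighbours', 'localData' and
// 'remoteData' are illustrative names, not part of the patch.
PstreamBuffers pBufs(Pstream::nonBlocking);

// Stream outgoing data to each neighbouring processor
forAll(neighbours, i)
{
    UOPstream toNbr(neighbours[i], pBufs);
    toNbr << localData[i];
}

// Size exchange plus the actual transfers; with the patch applied,
// the size exchange inside finishedSends() uses MPI_Alltoall instead
// of combineReduce()
pBufs.finishedSends();

// Stream the received data back in
forAll(neighbours, i)
{
    UIPstream fromNbr(neighbours[i], pBufs);
    fromNbr >> remoteData[i];
}
```

One design point worth noting: UPstream::alltoall() in the patch hard-codes MPI_COMM_WORLD and does not use the tag argument of exchange(). This is harmless for the size exchange, since MPI collectives are untagged, but it ties the new overload to the world communicator.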
(0003137) 2014-06-24 12:12

Thanks for the detailed analysis. Have you tried any other MPI implementations? With the use of MPI_Alltoall, what is the next bottleneck? Mattijs
(0003138) 2014-06-24 13:01 (edited 2014-06-24 13:34)

Dear Mattijs! No, we haven't tried other MPI implementations; we only used bullxmpi 1.1.16 (a derivative of Open MPI). Consider our specific test case, the simple beam example with uniform motion diffusion (corresponding to Figure 5, right, in the whitepaper mentioned above): with MPI_Alltoall used in one of the two overloads of PstreamBuffers::finishedSends(), 40% of the remaining scaling overhead from 2048 to 4096 processes (i.e. the difference between the actual runtime and the theoretical runtime under optimal scaling) is caused by the second (unmodified) overload of PstreamBuffers::finishedSends(), and more precisely again by combineReduce(). I could also send you the corresponding HPCToolkit profiles (~7 MB) via email in case you are interested. Thomas
(0003139) 2014-06-24 14:14

Hi Thomas, that might be interesting. I'm m.janssens at the company email.
(0006098) MattijsJ 2016-04-07 16:48

Resolved in commit 56668b24066c32665026b0613109f8627a7c7789. Many thanks for the analysis. Please re-open if you come across other bottlenecks.
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2014-06-22 16:50 | | New Issue | |
| 2014-06-22 16:50 | | File Added: changes.diff | |
| 2014-06-24 12:12 | | Note Added: 0003137 | |
| 2014-06-24 13:01 | | Note Added: 0003138 | |
| 2014-06-24 13:34 | | Note Edited: 0003138 | |
| 2014-06-24 14:14 | | Note Added: 0003139 | |
| 2016-04-07 16:48 | MattijsJ | Note Added: 0006098 | |
| 2016-04-07 17:44 | henry | Status | new => resolved |
| 2016-04-07 17:44 | henry | Fixed in Version | => dev |
| 2016-04-07 17:44 | henry | Resolution | open => fixed |
| 2016-04-07 17:44 | henry | Assigned To | => henry |