View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0003071||OpenFOAM||Bug||public||2018-09-13 07:07||2018-10-14 12:59|
|Summary||0003071: Segfault bug in OpenMPI, 2.1.3-2.1.5, 3.1-3.1.2|
|Description||Note: This is a bug in OpenMPI, not in OpenFOAM; this report is only meant to spread information. It relates to OpenFOAM indirectly because the default OpenMPI versions in etc/config.sh/mpi point to affected versions in both OpenFOAM-6 and OpenFOAM-dev. (The system OpenMPI in Ubuntu should be fine, at least up to 18.04.)|
OpenMPI versions 2.1.3-2.1.5 and 3.1-3.1.2 contain a race condition in the "vader" shared-memory transport layer that leads to a segfault. Vader is the default transport used within a single node. The bug occurs randomly, but it is most likely to happen during a GAMG solve, and the probability increases with the number of cores in a single shared-memory node. Example stack trace of a crash:
#0 Foam::error::printStack(Foam::Ostream&) at ~/OpenFOAM/OpenFOAM-6/src/OSspecific/POSIX/printStack.C:218
#1 Foam::sigSegv::sigHandler(int) at ~/OpenFOAM/OpenFOAM-6/src/OSspecific/POSIX/signals/sigSegv.C:54
#2 ? in "/lib64/libc.so.6"
#3 ? at btl_vader_component.c:?
#4 opal_progress in "/openmpi-3.1.2/lib64/libopen-pal.so.40"
#5 ompi_request_default_wait in "/openmpi-3.1.2/lib64/libmpi.so.40"
#6 ompi_coll_base_sendrecv_actual in "/openmpi-3.1.2/lib64/libmpi.so.40"
#7 ompi_coll_base_allreduce_intra_recursivedoubling in "/openmpi-3.1.2/lib64/libmpi.so.40"
#8 PMPI_Allreduce in "/openmpi-3.1.2/lib64/libmpi.so.40"
A fix for the bug is now known, but it is not yet available in any OpenMPI release. I can update this bug report when the fix is available. In the meantime, the crashes can be avoided by reverting to OpenMPI 2.1.2/3.0.2 or older, or by forcing the older "sm" shared-memory layer with the mpirun command line option "--mca btl self,sm".
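For anyone scripting their runs, the workaround could be applied automatically. The following is only a minimal sketch: the version ranges are the ones given in this report, the helper names (parse_version, is_affected, mpirun_command) are made up for illustration, and it assumes the version string looks like the output of "mpirun --version", e.g. "mpirun (Open MPI) 3.1.2".

```python
import re

# Affected version ranges as stated in this report (inclusive bounds).
AFFECTED_RANGES = [
    ((2, 1, 3), (2, 1, 5)),
    ((3, 1, 0), (3, 1, 2)),
]


def parse_version(text):
    """Return the first X.Y or X.Y.Z number found in text as a 3-tuple."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    # A missing patch level ("3.1") is treated as patch 0.
    return (int(major), int(minor), int(patch or 0))


def is_affected(version):
    """True if the version tuple lies in one of the reported ranges."""
    return any(low <= version <= high for low, high in AFFECTED_RANGES)


def mpirun_command(version_text, nprocs, solver_args):
    """Build an mpirun argument list, forcing the sm layer if needed."""
    cmd = ["mpirun", "-np", str(nprocs)]
    if is_affected(parse_version(version_text)):
        # Workaround from this report: bypass vader, use the older sm layer.
        cmd += ["--mca", "btl", "self,sm"]
    return cmd + list(solver_args)


if __name__ == "__main__":
    # Hypothetical example: an 8-core simpleFoam run on OpenMPI 3.1.2.
    print(" ".join(mpirun_command("mpirun (Open MPI) 3.1.2", 8,
                                  ["simpleFoam", "-parallel"])))
```

The returned list can be passed directly to subprocess.run. The ranges are kept in one table at the top so they are easy to widen if more versions turn out to be affected.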
|Additional Information||From the middle of https://github.com/open-mpi/ompi/issues/5375|
|Tags||No tags attached.|
I have set the default OpenMPI version to 3.0.2
Resolved in OpenFOAM-6 by commit af7d7f427be78e9b9beb6aceca8fe7d5d4636876
Resolved in OpenFOAM-dev by commit 721b8071227c9f55f176c1c311b139a822cab415
Further testing by Timo has shown that the issues with the new "vader" shared memory module in OpenMPI affect all of the version 3 releases and the later version 2 releases as well.
We have now downgraded the OpenMPI used by default in OpenFOAM-6 and -dev to OpenMPI-2.1.1 which has proved to be very reliable.
|2018-09-13 07:07||tniemi||New Issue|
|2018-09-14 00:04||henry||Assigned To||=> henry|
|2018-09-14 00:04||henry||Status||new => resolved|
|2018-09-14 00:04||henry||Resolution||open => fixed|
|2018-09-14 00:04||henry||Note Added: 0010061|
|2018-09-21 10:01||henry||Note Added: 0010064|
|2018-10-14 12:59||wyldckat||Relationship added||related to 0003089|