View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0003071 | OpenFOAM | Bug | public | 2018-09-13 07:07 | 2018-10-14 12:59 |
Reporter | tniemi | Assigned To | henry | ||
Priority | low | Severity | minor | Reproducibility | sometimes |
Status | resolved | Resolution | fixed | ||
Product Version | dev | ||||
Summary | 0003071: Segfault bug in OpenMPI, 2.1.3-2.1.5, 3.1-3.1.2 | ||||
Description | Note: This is a bug in OpenMPI, not OpenFOAM; this report is only meant to spread the information. It relates to OpenFOAM indirectly because the default OpenMPI versions in etc/config.sh/mpi point to problematic versions in both 6 and dev. (The system OpenMPI in Ubuntu should be OK, at least up to 18.04.)

In OpenMPI versions 2.1.3-2.1.5 and 3.1-3.1.2 there is a race condition bug that affects the "vader" shared memory transport layer and leads to a segfault. Vader is the default transport used within a single node. The bug occurs randomly, but it is most likely to happen during a GAMG solution, and the probability increases with the number of cores in a single shared memory node.

Example stack trace of a crash:
#0 Foam::error::printStack(Foam::Ostream&) at ~/OpenFOAM/OpenFOAM-6/src/OSspecific/POSIX/printStack.C:218
#1 Foam::sigSegv::sigHandler(int) at ~/OpenFOAM/OpenFOAM-6/src/OSspecific/POSIX/signals/sigSegv.C:54
#2 ? in "/lib64/libc.so.6"
#3 ? at btl_vader_component.c:?
#4 opal_progress in "/openmpi-3.1.2/lib64/libopen-pal.so.40"
#5 ompi_request_default_wait in "/openmpi-3.1.2/lib64/libmpi.so.40"
#6 ompi_coll_base_sendrecv_actual in "/openmpi-3.1.2/lib64/libmpi.so.40"
#7 ompi_coll_base_allreduce_intra_recursivedoubling in "/openmpi-3.1.2/lib64/libmpi.so.40"
#8 PMPI_Allreduce in "/openmpi-3.1.2/lib64/libmpi.so.40"

A fix for the bug is now known, but it is not yet available in any OpenMPI release. I can update this bug report when the fix is available. In the meantime, the crashes can be avoided either by reverting to OpenMPI 2.1.2/3.0.2 or older, or by forcing the older "sm" shared memory layer with the mpirun command line option "--mca btl self,sm". | ||||
Additional Information | From the middle of https://github.com/open-mpi/ompi/issues/5375 | ||||
Tags | No tags attached. | ||||
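
As an illustration of the version ranges and the workaround in the description, here is a minimal Python sketch. The helper names (`parse_version`, `is_affected`, `mpirun_command`) and the example solver invocation are hypothetical and not part of OpenFOAM or OpenMPI; the affected ranges are those from the original report (2.1.3-2.1.5 and 3.1-3.1.2), and the only option added is the "--mca btl self,sm" option quoted above.

```python
# Sketch only: helper names are hypothetical, not an OpenFOAM or OpenMPI API.

# Affected ranges as stated in the original report (inclusive bounds).
AFFECTED_RANGES = [
    ((2, 1, 3), (2, 1, 5)),
    ((3, 1, 0), (3, 1, 2)),
]


def parse_version(text):
    """Turn '3.1.2' or '3.1' into a 3-tuple, padding missing parts with 0.

    Plain numeric versions only; suffixes such as 'rc1' are not handled.
    """
    parts = [int(p) for p in text.strip().split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected a version like '3.1.2', got %r" % text)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_affected(version):
    """Return True if the version lies in one of the reported ranges."""
    v = parse_version(version)
    return any(lo <= v <= hi for lo, hi in AFFECTED_RANGES)


def mpirun_command(version, nprocs, solver):
    """Build an mpirun argument list, forcing the 'sm' layer when needed."""
    cmd = ["mpirun"]
    if is_affected(version):
        # Workaround from the report: bypass the "vader" transport.
        cmd += ["--mca", "btl", "self,sm"]
    cmd += ["-np", str(nprocs), solver, "-parallel"]
    return cmd


if __name__ == "__main__":
    # Illustrative invocation; the solver name and core count are assumptions.
    print(" ".join(mpirun_command("3.1.2", 4, "simpleFoam")))
```

Note that the later notes on this issue report a wider set of problematic versions than the ranges encoded here.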
|
Thanks Timo, I have set the default OpenMPI version to 3.0.2.
Resolved in OpenFOAM-6 by commit af7d7f427be78e9b9beb6aceca8fe7d5d4636876
Resolved in OpenFOAM-dev by commit 721b8071227c9f55f176c1c311b139a822cab415 |
|
Further testing by Timo has shown that the issues with the new "vader" shared memory module in OpenMPI affect all of the version 3 releases as well as the later version 2 releases. We have now downgraded the default OpenMPI in OpenFOAM-6 and -dev to OpenMPI-2.1.1, which has proved to be very reliable. |
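
For anyone who wants to check an existing installation against this broader finding, a rough Python sketch follows. It makes two assumptions: that the first line of `mpirun --version` has the form `mpirun (Open MPI) X.Y.Z`, and that "later version 2 releases" means 2.1.3 and newer, as in the original report. The function names are hypothetical.

```python
# Sketch only: names are hypothetical; thresholds are an interpretation of the notes.
import re


def openmpi_version_from_banner(banner):
    """Extract (major, minor, patch) from `mpirun --version` output.

    Assumes a banner such as 'mpirun (Open MPI) 3.0.2'; returns None
    if no such version string is found.
    """
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)), int(match.group(3) or 0))


def is_suspect(version):
    """Return True for versions flagged in this issue's notes.

    All version 3 releases are treated as suspect; for version 2, the
    cut-off of 2.1.3 is an assumption taken from the original report.
    Other major versions are not covered by this issue and return False.
    """
    major = version[0]
    if major == 3:
        return True
    if major == 2:
        return version >= (2, 1, 3)
    return False


if __name__ == "__main__":
    sample = "mpirun (Open MPI) 3.0.2\n"
    version = openmpi_version_from_banner(sample)
    print(version, "suspect" if is_suspect(version) else "not flagged")
```

In practice the banner text could be obtained by running `mpirun --version` and passing its output to `openmpi_version_from_banner`.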
Date Modified | Username | Field | Change |
---|---|---|---|
2018-09-13 07:07 | tniemi | New Issue | |
2018-09-14 00:04 | henry | Assigned To | => henry |
2018-09-14 00:04 | henry | Status | new => resolved |
2018-09-14 00:04 | henry | Resolution | open => fixed |
2018-09-14 00:04 | henry | Note Added: 0010061 | |
2018-09-21 10:01 | henry | Note Added: 0010064 | |
2018-10-14 12:59 | wyldckat | Relationship added | related to 0003089 |