View Issue Details

ID: 0003071
Project: OpenFOAM
Category: [All Projects] Bug
View Status: public
Last Update: 2018-10-14 12:59
Reporter: tniemi
Assigned To: henry
Priority: low
Severity: minor
Reproducibility: sometimes
Status: resolved
Resolution: fixed
Product Version: dev
Fixed in Version:
Summary: 0003071: Segfault bug in OpenMPI, 2.1.3-2.1.5, 3.1-3.1.2
Description: Note: This is a bug in OpenMPI, not OpenFOAM; this report is only meant to spread the information. It relates to OpenFOAM indirectly because the default OpenMPI versions in etc/config.sh/mpi point to problematic versions in both 6 and dev. (The system OpenMPI in Ubuntu releases should be fine, at least up to 18.04.)

OpenMPI versions 2.1.3-2.1.5 and 3.1-3.1.2 contain a race condition bug that affects the "vader" shared memory transport layer and leads to a segfault crash. Vader is the default transport used for communication within a single node. The bug occurs randomly, but it is most likely to appear during the GAMG solution, and the probability increases with the number of cores in a single shared memory node. Example stack trace of a crash:

#0 Foam::error::printStack(Foam::Ostream&) at ~/OpenFOAM/OpenFOAM-6/src/OSspecific/POSIX/printStack.C:218
#1 Foam::sigSegv::sigHandler(int) at ~/OpenFOAM/OpenFOAM-6/src/OSspecific/POSIX/signals/sigSegv.C:54
#2 ? in "/lib64/libc.so.6"
#3 ? at btl_vader_component.c:?
#4 opal_progress in "/openmpi-3.1.2/lib64/libopen-pal.so.40"
#5 ompi_request_default_wait in "/openmpi-3.1.2/lib64/libmpi.so.40"
#6 ompi_coll_base_sendrecv_actual in "/openmpi-3.1.2/lib64/libmpi.so.40"
#7 ompi_coll_base_allreduce_intra_recursivedoubling in "/openmpi-3.1.2/lib64/libmpi.so.40"
#8 PMPI_Allreduce in "/openmpi-3.1.2/lib64/libmpi.so.40"

A fix for the bug is now known, but it is not yet available in any OpenMPI release. I can update this bug report when the fix is available. In the meantime, the crashes can be avoided either by reverting to OpenMPI 2.1.2/3.0.2 or older, or by forcing the older "sm" shared memory layer by adding the "--mca btl self,sm" command line option to mpirun.
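As a rough illustration of the workaround above, the following Python sketch checks whether a given OpenMPI version string falls in the ranges named in this report and, if so, adds the "--mca btl self,sm" option to an mpirun command line. The function names, the range table and the example solver invocation are assumptions made for illustration only; they are not part of OpenFOAM or OpenMPI. The sketch also assumes plain numeric version strings such as "3.1.2" and covers only the ranges listed in the description.

```python
# Minimal sketch, not an official tool: the names below are hypothetical.
# Version ranges (inclusive) are the ones listed in this report's description.
AFFECTED_RANGES = [
    ((2, 1, 3), (2, 1, 5)),
    ((3, 1, 0), (3, 1, 2)),
]


def parse_version(text):
    """Turn a plain numeric version like '3.1.2' or '3.1' into a 3-tuple."""
    parts = [int(p) for p in text.strip().split(".")]
    while len(parts) < 3:
        parts.append(0)  # treat '3.1' as '3.1.0'
    return tuple(parts[:3])


def is_affected(version):
    """Return True if the version lies in one of the reported ranges."""
    v = parse_version(version)
    return any(low <= v <= high for low, high in AFFECTED_RANGES)


def build_mpirun_command(version, nprocs, app_args):
    """Build an mpirun argument list, forcing the 'sm' layer when needed."""
    cmd = ["mpirun"]
    if is_affected(version):
        # Workaround quoted in the report: bypass the vader transport.
        cmd += ["--mca", "btl", "self,sm"]
    cmd += ["-np", str(nprocs)]
    cmd += list(app_args)
    return cmd


if __name__ == "__main__":
    # Illustrative invocation; the solver and core count are arbitrary.
    print(" ".join(build_mpirun_command("3.1.2", 4, ["simpleFoam", "-parallel"])))
```

For an unaffected version such as 2.1.1 the same call returns the command without the extra "--mca" option, so the default transport is left in place.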
Additional Information: From the middle of https://github.com/open-mpi/ompi/issues/5375
Tags: No tags attached.

Relationships

related to 0003089 (resolved, henry): ThirdParty Open-MPI version mentioned in the README.org wasn't updated

Activities

henry

2018-09-14 00:04

manager   ~0010061

Thanks Timo,

I have set the default OpenMPI version to 3.0.2

Resolved in OpenFOAM-6 by commit af7d7f427be78e9b9beb6aceca8fe7d5d4636876
Resolved in OpenFOAM-dev by commit 721b8071227c9f55f176c1c311b139a822cab415

henry

2018-09-21 10:01

manager   ~0010064

Further testing by Timo has shown that the issues with the new "vader" shared memory module in OpenMPI affect all of the version 3 releases and the later version 2 releases as well.
We have now downgraded the OpenMPI used by default in OpenFOAM-6 and -dev to OpenMPI-2.1.1, which has proved to be very reliable.

Issue History

Date Modified Username Field Change
2018-09-13 07:07 tniemi New Issue
2018-09-14 00:04 henry Assigned To => henry
2018-09-14 00:04 henry Status new => resolved
2018-09-14 00:04 henry Resolution open => fixed
2018-09-14 00:04 henry Note Added: 0010061
2018-09-21 10:01 henry Note Added: 0010064
2018-10-14 12:59 wyldckat Relationship added related to 0003089