View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000296 | ThirdParty | Bug | public | 2011-09-19 15:41 | 2011-10-07 09:11 |
| Reporter | andras | Assigned To | user4 | | |
| Priority | high | Severity | major | Reproducibility | always |
| Status | resolved | Resolution | fixed | | |
| Platform | x86_64 | OS | CentOS | OS Version | 5.4u3 |
| Summary | 0000296: mpirun -np NUMPROCS not working | | | | |
| Description | Trying to run a simple tutorial in parallel does not work. Environment: OpenFOAM-2.0.1 (gcc-4.5.1, gmp-5.0.1, mpc-0.8.1, mpfr-2.4.2), openmpi-1.5.3 (configure options += --with-sge). | | | | |
| Steps To Reproduce | Run e.g. icoFoam in parallel (a minimal command sketch follows after this table). | | | | |
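A minimal reproduction sketch, under the assumption that the standard cavity tutorial is used with a decomposeParDict set to four subdomains (matching the processor0-processor3 directories visible in the notes below); the case path is hypothetical:

```sh
cd ~/OpenFOAM/ahorvath-2.0.1/run/cavity   # hypothetical case location
blockMesh                                 # generate the cavity mesh
decomposePar                              # decompose into processor0..processor3
mpirun -np 4 icoFoam -parallel            # the failing step reported here
```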
Additional Information: an error like this is produced (log reflowed for readability):

```
--%<--
[n201:31552] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[n201:31552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ahorvath@n201 cavity]$ [2] FOAM parallel run exiting
[2]
[3]
[3]
[3] --> FOAM FATAL IO ERROR:
[3] incorrect first token, expected <int> or '(', found on line 0 the word 'z'
[3]
[3] file: IOstream at line 0.
[3]
[3]     From function operator>>(Istream&, List<T>&)
[3]     in file /hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1/src/OpenFOAM/lnInclude/ListIO.C at line 149.
[3] FOAM parallel run exiting
[3]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 31555 on
node n201 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
--%<--
```
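For context on the FATAL IO ERROR: OpenFOAM's operator>>(Istream&, List<T>&) expects a list to start with a size integer or '(', exactly as the message says. A hand-written, illustrative fragment of a well-formed list (not taken from the failing case):

```
// Illustrative only: a List serialized as "<int> ( entries )".
4               // size (<int>)
(
    (1 5 4 0)   // entries, e.g. face vertex labels
    (2 6 5 1)
    (3 7 6 2)
    (0 4 7 3)
)
```

Hitting the word 'z' where the size or '(' should be means the parser is seeing something other than a well-formed list; as the notes below establish, the cause here turned out to be a stale build rather than a corrupt case file.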
Tags | No tags attached. | ||||
Notes

(0000659) andras 2011-09-19 16:36
Parallel runs also don't work with openmpi-1.4.3 (the latest stable release). The error messages are the same.
(0000669) 2011-09-23 17:14
Did you try without the --with-sge option? Can you attach the whole output?
(0000670) andras 2011-09-24 09:55
Yes, I first compiled openmpi without "--with-sge". The results were the same for both the latest stable release (1.4.3) and the beta version (1.5.3) that gets installed with OF-2.0.1. Anyhow, for a local run SGE is not invoked.

```
--%<--
[ahorvath@n201] . .bashrc
[ahorvath@n201 ~]$ foam
[ahorvath@n201 OpenFOAM-2.0.1]$ pwd
/hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1
[ahorvath@n201 OpenFOAM-2.0.1]$ which mpirun
~/OpenFOAM/ThirdParty-2.0.1/platforms/linux64Gcc/openmpi-1.4.3/bin/mpirun
[ahorvath@n201 OpenFOAM-2.0.1]$ ldd `which mpirun`
        libopen-rte.so.0 => /hpc_home/ahorvath/OpenFOAM/ThirdParty-2.0.1/platforms/linux64Gcc/openmpi-1.4.3/lib/libopen-rte.so.0 (0x00002b016dc55000)
        libopen-pal.so.0 => /hpc_home/ahorvath/OpenFOAM/ThirdParty-2.0.1/platforms/linux64Gcc/openmpi-1.4.3/lib/libopen-pal.so.0 (0x00002b016dea5000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000000356d400000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003570800000)
        libutil.so.1 => /lib64/libutil.so.1 (0x000000357d200000)
        libm.so.6 => /lib64/libm.so.6 (0x000000356d800000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000000356dc00000)
        libc.so.6 => /lib64/libc.so.6 (0x000000356d000000)
        /lib64/ld-linux-x86-64.so.2 (0x000000356cc00000)
[ahorvath@n201 OpenFOAM-2.0.1]$ run
[ahorvath@n201 run]$ ls
cavity  chtMultiRegionFoam
[ahorvath@n201 run]$ cd cavity/
[ahorvath@n201 cavity]$ ls
0  constant  processor0  processor1  processor2  processor3  system
[ahorvath@n201 cavity]$ mpirun -np 4 icoFoam -parallel 2>&1 > log.iF &
[1] 6969
[ahorvath@n201 cavity]$ [0]
[1]
[1]
[1] --> FOAM FATAL IO ERROR:
[2]
[2]
[2] --> FOAM FATAL IO ERROR:
[2] incorrect first token, expected <int> or '(', found on line 0 the word 'z'
[2]
[3]
[3]
[3] --> FOAM FATAL IO ERROR:
[3] incorrect first token, expected <int> or '(', found on line 0 the word 'z'
[3]
[3] file: [1] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[1]
[1] file: IOstream at line 0.
[1]
[1]
[2] file: IOstream at line 0.
[2]
[2]     From function operator>>(Istream&, List<T>&)
[2]     in file /hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1/src/OpenFOAM/lnInclude/ListIO.CIOstream at line 0.
[3]
[3]     From function operator>>(Istream&, List<T>&)
[3]     in file /hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1/src/OpenFOAM/lnInclude/ListIO.C at line 149.
[3] FOAM parallel run exiting
[3]  at line 149.
[2] FOAM parallel run exiting
[2]     From function IOstream::fatalCheck(const char*) const
[1]     in file db/IOstreams/IOstreams/IOstream.C at line 114.
[1] FOAM parallel run exiting
[1]
[0]
[0] --> FOAM FATAL IO ERROR:
[0] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[0]
[0] file: IOstream at line 0.
[0]
[0]     From function IOstream::fatalCheck(const char*) const
[0]     in file db/IOstreams/IOstreams/IOstream.C at line 114.
[0] FOAM parallel run exiting
[0]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 6972 on
node n201 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[n201:06969] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[n201:06969] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--%<--
```
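One incidental note on the command in this transcript: with `2>&1 > log.iF` the shell duplicates stderr onto the terminal first and only then redirects stdout to the file, which is why all the FOAM errors appear on the console rather than in the log. To capture both streams in the log, the redirections must be reversed:

```sh
mpirun -np 4 icoFoam -parallel > log.iF 2>&1 &   # stdout and stderr both captured in log.iF
mpirun -np 4 icoFoam -parallel 2>&1 > log.iF &   # as typed above: stderr stays on the terminal
```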
(0000679) nsf 2011-10-04 17:28
Has there been any progress on this? I've come across the same issue when running in parallel on SLES10 SP2. The compilation of OpenFOAM-2.0.x is fine (I've seen no errors in the log), but when running in parallel I get the same error as andras. I've tested both scotch and simple as decomposition methods. For the compilation of OpenFOAM I used gcc-4.3.3, gmp-4.2.4, mpfr-2.4.1 and cmake-2.8.2 (currently recompiling OF with cmake-2.8.4). OpenFOAM-1.7.x runs fine when compiled with the above ThirdParty apps.

I turned on the IOobject debug flag (see the fragment after this note for the switch). Here's an excerpt of the output:

```
...
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/neighbour"
 .... read
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/neighbour"
 .... read
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/boundary"
 .... read
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/boundary"
 .... read
```

This is where the crash occurs. From a working version (on Ubuntu 11.04) I can tell that the next file to be read is .../processor0/../system/fvSchemes. Do you have any advice as to what I can test to find where the error lies?

Best Regards
Nicolas
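For anyone reproducing this: the IOobject debug trace above can be switched on via OpenFOAM's DebugSwitches dictionary; the assumed location is the global controlDict under $WM_PROJECT_DIR/etc (a minimal fragment, not a complete file):

```
// Assumed location: $WM_PROJECT_DIR/etc/controlDict
// A value of 1 enables the IOobject::readHeader traces shown above.
DebugSwitches
{
    IOobject        1;
}
```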
(0000680) 2011-10-04 18:09
- Check that boundary file on all processors.
- Set FOAM_ABORT to 1 to cause a traceback at the location of the error (see the sketch after this list).
- Make sure your hostnames and user id are valid words, i.e. they do not start with a number or contain invalid characters.
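A minimal sketch of the FOAM_ABORT suggestion, reusing the solver and rank count from this report:

```sh
# FOAM_ABORT=1 makes OpenFOAM abort() at the error site, yielding a traceback
# (and a core file, if ulimit permits) instead of a clean parallel exit.
export FOAM_ABORT=1
mpirun -np 4 icoFoam -parallel
```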
(0000687) nsf 2011-10-05 17:22
I checked the boundary files and they looked fine. The crash occurred even if I decomposed with OpenFOAM-1.7.x, and I could run the same case just fine with pisoFoam from 1.7.x. Perhaps I should mention that I tested the incompressible pitzDaily tutorial case in parallel.

However, after a thorough clean (`rm -rf .../OpenFOAM-2.0.x/platforms` and `find .../OpenFOAM-2.0.x -name '*.so' -or -name '*.dep' -or -name '*.o' | xargs rm`) and then recompiling OpenFOAM with cmake-2.8.4, I can't reproduce the error. So I'm not sure if I fixed it by cleaning and rebuilding or by switching from cmake-2.8.2 to 2.8.4; I think the former is more likely. Perhaps andras would also benefit from recompiling (and cleaning!) again?

Best Regards
Nicolas
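For reference, the clean-and-rebuild nsf describes, gathered into one sketch; the elided installation path stays a placeholder:

```sh
cd .../OpenFOAM-2.0.x    # placeholder: your OpenFOAM-2.0.x source directory
rm -rf platforms         # remove all built binaries and libraries
find . -name '*.so' -o -name '*.dep' -o -name '*.o' | xargs rm -f   # stray build artefacts
./Allwmake               # full rebuild (as in the resolution note below)
```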
(0000688) 2011-10-05 17:58
It must be the cleanout. Could it be that some files were compiled with a different MPI version?
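One way to test for exactly that, following the ldd check andras already ran on mpirun: inspect what the solver binaries actually link against ($FOAM_APPBIN is the standard OpenFOAM variable for the platform bin directory):

```sh
which mpirun                       # which MPI launcher is on PATH
ldd `which mpirun`                 # libraries the launcher pulls in
ldd $FOAM_APPBIN/icoFoam | grep -i -e mpi -e Pstream   # solver's MPI/Pstream linkage
```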
(0000689) nsf 2011-10-05 18:15
Yes, that is one probable cause. The first time around I tried with the MPI version (1.5.3) that's supplied in ThirdParty. Seeing the error, I switched back to the version (1.4.1) that was supplied with ThirdParty-1.7 and recompiled, but still had the error. When recompiling I didn't clean as thoroughly (meaning not at all; I thought the script would automagically do it for me). I'm still not sure why it didn't work the first time around; I probably messed up some way or another. Unless you have a good reason to upgrade to 1.5.3, I'm content that it works with this version.

/Nicolas
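A hedged sketch of checking which MPI the OpenFOAM environment actually selects; the assumption is that, as in the 2.x line, the MPI flavour is chosen by WM_MPLIB and the bundled version string by FOAM_MPI in etc/config/settings.sh:

```sh
echo $WM_MPLIB $FOAM_MPI   # MPI flavour and version selected by the environment
grep -n FOAM_MPI $WM_PROJECT_DIR/etc/config/settings.sh   # assumed pinning location
```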
(0000695) andras 2011-10-06 21:36
Cleaning everything as described by nsf and running Allwmake again resolved the issue. I am using mpirun 1.4.3 (stable) now. Thanks guys...

Cheers,
Andras
Issue History

| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2011-09-19 15:41 | andras | New Issue | |
| 2011-09-19 16:36 | andras | Note Added: 0000659 | |
| 2011-09-23 17:14 | | Note Added: 0000669 | |
| 2011-09-24 09:55 | andras | Note Added: 0000670 | |
| 2011-10-04 17:28 | nsf | Note Added: 0000679 | |
| 2011-10-04 18:09 | | Note Added: 0000680 | |
| 2011-10-05 17:22 | nsf | Note Added: 0000687 | |
| 2011-10-05 17:58 | | Note Added: 0000688 | |
| 2011-10-05 18:15 | nsf | Note Added: 0000689 | |
| 2011-10-06 21:36 | andras | Note Added: 0000695 | |
| 2011-10-07 09:11 | | Status | new => resolved |
| 2011-10-07 09:11 | | Resolution | open => fixed |
| 2011-10-07 09:11 | | Assigned To | => user4 |