View Issue Details

ID: 0003320
Project: OpenFOAM
Category: Bug
View Status: public
Last Update: 2020-11-21 20:07
Reporter: alpha754293
Assigned To: henry
Priority: normal
Severity: major
Reproducibility: always
Status: closed
Resolution: unable to reproduce
Platform: Linux
OS: CentOS
OS Version: 7.6.1810
Summary: 0003320: Distributed parallel processing doesn't work properly
Description: I have a fresh install of CentOS 7.6.1810 and followed the steps to install OpenFOAM on CentOS 7 here (https://openfoamwiki.net/index.php/Installation/Linux/OpenFOAM-6/CentOS_SL_RHEL#CentOS_7.5_.281804.29).

I also have Infiniband (Mellanox ConnectX-4 dual port VPI 100 Gbps NIC) and also a Mellanox MSB-7890 externally managed 36 port 100 Gbps Infiniband switch.

I've installed the Infiniband modules and opensm is running and the Infiniband devices have an IPv4 address assigned to them (ib0) along with passwordless ssh set up on both nodes.

There were no errors in the log files as a result of executing the copy-and-paste steps for the installation of OpenFOAM per the link above.

I have a NFS share set up between the two nodes so that I can try and execute the Motorbike OpenFOAM benchmark across the two nodes.

Each node has dual Intel gigabit network interfaces (eno1 and eno2).

When I try to run snappyHexMesh over the two nodes, using this command:

[ewen@aes003 run_32]$ time -p /home/ewen/OpenFOAM/ThirdParty-6/platforms/linux64Gcc/openmpi-2.1.1/bin/mpirun -x LD_LIBRARY_PATH -x PATH -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x MPI_BUFFER_SIZE --hostfile ../../machines.txt -np 32 snappyHexMesh -overwrite -parallel 2>&1 | tee mesh.log

this is the error I get:

[aes004][[17070,1],16][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[17070,1],23]

machines.txt looks like this:

[ewen@aes003 run_32]$ cat ../../machines.txt
aes003 cpu=16
aes004 cpu=16
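
As a quick sanity check on a hostfile like this one, the per-host counts can be summed and compared with the value passed to -np. The following is only a minimal sketch: the helper name count_slots is made up for illustration, and it assumes that Open MPI treats "cpu=N" the same way as "slots=N" and that a host without a count provides one slot.

```shell
# Sample hostfile in the same shape as machines.txt above (temporary copy).
hostfile=$(mktemp)
printf '%s\n' 'aes003 cpu=16' 'aes004 cpu=16' > "$hostfile"

# Hypothetical helper: sum the per-host counts in a hostfile.
# A host listed without a count is assumed to provide 1 slot.
count_slots() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }          # skip blank and comment lines
        {
            n = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^(cpu|slots)=[0-9]+$/) { split($i, kv, "="); n = kv[2] }
            total += n
        }
        END { print total + 0 }
    ' "$1"
}

np=32
total=$(count_slots "$hostfile")
if [ "$total" -ge "$np" ]; then
    echo "hostfile provides $total slots for -np $np"
else
    echo "hostfile provides only $total slots, fewer than -np $np" >&2
fi
```

For the two-node file above, the sum matches the 32 processes requested, so a mismatch between the hostfile and -np does not look like the cause here.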

When I run it on a single node, it only takes about 3.5 minutes for snappyHexMesh to complete its task.

I can also run it on a single remote node, but not distributed across the two nodes.

And this is the best-case scenario. Other attempts won't even start at all (e.g. if I don't export all of those environment variables out to the slave nodes).
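
Since the run depends on forwarding that list of environment variables, the "-x" flags can be assembled from a list of names, with a warning for any name that is not set in the launching shell. This is a rough sketch, not part of the original report: the helper name build_export_flags is hypothetical, and the final line only prints the command (a dry run) rather than launching it.

```shell
# Variables that the command above forwards to the remote node.
vars="LD_LIBRARY_PATH PATH WM_PROJECT_DIR WM_PROJECT_INST_DIR WM_OPTIONS FOAM_LIBBIN FOAM_APPBIN MPI_BUFFER_SIZE"

# Hypothetical helper: build the "-x NAME" list for mpirun and warn
# (on stderr) about names that are unset in the current shell.
build_export_flags() {
    flags=""
    for name in "$@"; do
        eval "is_set=\${$name+yes}"
        [ "$is_set" = "yes" ] || echo "warning: $name is not set in this shell" >&2
        flags="$flags -x $name"
    done
    echo "${flags# }"
}

export_flags=$(build_export_flags $vars)

# Dry run: print the command instead of launching it.
echo mpirun $export_flags --hostfile ../../machines.txt -np 32 snappyHexMesh -overwrite -parallel
```

A warning from this check would point to the OpenFOAM environment not being loaded in the shell that starts mpirun, which is one way a distributed run can fail to start at all.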

Thank you.
Steps To Reproduce: See above.
Additional Information: See above.
Tags: No tags attached.

Activities

wyldckat

2019-07-28 20:17

updater   ~0010653

The instructions at openfoamwiki.net are provided by the community that uses OpenFOAM. Those specific instructions (which were actually written by me) even state where questions about them should be asked...
Either way, the problem is that those instructions were designed for getting things up and running, leaving further tweaking to the person taking care of the installation.

There are two possible solutions to your problem:

Solution 1- You can edit the file "ThirdParty-6/Allwmake" and look for the line "Infiniband support" then remove the # from the start of the lines after that one, so that it will also build support for Infiniband. You will then need to delete the folder that is defined in "$MPI_ARCH_PATH" and run Allwmake once again.

Solution 2- The other alternative is to use the system's own Open-MPI installation. Look for the "WM_MPLIB=OPENMPI" entry on the installation instructions and remove from your own calls to the 'bashrc' file.
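
The edit described in Solution 1 amounts to uncommenting the block that follows a marker line. The sketch below tries this on a made-up stand-in file, not on the real "ThirdParty-6/Allwmake": only the "Infiniband support" marker comes from the note above, while the option lines and the helper name uncomment_after_marker are placeholders invented for illustration.

```shell
# Hypothetical stand-in for the relevant part of ThirdParty-6/Allwmake.
# The real file's lines will differ; only the marker text is from the note.
sample=$(mktemp)
cat > "$sample" <<'EOF'
    configOpt="--example-option"
    # Infiniband support
    # exampleOpt="$exampleOpt --example-infiniband-flag"
    # exampleOpt="$exampleOpt --another-example-flag"

    echo "configuring"
EOF

# Hypothetical helper: strip the leading "# " from each commented line that
# directly follows the marker line, stopping at the first non-comment line.
# The marker line itself is left as a comment.
uncomment_after_marker() {
    awk -v marker="$2" '
        active && /^[ \t]*#/ { sub(/# ?/, ""); print; next }
        { active = 0 }
        index($0, marker)    { active = 1 }
        { print }
    ' "$1"
}

# Print the edited text to stdout; the sample file itself is not modified.
uncomment_after_marker "$sample" "Infiniband support"
```

On the real file it is probably safer to make the change by hand in an editor and review it before deleting the folder defined in "$MPI_ARCH_PATH" and running Allwmake again.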

alpha754293

2019-07-29 01:43

reporter   ~0010654

Yes, unfortunately, I no longer have access to cfd-online.com due to an unrelated issue.

I suppose that I can try that.

Thank you.

I'm only raising these points because when I install OpenFOAM in Ubuntu (same system, hardware, etc.), that one works very nicely. But I was hoping to be able to consolidate everything onto CentOS rather than having to multi-boot into different Linux distros based on the application that I am trying to run.

Thank you.

wyldckat

2019-07-29 11:24

updater   ~0010655

My apologies, I forgot to mention that there is indeed a limitation in the 'Allwmake' script in how the compilation of Open-MPI is handled... I'm planning on moving the script code that builds Open-MPI and MPICH out into their own scripts, so that users can have better control over the build options.

Please let us know if building Open-MPI with Infiniband support works for you, or if you go with using the system's Open-MPI.

alpha754293

2019-07-29 14:20

reporter   ~0010656

I can try it both ways and report back on the results of each.

I think that doing so might help the community, as I suspect that I can't possibly be the first person to have encountered this problem with OpenFOAM v6 on CentOS 7.6.1810 with OpenMPI and Infiniband.

(I have four nodes in total so I can split up my micro compute cluster in two smaller clusters with two nodes each so that I can test both methods simultaneously -- so that the previous build of one isn't going to affect the build of the next, which represents more of a bare metal deployment.)

Thank you.

I might need more of your support/assistance as I run through this because I am not an expert in regards to building and compiling OpenFOAM with OpenMPI with Infiniband by any stretch of the imagination, so please do forgive my forthcoming dumb questions as I try this.

Thank you.

wyldckat

2019-07-29 15:58

updater   ~0010657

If you can test both methods, then it can be very useful for the community.

If you edit the file "$HOME/.bashrc", you should find the 'alias' command for 'of6' at the end. Copy-paste that line and give the copy a new name, e.g.:

    alias of6sysompi='source $HOME/OpenFOAM/OpenFOAM-6/etc/bashrc FOAMY_HEX_MESH=yes'

Save and close the file. Then you can start a new terminal and use on the new terminal the new command 'of6sysompi' to start the environment that relies on the system's Open-MPI. Keep in mind that you might need to use a 'module load' command to load the system's Open-MPI.

This is to say that you can use the same installation with either Open-MPI, depending on which alias you activate in each terminal.
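
To illustrate how the two aliases could differ, here is a small sketch. It assumes that the original 'of6' alias passes "WM_MPLIB=OPENMPI" to the bashrc file (as Solution 2 implies) and that the bashrc file falls back to the system's Open-MPI when no such argument is given. The function stub_foam_bashrc is a made-up, greatly simplified stand-in for "OpenFOAM-6/etc/bashrc", not the real file.

```shell
# Hypothetical stand-in for OpenFOAM-6/etc/bashrc: start from a default
# MPI choice, then let NAME=value arguments override it.
stub_foam_bashrc() {
    WM_MPLIB=SYSTEMOPENMPI          # assumed default: the system's Open-MPI
    for arg in "$@"; do
        case "$arg" in
            *=*) export "$arg" ;;   # e.g. WM_MPLIB=OPENMPI, FOAMY_HEX_MESH=yes
        esac
    done
    export WM_MPLIB
}

# What 'of6' is assumed to do (Open-MPI built in ThirdParty-6):
stub_foam_bashrc WM_MPLIB=OPENMPI FOAMY_HEX_MESH=yes
echo "of6        -> WM_MPLIB=$WM_MPLIB"

# What 'of6sysompi' does (no WM_MPLIB argument, so the default stays):
stub_foam_bashrc FOAMY_HEX_MESH=yes
echo "of6sysompi -> WM_MPLIB=$WM_MPLIB"
```

Under those assumptions the two aliases are not identical: the only textual difference is the missing "WM_MPLIB=OPENMPI" argument, and that is what switches the environment between the two Open-MPI installations.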

alpha754293

2019-07-29 17:12

reporter   ~0010658

I'll definitely test both methods.

I think that this could be a symbiotic relationship: if you help me or point me specifically to the parts that I need to edit and how I need to edit them (again, I'm not really an "IT" person, just an end user who is forced to deal with the "IT" stuff by necessity rather than by choice), then I will be able to execute any help that you provide (a la copy-paste) on my micro cluster and feed the results back to this community.

You help me help you, so to speak.

"Then you can start a new terminal and use on the new terminal the new command 'of6sysompi' to start the environment that relies on the system's Open-MPI."

But this would mean that the alias for of6 and of6sysompi would be identical, wouldn't it?

And wouldn't this also mean that the only difference would be whether I execute the 'module load' command or not?

I just want to make sure that I am understanding this sufficiently in regards to how I need to set this up to test it so that when I report the results back, it would be what's expected.

If you don't mind, please treat me like an extreme novice, so that the more explicit the instructions you are able to provide, the better chance it would be that I won't screw it up. :)

Solution 1 - is to enable Infiniband support. I'm not in a position (at work now) to be able to get access to my system -- once I delete the definition for the variable $MPI_ARCH_PATH, I should be able to just run through the installation instructions from beginning to end again, correct? (I am going to do everything all over again, from a clean install, so that I can be sure that there aren't problems with old stuff lingering around and interfering with the new install.)

Solution 2 - "Look for the "WM_MPLIB=OPENMPI" entry on the installation instructions and remove from your own calls to the 'bashrc' file."

Can you expand a little further, for novices such as myself, what you mean by "...from your own calls to the 'bashrc' file"?

Is that related to what you wrote above and how I might have to use 'module load' to load the system's openmpi installation?

My apologies for my dumb questions, which might seem repetitive, but I want to make sure that I will be executing what you want/need me to execute so that when I report the results back, it will make sense to you.

I lack the requisite knowledge and understanding, so your help is greatly appreciated.

Thank you.
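
One way to see which MPI a given terminal has picked up, after activating an alias and optionally running 'module load', is to print the relevant variables and the location of mpirun. This is a minimal sketch: the function name report_mpi_env is made up, and WM_MPLIB and MPI_ARCH_PATH are the variable names mentioned earlier in this thread.

```shell
# Hypothetical helper: report which MPI settings the current shell sees.
report_mpi_env() {
    echo "WM_MPLIB=${WM_MPLIB:-<unset>}"
    echo "MPI_ARCH_PATH=${MPI_ARCH_PATH:-<unset>}"
    if command -v mpirun >/dev/null 2>&1; then
        echo "mpirun=$(command -v mpirun)"
    else
        echo "mpirun=<not on PATH>"
    fi
}

report_mpi_env
```

Running it in two terminals, one per alias, should show whether each terminal really points at a different Open-MPI.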

alpha754293

2019-07-30 02:49

reporter   ~0010659

Darn it - I just lost my post/comment. :(

1) The system-provided Infiniband packages, installed with:

# yum groupinstall 'Infiniband Support'

don't include an ofed directory.

So I'm not sure which files/libraries the build is pointing to/looking for, but if you can help identify some of those files, I can either:

a) try to find those files elsewhere in a fresh, bare-metal CentOS 7.6.1810 installation or

b) install the Mellanox OFED drivers.

2) re: installing with the system provided OpenMPI 1.10.7

Is there a way to set it up so that I won't have to run ThirdParty-6/Allwmake again and just point it to the system OpenMPI installation right from the get go?

The other part where I am confused is that you mentioned deleting the folder that is defined for $MPI_ARCH_PATH, and also WM_MPLIB, so how would it know to use the default OpenMPI, or does that only mean that it won't build OpenMPI 2.1.1?

I can have one of my pairs of nodes run it with building OpenMPI 2.1.1 (after enabling Infiniband support) whilst I can have the other pair of nodes do it with the system OpenMPI 1.10.7.

My apologies, but I got confused at this step and didn't know what to do.

Your help in clarifying that is greatly appreciated because I don't know/understand enough of this to be able to solve this on my own.

Thank you.
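
To see what a node actually provides before deciding between options a) and b), a small probe over candidate locations may help. This is a sketch only: the function name probe_paths is made up, and the paths listed are guesses at where an OFED install or the Infiniband verbs header might live, not locations confirmed anywhere in this thread.

```shell
# Hypothetical helper: report which of the given paths exist on this node.
probe_paths() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            echo "found:   $p"
        else
            echo "missing: $p"
        fi
    done
}

# Candidate locations (assumptions, adjust to whatever the build expects).
probe_paths /usr/local/ofed /usr/local/ofed/lib64 /usr/include/infiniband/verbs.h
```

Running the same probe on each node and comparing the output would also show whether the nodes are set up consistently.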

Issue History

Date Modified Username Field Change
2019-07-27 00:36 alpha754293 New Issue
2019-07-28 20:17 wyldckat Note Added: 0010653
2019-07-29 01:43 alpha754293 Note Added: 0010654
2019-07-29 11:24 wyldckat Note Added: 0010655
2019-07-29 14:20 alpha754293 Note Added: 0010656
2019-07-29 15:58 wyldckat Note Added: 0010657
2019-07-29 17:12 alpha754293 Note Added: 0010658
2019-07-30 02:49 alpha754293 Note Added: 0010659
2020-11-21 20:07 henry Assigned To => henry
2020-11-21 20:07 henry Status new => closed
2020-11-21 20:07 henry Resolution open => unable to reproduce