View Issue Details

ID:              0002461
Project:         OpenFOAM
Category:        Bug
View Status:     public
Last Update:     2017-03-31 17:50
Reporter:        guin
Assigned To:     henry
Priority:        normal
Severity:        crash
Reproducibility: always
Status:          closed
Resolution:      no change required
Platform:        GNU/Linux
OS:              Ubuntu
OS Version:      14.04
Summary:         0002461: Runtime load balancing for dynamicRefineFvMesh: dgraphFold2 error
Description

I have recently ported a minimal version of the load balancing tool developed by Tyler V (tgvosk) to OpenFOAM-4.1. Although it has several limitations, it seems to do what one expects in small-scale parallel simulations. However, from time to time simulations crash with the following error:

[...]
ExecutionTime = 74.88 s ClockTime = 75 s

Courant Number mean: 0.0227096 max: 0.489102
Interface Courant Number mean: 0.0020943 max: 0.489102
deltaT = 0.000833333
Time = 0.203333

PIMPLE: iteration 1
Selected 119 cells for refinement out of 115612.
Refined from 115612 to 116445 cells.
Selected 217 split points out of a possible 10399.
Unrefined from 116445 to 114926 cells.
Maximum imbalance = 27.2854 %
Re-balancing dynamically refined mesh
Selecting decompositionMethod ptscotch
(1): ERROR: dgraphFold2: out of memory (2)
(2): ERROR: dgraphFold2: out of memory (2)
(4): ERROR: dgraphFold2: out of memory (2)
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 199479 on
node nozomi exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

So far I have identified that the crash happens in the decompose(...) method. I have found very little information concerning this error; it appears to be related to ptscotch. A previous bug report mentioned that Scotch 6.0.4 solved the dgraphFold2 errors (https://bugs.openfoam.org/view.php?id=1792), but unfortunately that does not seem to be the case here.

The frequency with which this error appears increases with the number of processes used, but not with the re-balancing frequency when only a few processes are involved.

Although I provide a simple case here to reproduce it, I have noticed similar behaviour with more complex cases and in different environments:

Machine 1: workstation - Ubuntu 16.04 openfoam-4 (repository version)
Machine 2: server - Ubuntu 14.04 openfoam-4 (repository version)
Machine 3: HPC-cluster - CentOS 7.2.1511 OpenFOAM-4.1 (with scotch/6.0.4, compiler intel/17 and intel-MPI/2017)
Steps To Reproduce

1. Compile the attached utility ("dynamicRefineBalancedFvMesh").

2 copy "tutorials/multiphase/interDyMFoam/ras/damBreakWithObstacle" in the user workspace and modify it accordingly (or just use the attached one):

 2.a. Add the following line to system/controlDict in order to load the library:

        libs ( "libdynamicRefineBalancedFvMesh.so" );

 2.b. In constant/dynamicMeshDict:

        //dynamicFvMesh dynamicRefineFvMesh;
          dynamicFvMesh dynamicRefineBalancedFvMesh;

        dynamicRefineFvMeshCoeffs
        {
            // Enable dynamic load balancing (parallel jobs)
            enableBalancing true;

            // Maximum allowable load imbalance among processes (parallel jobs)
            allowableImbalance 0.25;
[...]
            // Stop refinement if maxCells reached
            maxCells 2000000;//200000; (optionally, just to ensure we don't reach this limit)
        }

 2.c. In system/decomposeParDict:

       numberOfSubdomains 16; //6

 2.d. Add a new dictionary file system/balanceParDict with the following relevant entries (a fuller sketch of this file is given after step 2.e):

       numberOfSubdomains 16;

       method ptscotch;

 2.e. Change Allrun to run the case in parallel:

[...]
       #runApplication `getApplication`

       runApplication decomposePar
       runParallel `getApplication`
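
For reference, here is a fuller sketch of the system/balanceParDict file mentioned in step 2.d. Only the numberOfSubdomains and method entries come from the settings above; the FoamFile banner is the standard OpenFOAM dictionary header and is included here only for illustration:

       FoamFile
       {
           version     2.0;
           format      ascii;
           class       dictionary;
           location    "system";
           object      balanceParDict;
       }

       numberOfSubdomains  16;

       method              ptscotch;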

With 16 processes the simulation crashes at around "Time = 0.2" (regardless of the constraints used in balanceParDict). I have tested parallel cases with up to 160 cores and they tend to crash earlier.

PS: When testing this case in a self-compiled version of OF-4.1, the GAMG solver complained about a conflict between the subdomain size and the default parameters.
Additional Information

I am aware of the redistributePar tool, which also handles the redistribution of Lagrangian fields, but as far as I understood it requires the simulation to be stopped and resumed for every re-balancing step. I really think that a runtime alternative would be much appreciated by users. Of course, the presented utility has plenty of room for improvement in terms of limitations, code quality/style, etc. On the other hand, being a standalone utility, it avoids conflicts with other libraries and keeps code maintenance to a minimum.

This utility emits a lot of warning messages of the following type when re-balancing takes place:

    --> FOAM Warning :
        From function fixedValueFvPatchField<Type>::fixedValueFvPatchField
    (
        const fixedValueFvPatchField<Type>&,
        const fvPatch&,
        const DimensionedField<Type, volMesh>&,
        const fvPatchFieldMapper&
    )

According to another bug report (https://bugs.openfoam.org/view.php?id=619), "This is a side effect of the way in which redistribution currently is implemented. The bits get added from all processors so temporarily there are unmapped values and this warning is from that intermediate stage." Is there a way to get rid of them now?
Tags: dynamicRefineFvMesh, load balance, parallel, ptscotch, redistributePar

Activities

guin

2017-02-16 10:55

reporter   ~0007763

In case it becomes relevant:

Bug report from tgvosk concerning a few issues found while implementing the utility (https://bugs.openfoam.org/view.php?id=1203)

GitHub repository of the original code for OF-2.3.x, which had additional functionality for selecting the refinement regions (https://github.com/tgvoskuilen/meshBalancing)

MattijsJ

2017-02-24 15:38

reporter   ~0007819

As a workaround you can use the non-distributed 'scotch' decomposition method instead. For parallel running it will decompose the graph on the master. This of course will only work if the memory on the master processor is large enough to hold the whole graph.
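
In balanceParDict this amounts to switching the decomposition method; a minimal sketch based on the entries shown in the steps above:

    numberOfSubdomains  16;

    // non-distributed variant: the whole graph is decomposed on the master
    method              scotch;    // instead of ptscotch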

guin

2017-02-28 10:43

reporter   ~0007823

Thank you for the prompt response; the suggested workaround does indeed work.

The non-distributed 'scotch' method worked fine up to 2-3 million cells on the machine I tested (Xeon E5 processor family), which is OK for medium-size simulations. However, as you mentioned, this method cannot be used for large-scale simulations due to memory restrictions.

However, I still wonder why we cannot rely on the distributed variant 'ptscotch'. Three possibilities come to mind:

 1. The runtime load balancing utility produces some kind of nonsense. -> I would expect it to produce other kinds of errors as well as more frequent, difficult-to-reproduce crashes. The utility is built as a relatively outer layer on top of 'dynamicRefineFvMesh', and it is difficult (at least for me) to understand why it would then work fine with the non-distributed method.

 2. This may be an issue in OF's interface library for 'ptscotch', namely 'ptscotchDecomp'. -> the most relevant scenario for this thread, since a bug here could potentially affect both current and future features in the OF software. To be honest, I have not yet gone over this piece of code, and I expect debugging it to take a while due to the amount of code depending on it. I noticed that the 'ptscotch' method is used to run 'snappyHexMesh' in parallel; are you aware of similar problems when using it?

 3. The problem arises from the PT-Scotch library itself. -> though this has the same potential effect on OF's code as mentioned above, it should be reported elsewhere ( http://gforge.inria.fr/tracker/?group_id=248 ). The question here would be how to isolate the problem from OF well enough to support the Scotch developers.


Please feel free to close this thread if, in your opinion, we find ourselves in the first scenario (I understand that this is not the place to discuss support-related problems). In any case, it would be worth keeping in mind the benefits of a runtime redistribution alternative for cases that cannot do without runtime AMR.


PS: at the time of writing, the current version of SCOTCH/PT-SCOTCH was 6.0.4.

MattijsJ

2017-03-24 12:58

reporter   ~0007977

The problem is most likely 3. The OpenFOAM stub (ptscotchDecomp) is stateless and only uses basic mesh addressing.

As a workaround, maybe try multiLevel (= multi-pass) with hierarchical as the first level and ptscotch/scotch as the second. The first level will cut down the problem size, so you might not get the ptscotch error.

guin

2017-03-31 12:05

reporter   ~0008000

The proposed multiLevel workaround certainly improves the situation.

Using only hierarchical decompositions allowed the simulations to run until reaching about 9M cells (using 5 refinement levels for the case given above). So far, no matter which other combination of decomposition methods was tried, it was not possible to exceed that point in my single-node tests, so I infer that this limit is hardware-imposed: right before the fatal balancing step was attempted, the memory use appeared to be above 50 GB on a 64 GB machine.

Concerning scotch/ptscotch, combining them with (one or two) preceding hierarchical decomposition levels seems to shift the problem, and I had simulations with 3 and 4 refinement levels running up to about 4M cells. However, for the same setup and just 2 refinement levels (similar to the uploaded case) one again faces the dgraphFold2 error with only about 0.2M cells in the domain. Here is an excerpt from the balanceParDict settings used in these cases:

numberOfSubdomains 16;

method multiLevel;

multiLevelCoeffs
{
    level0
    {
        numberOfSubdomains 2;

        method hierarchical;

        hierarchicalCoeffs
        {
            n (2 1 1);
            delta 0.001;
            order xyz;
        }
    }

    level1
    {
        numberOfSubdomains 8;

        method ptscotch;
    }
}

I am thinking about possible reasons for this (some of which might be nonsense), such as scotch increasing the in-memory size of the mesh due to some sort of introduced element disordering…

Anyhow, at this point it seems clearer that this is not a bug in the OpenFOAM software itself, but rather in the scotch library. I don't know whether keeping this bug thread open would bring any further benefit. Other functionality such as snappyHexMesh is susceptible to similar problems, but from my point of view there is not much more that can be done on this platform to prevent them, apart from passing the relevant information to the Scotch development team (I will do that next week).

henry

2017-03-31 17:50

manager   ~0008007

The issue is with ptscotch; please report it to the maintainers of scotch/ptscotch.

Issue History

Date Modified Username Field Change
2017-02-16 10:45 guin New Issue
2017-02-16 10:45 guin File Added: dynRefBalFvMesh_test_stuff.tar.gz
2017-02-16 10:45 guin Tag Attached: parallel
2017-02-16 10:45 guin Tag Attached: redistributePar
2017-02-16 10:45 guin Tag Attached: dynamicRefineFvMesh
2017-02-16 10:45 guin Tag Attached: load balance
2017-02-16 10:45 guin Tag Attached: ptscotch
2017-02-16 10:55 guin Note Added: 0007763
2017-02-24 15:38 MattijsJ Note Added: 0007819
2017-02-28 10:43 guin Note Added: 0007823
2017-03-24 12:58 MattijsJ Note Added: 0007977
2017-03-31 12:05 guin Note Added: 0008000
2017-03-31 17:49 henry Category Contribution => Bug
2017-03-31 17:50 henry Assigned To => henry
2017-03-31 17:50 henry Status new => closed
2017-03-31 17:50 henry Resolution open => no change required
2017-03-31 17:50 henry Note Added: 0008007