0002461: Runtime load balancing for dynamicRefineFvMesh: dgraphFold2 error - OpenFOAM Issue Tracking

ID	Project	Category	View Status	Date Submitted	Last Update

0002461	OpenFOAM	Bug	public	2017-02-16 10:45	2017-03-31 17:50

Reporter	guin	Assigned To	henry
Priority	normal	Severity	crash	Reproducibility	always
Status	closed	Resolution	no change required
Platform	GNU/Linux	OS	Ubuntu	OS Version	14.04

Summary	0002461: Runtime load balancing for dynamicRefineFvMesh: dgraphFold2 error
Description	I have recently ported a minimal version of the load balancing tool developed by Tyler V (tgvosk) to OpenFOAM-4.1. Despite several limitations, it seems to do what one expects in small-scale parallel simulations. However, from time to time simulations crash with the following error:
    [...]
    ExecutionTime = 74.88 s ClockTime = 75 s
    Courant Number mean: 0.0227096 max: 0.489102
    Interface Courant Number mean: 0.0020943 max: 0.489102
    deltaT = 0.000833333
    Time = 0.203333
    PIMPLE: iteration 1
    Selected 119 cells for refinement out of 115612.
    Refined from 115612 to 116445 cells.
    Selected 217 split points out of a possible 10399.
    Unrefined from 116445 to 114926 cells.
    Maximum imbalance = 27.2854 %
    Re-balancing dynamically refined mesh
    Selecting decompositionMethod ptscotch
    (1): ERROR: dgraphFold2: out of memory (2)
    (2): ERROR: dgraphFold2: out of memory (2)
    (4): ERROR: dgraphFold2: out of memory (2)
    --------------------------------------------------------------------------
    mpirun has exited due to process rank 2 with PID 199479 on node nozomi exiting improperly. There are two reasons this could occur:
    1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
    2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"
    This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
    --------------------------------------------------------------------------
So far I have identified that the crash happens in the decompose(...) method. I have found very little information about this error, and it appears to be related to ptscotch.
In a previous bug report it was mentioned that Scotch-6.0.4 solved dgraphFold2 errors (https://bugs.openfoam.org/view.php?id=1792). Unfortunately that does not seem to be the case here. The error becomes more frequent as the number of processes increases; with only a few processes, re-balancing more often does not make it more frequent.
Although I provide a simple case here to reproduce it, I have noticed similar behavior with more complex cases and in different environments:
    Machine 1: workstation - Ubuntu 16.04, openfoam-4 (repository version)
    Machine 2: server - Ubuntu 14.04, openfoam-4 (repository version)
    Machine 3: HPC-cluster - CentOS 7.2.1511, OpenFOAM-4.1 (with scotch/6.0.4, compiler intel/17 and intel-MPI/2017)
Steps To Reproduce	1. Compile the attached utility ("dynamicRefineBalancedFvMesh").
2. Copy "tutorials/multiphase/interDyMFoam/ras/damBreakWithObstacle" into the user workspace and modify it accordingly (or just use the attached one):
2.a. Add the following line to system/controlDict in order to load the library:
    libs ( "libdynamicRefineBalancedFvMesh.so" );
2.b. In constant/dynamicMeshDict:
    //dynamicFvMesh dynamicRefineFvMesh;
    dynamicFvMesh dynamicRefineBalancedFvMesh;
    dynamicRefineFvMeshCoeffs
    {
        // Enable dynamic load balancing (parallel jobs)
        enableBalancing true;
        // Maximal allowable load imbalance among processes (parallel jobs)
        allowableImbalance 0.25;
        [...]
        // Stop refinement if maxCells reached
        maxCells 2000000;//200000; (optionally, just to ensure we don't reach this limit)
    }
2.c. In system/decomposeParDict:
    numberOfSubdomains 16; //6
2.d. Add a new dictionary file system/balanceParDict with the following relevant entries:
    numberOfSubdomains 16;
    method ptscotch;
2.e. Change Allrun to run the case in parallel:
    [...]
    #runApplication `getApplication`
    runApplication decomposePar
    runParallel `getApplication`
With 16 processes the simulation crashes at around "Time = 0.2" (no matter which constraints are used in balanceParDict). I have tested parallel cases with up to 160 cores and they tend to crash earlier.
PS: When testing this case in a self-compiled version of OF-4.1, the GAMG solver complained about a conflict between the subdomain size and the default parameters.
Additional Information	I am aware of the redistributePar tool, which also handles the redistribution of lagrangian fields, but as far as I understand it requires the simulation to stop and resume for every re-balance step. I really think that a runtime alternative would be much appreciated by users.
Of course, the presented utility has a lot of room for improvement in terms of limitations, code quality/style, etc. On the other hand, being a standalone utility, it avoids conflicts with other libraries and needs only minimal code maintenance.
When re-balancing takes place, this utility throws a lot of warning messages of the following type:
    --> FOAM Warning :
        From function fixedValueFvPatchField<Type>::fixedValueFvPatchField
        (
            const fixedValueFvPatchField<Type>&,
            const fvPatch&,
            const DimensionedField<Type, volMesh>&,
            const fvPatchFieldMapper&
        )
According to another bug report (https://bugs.openfoam.org/view.php?id=619): "This is a side effect of the way in which redistribution currently is implemented. The bits get added from all processors so temporarily there are unmapped values and this warning is from that intermediate stage."
Is there a way to get rid of them now?
Tags	dynamicRefineFvMesh, load balance, parallel, ptscotch, redistributePar

guin 2017-02-16 10:45 reporter	dynRefBalFvMesh_test_stuff.tar.gz (8,551 bytes)

guin 2017-02-16 10:55 reporter ~0007763	In case it becomes relevant:
    Bug report from tgvosk concerning a few issues found when implementing the utility (https://bugs.openfoam.org/view.php?id=1203)
    GitHub site of the original code for OF-2.3.x, which had additional functionality to select the refinement regions (https://github.com/tgvoskuilen/meshBalancing)

MattijsJ 2017-02-24 15:38 reporter ~0007819	As a work-around you can use the non-distributed 'scotch' decomposition method instead. For parallel running it will decompose the graph on the master. This of course will only work if your memory on the master processor is large enough to hold the whole graph.
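A minimal sketch of this work-around, assuming the system/balanceParDict from the steps to reproduce is otherwise left unchanged; only the method entry differs, and the subdomain count of 16 is simply carried over from the report:

```
// system/balanceParDict (sketch; only the relevant entries are shown)
numberOfSubdomains 16;

// Non-distributed Scotch: the whole graph is decomposed on the master
method          scotch;
```

Because the whole graph is held on the master, the master's memory bounds the mesh size that can be re-balanced this way.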

guin 2017-02-28 10:43 reporter ~0007823	Thank you for the prompt response; the suggested work-around does indeed work. The non-distributed 'scotch' worked fine up to 2-3 million cells on the machine I tested (Xeon E5 processor family), which is OK for medium-size simulations. However, as you mentioned, this method cannot be used for large-scale simulations due to CPU memory restrictions.
I still wonder why we cannot rely on the distributed variant 'ptscotch'. Three possibilities come to mind:
1. The runtime load balancing utility produces some kind of nonsense. -> I would expect it to produce other kinds of errors as well as more frequent, difficult-to-reproduce crashes. The utility is built in a relatively outer layer on top of 'dynamicRefineFvMesh', and it is difficult (at least for me) to understand why it works fine with the non-distributed method.
2. This may be an issue in OF's interface library for 'ptscotch', namely 'ptscotchDecompose'. -> This is the most relevant scenario for this thread, since a bug here may potentially affect both current and future features of OF. To be honest, I have not yet gone over this piece of code. I expect debugging it to take a while due to the amount of code depending on it. I noticed that the 'ptscotch' method is used to run 'snappyHexMesh' in parallel; are you aware of similar problems when using it?
3. The problem arises from the PTSCOTCH library itself. -> Though it has the same potential effect on OF's code as mentioned above, this case should be reported somewhere else ( http://gforge.inria.fr/tracker/?group_id=248 ). The question here would be how to isolate the problem from OF well enough to support the Scotch developers.
Please feel free to close this thread if in your opinion we find ourselves in the 1st scenario (I understand that this is not the place to discuss support-related problems).
In any case, it would be worth keeping in mind the benefits of a runtime redistribution alternative for cases that cannot do without runtime AMR.
PS: At the time of writing, the current version of SCOTCH / PTSCOTCH is 6.0.4.

MattijsJ 2017-03-24 12:58 reporter ~0007977	The problem is most likely 3. The OpenFOAM stub (ptscotchDecomp) is stateless and only uses basic mesh addressing. As a workaround maybe try multiLevel (=multi-pass) with hierarchical as first level and ptscotch/scotch as second. The first level will cut down the problem size so you might not get the ptscotch error.

guin 2017-03-31 12:05 reporter ~0008000	The proposed multiLevel workaround certainly improves the situation. Using only hierarchical decompositions allowed the simulations to run until reaching about 9Mcells (using 5 refinements for the case given above). So far, no matter which other combination of decomposition methods was tried, it was not possible to exceed that point in my single-node tests, so I inferred that this limit is hardware-imposed: right before the fatal balancing step was attempted, memory use appeared to be above 50GB on a 64GB machine.
Concerning scotch/ptscotch, combining them with (one or two) preceding hierarchical decompositions seems to shift the problem, and I had simulations running with 3 and 4 refinements up to about 4Mcells. However, for the same setup and just 2 refinements (similar to the uploaded case) one again faces the dgraphFold2 error with only about 0.2Mcells present in the domain. Here is an excerpt from the balanceParDict settings used in these cases:
    numberOfSubdomains 16;
    method multiLevel;
    multiLevelCoeffs
    {
        level0
        {
            numberOfSubdomains 2;
            method hierarchical;
            hierarchicalCoeffs
            {
                n (2 1 1);
                delta 0.001;
                order xyz;
            }
        }
        level1
        {
            numberOfSubdomains 8;
            method ptscotch;
        }
    }
I am thinking about possible reasons for this (some of which might be nonsense), such as scotch extending the size of the mesh in memory due to some sort of introduced element disordering…
Anyhow, at this point it seems clearer that this is not a bug coming from OpenFOAM itself, but from the scotch library instead. I don’t know whether keeping this bug thread open would bring any further benefit. Other functionality such as snappyHexMesh is susceptible to similar problems, but from my point of view there is not much more that can be done on this platform to prevent it, apart from passing the relevant information to the Scotch development team (I will do that next week).

henry 2017-03-31 17:50 manager ~0008007	The issue is with ptscotch, please report to the maintainers of scotch/ptscotch.

Date Modified	Username	Field	Change
2017-02-16 10:45	guin	New Issue
2017-02-16 10:45	guin	File Added: dynRefBalFvMesh_test_stuff.tar.gz
2017-02-16 10:45	guin	Tag Attached: parallel
2017-02-16 10:45	guin	Tag Attached: redistributePar
2017-02-16 10:45	guin	Tag Attached: dynamicRefineFvMesh
2017-02-16 10:45	guin	Tag Attached: load balance
2017-02-16 10:45	guin	Tag Attached: ptscotch
2017-02-16 10:55	guin	Note Added: 0007763
2017-02-24 15:38	MattijsJ	Note Added: 0007819
2017-02-28 10:43	guin	Note Added: 0007823
2017-03-24 12:58	MattijsJ	Note Added: 0007977
2017-03-31 12:05	guin	Note Added: 0008000
2017-03-31 17:49	henry	Category	Contribution => Bug
2017-03-31 17:50	henry	Assigned To	=> henry
2017-03-31 17:50	henry	Status	new => closed
2017-03-31 17:50	henry	Resolution	open => no change required
2017-03-31 17:50	henry	Note Added: 0008007