View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000027 | OpenFOAM | Bug | public | 2010-09-06 12:42 | 2010-09-13 17:39 |
Reporter | Assigned To | ||||
Priority | high | Severity | crash | Reproducibility | sometimes |
Status | resolved | Resolution | fixed | ||
Platform | Intel Nehalem, IB interlink | OS | custom linux | OS Version | ? |
Summary | 0000027: Error during writing of fields in massive parallel simulations | ||||
Description | I am performing LES simulations on a big cluster using ~1000 cores. In some cases, when the fields are written, half of the processor directories are written for time step n and the other half for time step n+1. The issue becomes apparent when trying to use the stored solutions, e.g. for a restart. | ||||
Steps To Reproduce | -Wall clock time per time step is ~0.5s. -Version 1.6 from the git repository from ~Dec 2009 -A Lustre file system is used | ||||
Additional Information | Some modified settings: minBufferSize=300000000 OptimisationSwitches { fileModificationSkew 10; commsType nonBlocking; //scheduled; //blocking; floatTransfer 0; nProcsSimpleSum 0; } | ||||
Tags | Input/output | ||||
|
Could you send system/controlDict and also an 'ls' of the processor directories that shows the problem (so a time directory present in some but not in others)? It might be a time precision issue. - Are you sure that it is not a Lustre problem? - How far apart (in real time) are the dumps? Is it less than e.g. the 10s of fileModificationSkew? |
2010-09-07 13:20
|
|
|
The data you requested is in the attached archive. It contains the controlDict and the output of:
  find processor* -name '0.00741825' >00741825
  find processor* -name '0.0074185' >0074185
In my opinion it is very unlikely that this is a Lustre problem. The dumps are ~30s apart (see below):
ls processor0 -ltr --full-time
  drwxr-xr-x 3 4096 2010-09-01 18:58:10.000000000 +0200 constant
  drwxr-xr-x 2 4096 2010-09-01 19:04:51.000000000 +0200 0
  drwx------ 3 4096 2010-09-03 01:10:34.000000000 +0200 0.0026285
  drwx------ 3 4096 2010-09-03 04:00:10.000000000 +0200 0.004997
  drwx------ 3 4096 2010-09-03 06:50:03.000000000 +0200 0.0074185
ls processor1 -ltr --full-time
  drwxr-xr-x 3 4096 2010-09-01 18:58:11.000000000 +0200 constant
  drwxr-xr-x 2 4096 2010-09-01 19:04:52.000000000 +0200 0
  drwx------ 3 4096 2010-09-03 01:08:57.000000000 +0200 0.00262825
  drwx------ 3 4096 2010-09-03 03:58:53.000000000 +0200 0.00499675
  drwx------ 3 4096 2010-09-03 06:49:34.000000000 +0200 0.00741825 |
|
My guess is that the problem is the 'writeControl clockTime;' setting, which checks the time since the start of the run. Each processor measures this on its own, so different processors might occasionally reach different decisions about whether to write at a given time step. As a workaround you might want to use one of the other writeControl modes. |
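To make the suspected mechanism concrete, here is a minimal standalone C++ sketch. It is not OpenFOAM source: the helper names writeIndex and ranksAgree, and the interval values used with them, are hypothetical. It assumes each rank derives a write index by dividing its own elapsed wall-clock time by the write interval, so two ranks whose clocks straddle an interval boundary end up with different indices.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not OpenFOAM source): the write index a rank would
// derive from its own elapsed wall-clock time, in the spirit of
// 'writeControl clockTime;'.
inline long writeIndex(double elapsedSeconds, double writeIntervalSeconds)
{
    assert(writeIntervalSeconds > 0);
    return static_cast<long>(elapsedSeconds / writeIntervalSeconds);
}

// Returns true if every rank derives the same write index, i.e. all ranks
// would take the same write decision at this time step.
inline bool ranksAgree
(
    const std::vector<double>& elapsedPerRank,
    double writeIntervalSeconds
)
{
    for (double elapsed : elapsedPerRank)
    {
        if
        (
            writeIndex(elapsed, writeIntervalSeconds)
         != writeIndex(elapsedPerRank.front(), writeIntervalSeconds)
        )
        {
            return false;
        }
    }
    return true;
}
```

With an assumed interval of 3600 s, ranks reporting 3599.9 s and 3600.1 s disagree: the first would not write at this step and the second would, which matches the pattern of one half of the directories holding step n and the other half step n+1.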
|
Added a reduce of elapsedCpuTime and elapsedClockTime before they are used, so that all processors base the write decision on the same value. Commit 4160af412a8df26be1a7c284000f888f9dbe0c89 |
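The idea behind this fix can be sketched in standalone C++. This is an illustration only, not the code from the commit: reduceMax and agreedWriteIndex are hypothetical names, the vector stands in for values gathered from all ranks, and the choice of a maximum as the reduction operator is an assumption, since the note does not say which operator is used.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for a parallel reduction (assumption: max). After a real
// reduction every rank would hold this same value.
inline double reduceMax(const std::vector<double>& elapsedPerRank)
{
    assert(!elapsedPerRank.empty());
    return *std::max_element(elapsedPerRank.begin(), elapsedPerRank.end());
}

// Hypothetical helper: the write index computed from the reduced elapsed
// time, so it is identical on all ranks by construction.
inline long agreedWriteIndex
(
    const std::vector<double>& elapsedPerRank,
    double writeIntervalSeconds
)
{
    assert(writeIntervalSeconds > 0);
    return static_cast<long>
    (
        reduceMax(elapsedPerRank) / writeIntervalSeconds
    );
}
```

Because every rank divides the same reduced time by the same interval, ranks at 3599.9 s and 3600.1 s with an assumed 3600 s interval both obtain index 1 and write the same time step.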
Date Modified | Username | Field | Change |
---|---|---|---|
2010-09-06 12:42 |
|
New Issue | |
2010-09-07 09:24 |
|
Assigned To | => user4 |
2010-09-07 09:24 |
|
Status | new => assigned |
2010-09-07 10:49 |
|
Note Added: 0000026 | |
2010-09-07 13:20 |
|
File Added: writeIssue.tar.gz | |
2010-09-07 13:33 |
|
Note Added: 0000027 | |
2010-09-07 13:53 |
|
Note Added: 0000028 | |
2010-09-07 16:00 |
|
Note Added: 0000029 | |
2010-09-07 16:00 |
|
Status | assigned => resolved |
2010-09-07 16:00 |
|
Fixed in Version | => 1.7.x |
2010-09-07 16:00 |
|
Resolution | open => fixed |
2010-09-13 17:39 |
|
Tag Attached: Input/output |