View Issue Details

ID: 0000027
Project: OpenFOAM
Category: Bug
View Status: public
Last Update: 2010-09-13 17:39
Reporter: user28
Assigned To: user4
Priority: high
Severity: crash
Reproducibility: sometimes
Status: resolved
Resolution: fixed
Platform: Intel Nehalem, IB interlink
OS: custom linux
OS Version: ?
Summary: 0000027: Error during writing of fields in massive parallel simulations
Description: I am performing LES simulations on a big cluster using ~1000 cores. In some cases, when the fields are written, half of the processor directories are written for time step n and the other half for time step n+1. The issue becomes apparent when the stored solutions are used, e.g. for a restart.



Steps To Reproduce:
- Wall-clock time per time step is ~0.5 s.
- Version 1.6 from the git repository, from ~Dec 2009.
- A Lustre file system is used.
Additional Information: Some modified settings:

minBufferSize=300000000

OptimisationSwitches
{
fileModificationSkew 10;
commsType nonBlocking; //scheduled; //blocking;
floatTransfer 0;
nProcsSimpleSum 0;
}
Tags: Input/output

Activities

user4

2010-09-07 10:49

  ~0000026

Could you send system/controlDict and also an 'ls' of the processor directories that shows the problem (i.e. a time directory present in some but not in others)? It might be a time precision issue.

- Are you sure that it is not a Lustre problem?
- How far apart (in real time) are the dumps? Is it less than e.g. the 10 s of fileModificationSkew?

user28

2010-09-07 13:20

 

writeIssue.tar.gz (3,009 bytes)

user28

2010-09-07 13:33

  ~0000027

The data you requested is in the attached archive. It contains the output of:
find processor* -name '0.00741825' >00741825
find processor* -name '0.0074185' >0074185
and the controlDict.

It is in my opinion very unlikely that this is a Lustre problem.

The dumps are ~30 s apart (see below):

ls processor0 -ltr --full-time

drwxr-xr-x 3 4096 2010-09-01 18:58:10.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:51.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:10:34.000000000 +0200 0.0026285
drwx------ 3 4096 2010-09-03 04:00:10.000000000 +0200 0.004997
drwx------ 3 4096 2010-09-03 06:50:03.000000000 +0200 0.0074185

ls processor1 -ltr --full-time

drwxr-xr-x 3 4096 2010-09-01 18:58:11.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:52.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:08:57.000000000 +0200 0.00262825
drwx------ 3 4096 2010-09-03 03:58:53.000000000 +0200 0.00499675
drwx------ 3 4096 2010-09-03 06:49:34.000000000 +0200 0.00741825
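
The mismatch in the two listings above can also be checked mechanically. The following is only a minimal sketch: `entriesNotInAll` is a hypothetical helper (not part of OpenFOAM), and it assumes the directory names of each processor directory have already been collected into one set per processor.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: given one directory listing per processor, return the
// entry names that appear in at least one listing but not in all of them.
std::set<std::string> entriesNotInAll(
    const std::vector<std::set<std::string>>& listings)
{
    assert(!listings.empty());

    // Count in how many listings each name occurs.
    std::map<std::string, std::size_t> count;
    for (const auto& listing : listings)
    {
        for (const auto& name : listing)
        {
            ++count[name];
        }
    }

    // Keep only the names that are missing from at least one listing.
    std::set<std::string> result;
    for (const auto& entry : count)
    {
        if (entry.second != listings.size())
        {
            result.insert(entry.first);
        }
    }
    return result;
}
```

Fed with the names from the processor0 and processor1 listings, this reports all six time directories (0.0026285, 0.004997, 0.0074185, 0.00262825, 0.00499675, 0.00741825) as inconsistent, while constant and 0 are present in both.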

user4

2010-09-07 13:53

  ~0000028

My guess is that the problem is the

 writeControl clockTime;

which checks the wall-clock time since the start of the run; different processors might occasionally reach different decisions about whether to write. As a workaround you might want to use one of the other writeControl modes.
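
How the processors can disagree is easiest to see in a toy model. The sketch below is not the OpenFOAM source: `writeIndex` and `localDecisions` are hypothetical names, the ranks are simulated with a plain vector instead of MPI, and it assumes that a write is triggered whenever the elapsed time enters a new write interval.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: index of the write interval that a given elapsed
// wall-clock time falls into.
long writeIndex(double elapsedSeconds, double writeInterval)
{
    assert(writeInterval > 0);
    return static_cast<long>(elapsedSeconds / writeInterval);
}

// Each simulated rank decides from its own clock whether to write now.
// Returns one flag per rank.
std::vector<bool> localDecisions(
    const std::vector<double>& elapsedPerRank,
    double writeInterval,
    long lastWrittenIndex)
{
    std::vector<bool> decisions;
    for (double elapsed : elapsedPerRank)
    {
        decisions.push_back(
            writeIndex(elapsed, writeInterval) > lastWrittenIndex);
    }
    return decisions;
}
```

With an assumed interval of 10000 s and two ranks whose clocks read 9999.8 s and 10000.1 s at the same time step, the first rank decides not to write and the second decides to write, so the first rank writes one time step later.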

user4

2010-09-07 16:00

  ~0000029

Added a reduce of elapsedCpuTime and elapsedClockTime before using them.

commit 4160af412a8df26be1a7c284000f888f9dbe0c89
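
The idea behind the fix can be sketched as follows. This is an illustration only, not the code of the commit: the ranks are simulated with a vector instead of MPI, `reduceMax` and `reducedDecisions` are hypothetical names, and the choice of a maximum as the reduction operation is an assumption (the note above only says that a reduce was added).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for a parallel reduction: every rank ends up with the same value.
// Taking the maximum is an assumption made for this sketch.
double reduceMax(const std::vector<double>& elapsedPerRank)
{
    assert(!elapsedPerRank.empty());
    return *std::max_element(elapsedPerRank.begin(), elapsedPerRank.end());
}

// All simulated ranks decide from the reduced elapsed time, so the returned
// flags are identical on every rank.
std::vector<bool> reducedDecisions(
    const std::vector<double>& elapsedPerRank,
    double writeInterval,
    long lastWrittenIndex)
{
    assert(writeInterval > 0);
    const double agreed = reduceMax(elapsedPerRank);
    const bool write =
        static_cast<long>(agreed / writeInterval) > lastWrittenIndex;
    return std::vector<bool>(elapsedPerRank.size(), write);
}
```

Because every rank evaluates the same reduced value, clocks reading 9999.8 s and 10000.1 s against an assumed 10000 s interval now lead both ranks to write at the same time step.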

Issue History

Date Modified Username Field Change
2010-09-06 12:42 user28 New Issue
2010-09-07 09:24 user2 Assigned To => user4
2010-09-07 09:24 user2 Status new => assigned
2010-09-07 10:49 user4 Note Added: 0000026
2010-09-07 13:20 user28 File Added: writeIssue.tar.gz
2010-09-07 13:33 user28 Note Added: 0000027
2010-09-07 13:53 user4 Note Added: 0000028
2010-09-07 16:00 user4 Note Added: 0000029
2010-09-07 16:00 user4 Status assigned => resolved
2010-09-07 16:00 user4 Fixed in Version => 1.7.x
2010-09-07 16:00 user4 Resolution open => fixed
2010-09-13 17:39 user2 Tag Attached: Input/output