[Pw_forum] Use of pool

Huiqun Zhou hqzhou at nju.edu.cn
Tue Mar 10 08:11:01 CET 2009


Axel and list-users,

I'm terribly sorry for my delayed response. I want to thank Axel for his
thorough investigation, to-the-point analysis and detailed report; every
bit of this experience will benefit us in our future computational work.

As I said, the test machine I used is an AMD box with two quad-core
Shanghai processors (Opteron 23xx, 2.3 GHz), which have only 1 MB of L2
cache. According to your findings, the test case may exhaust the CPU
cache more quickly there than on Intel processors (usually 4-6 MB of L2
cache).

BTW, the input file I sent you is exactly the same one I used here.
AMD's Shanghai and Intel's Nehalem, on which I have run several tests,
are much better than their previous generations. It seems Axel needs to
connect more powerful machines to his high-end InfiniBand network (^o^)

Thanks again, Axel!

Dr. Huiqun Zhou
@Earth Sciences, Nanjing University, China


----- Original Message ----- 
From: "Axel Kohlmeyer" <akohlmey at cmm.chem.upenn.edu>
To: "PWSCF Forum" <pw_forum at pwscf.org>
Sent: Wednesday, March 04, 2009 10:51 AM
Subject: Re: [Pw_forum] Use of pool


> On Tue, Feb 24, 2009 at 1:45 AM, Huiqun Zhou <hqzhou at nju.edu.cn> wrote:
>> Dear list users:
>
> hi all,
>
>> I happened to test the run times of the system I'm investigating
>> against the number of pools used. There are 36 k-points in total.
>> But the results surprised me quite a lot.
>>
>> no pool:  6m21.02s CPU time,     6m45.88s wall time
>> 2 pools:  7m19.39s CPU time,     7m38.99s wall time
>> 4 pools: 11m59.09s CPU time,    12m14.66s wall time
>> 8 pools: 21m28.77s CPU time,    21m38.71s wall time
>>
>> The machine I'm using is an AMD box with two quad-core Shanghai processors.
>>
>> Is my understanding of usage of pool wrong?
>
> sorry for replying to an old mail in this thread, but it has the proper
> times to compare to. the input you sent me does not seem to be exactly
> the same as the one you used for the benchmarks (rather a bit larger),
> but i reduced the number of k-points to yield 36 and have some numbers
> here. this is on dual intel quad core E5430 @ 2.66GHz cpus with 8GB DDR2
> ram. i also modified the input to set wfcdir to the local scratch rather
> than my working directory (which is on an NFS server), and i test with
> disk_io='high' and 'low'.
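>
> roughly, the relevant part of the &CONTROL namelist looked like this
> (a sketch only: the path is a placeholder and the rest of the namelist
> is whatever your job uses):
>
>   &CONTROL
>     ...
>     disk_io    = 'low'        ! or 'high' for the other set of runs
>     wfcdir     = '/scratch/'  ! node-local scratch instead of the NFS workdir
>     wf_collect = .true.
>   /
>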
> on a single node (always with 8 MPI tasks) i get:
>
> 1node-1pools-high.out:     PWSCF        : 18m55.62s CPU time,    26m 7.20s wall time
> 1node-2pools-high.out:     PWSCF        : 14m46.03s CPU time,    18m 0.26s wall time
> 1node-4pools-high.out:     PWSCF        : 14m 5.27s CPU time,    16m44.03s wall time
> 1node-8pools-high.out:     PWSCF        : 32m29.71s CPU time,    35m 0.35s wall time
>
> 1node-1pools-low.out:      PWSCF        : 18m36.88s CPU time,    19m24.71s wall time
> 1node-2pools-low.out:      PWSCF        : 15m 0.98s CPU time,    15m42.56s wall time
> 1node-4pools-low.out:      PWSCF        : 14m 6.97s CPU time,    14m55.57s wall time
> 1node-8pools-low.out:      PWSCF        : 31m51.68s CPU time,    32m46.77s wall time
>
> so the result is not quite as drastic, but with 8 pools on the node the
> machine is suffering. one can also see that disk_io='low' helps to
> reduce the waiting time (disk_io='high' still writes files into the
> working directory, which is on slow NFS). so for my machine it looks as
> if 4 pools is the optimal compromise. to further investigate whether
> pools or g-space parallelization is more efficient, i then started to
> run the same job across multiple nodes; this uses only 4 cores per
> node, i.e. the total number of mpi tasks is still 8.
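>
> for reference, the number of pools is selected on the pw.x command
> line. a multi-node run like the ones above would be launched roughly
> like this (a sketch with open-mpi-style placement options; the input
> file name and the exact mpirun flags depend on your mpi library and
> batch system):
>
>   mpirun -np 8 -npernode 4 pw.x -npool 4 < scf.in > 2node-4pools-low.out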
>
> 2node-1pools-high.out:     PWSCF        : 12m 0.88s CPU time,    17m42.01s wall time
> 2node-2pools-high.out:     PWSCF        :  8m42.96s CPU time,    11m44.88s wall time
> 2node-4pools-high.out:     PWSCF        :  6m26.72s CPU time,     8m54.83s wall time
> 2node-8pools-high.out:     PWSCF        : 12m47.61s CPU time,    15m18.67s wall time
>
> 2node-1pools-low.out:      PWSCF        : 10m53.87s CPU time,    11m35.94s wall time
> 2node-2pools-low.out:      PWSCF        :  8m37.37s CPU time,     9m23.17s wall time
> 2node-4pools-low.out:      PWSCF        :  6m22.87s CPU time,     7m11.22s wall time
> 2node-8pools-low.out:      PWSCF        : 13m 7.30s CPU time,    13m57.71s wall time
>
> in the next test, i doubled the number of nodes again but kept 4 mpi
> tasks per node (so 16 mpi tasks in total); from here on i'm only using
> disk_io='low'.
>
> 4node-4pools-low.out:      PWSCF        :  4m52.92s CPU time,     5m38.90s wall time
> 4node-8pools-low.out:      PWSCF        :  4m29.73s CPU time,     5m17.86s wall time
>
> interesting, now the striking difference between 4 pools and 8 pools is
> gone. since i doubled the number of nodes, the memory consumption per
> mpi task in the 8 pools case should have dropped to a similar level as
> in the 4 pools case with 2 nodes. to confirm this, let's run the same
> job with 16 pools:
>
> 4node-16pools-low.out:     PWSCF        : 10m54.57s CPU time,    11m53.59s wall time
>
> bingo! the only explanation for this is cache memory. so in this
> specific case, up to about "half a wavefunction" of memory consumption
> per node, the caching of the cpu is much more effective. so the "more
> pools is better" rule has to be augmented by "unless it makes the cpu
> cache less efficient".
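>
> as a rough sketch of the bookkeeping (counting only the k-point and
> g-space distribution and ignoring everything else pw.x keeps in
> memory), the fraction of a single wavefunction each task has to hold is
> 1 / (tasks per pool) = pools / (total mpi tasks):
>
>   1 node,   8 tasks,  8 pools -> 1 task/pool  -> full wavefunctions per task
>   2 nodes,  8 tasks,  4 pools -> 2 tasks/pool -> half a wavefunction per task
>   4 nodes, 16 tasks,  8 pools -> 2 tasks/pool -> half a wavefunction per task
>   4 nodes, 16 tasks, 16 pools -> 1 task/pool  -> full wavefunctions again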
>
> since 36 k-points is evenly divisible by 6 but not by 8, now a test
> with 6 nodes.
>
> 6node-4pools-low.out:      PWSCF        :  3m41.65s CPU time,     4m25.15s wall time
> 6node-6pools-low.out:      PWSCF        :  3m40.12s CPU time,     4m23.33s wall time
> 6node-8pools-low.out:      PWSCF        :  3m14.13s CPU time,     3m57.76s wall time
> 6node-12pools-low.out:     PWSCF        :  3m37.96s CPU time,     4m25.91s wall time
> 6node-24pools-low.out:     PWSCF        : 10m55.18s CPU time,    11m47.87s wall time
>
> so 6 pools is more efficient than 4, and 8 is even more efficient than
> 6 or 12, although the latter two divide the 36 k-points evenly and thus
> should lead to a better distribution of the work. so the modified
> "rule" from above seems to hold. ok, can we get any faster? ~4 min
> walltime for a 21-scf-cycle single point run is already pretty good,
> and the serial overhead (and wf_collect=.true.) should start to kick
> in. so now with 8 nodes and 32 mpi tasks.
>
> 8node-4pools-low.out:      PWSCF        :  3m22.02s CPU time,     4m 7.06s wall time
> 8node-8pools-low.out:      PWSCF        :  3m14.52s CPU time,     3m58.86s wall time
> 8node-16pools-low.out:     PWSCF        :  3m36.18s CPU time,     4m24.21s wall time
>
> hmmm, not much better, but now for the final test. since we have 36
> k-points and we need at least two mpi tasks per pool to get good
> performance, let's try 18 nodes with 4 mpi tasks each:
>
> 18node-9pools-low.out:      PWSCF        :  1m57.06s CPU time,     3m37.31s wall time
> 18node-18pools-low.out:     PWSCF        :  2m 2.62s CPU time,     2m45.51s wall time
> 18node-36pools-low.out:     PWSCF        :  2m45.61s CPU time,     3m33.00s wall time
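>
> for the record, the distribution works out as follows (18 nodes x 4
> tasks = 72 mpi tasks in total), so every case keeps at least two tasks
> per pool:
>
>    9 pools -> 8 tasks/pool, 4 k-points/pool
>   18 pools -> 4 tasks/pool, 2 k-points/pool
>   36 pools -> 2 tasks/pool, 1 k-point/pool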
>
> not spectacular scaling, but still improving. it looks like writing the
> final wavefunction costs about 45 seconds or more, as indicated by the
> difference between cpu time and wall time.
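>
> to put rough numbers on that gap (part of it is other serial overhead,
> so take it as an upper estimate): 2m45.51s - 2m 2.62s ~ 43 s with 18
> pools, 3m33.00s - 2m45.61s ~ 47 s with 36 pools, and about 100 s with
> 9 pools.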
>
> at this level you'd better not use disk_io='high', as that will put a
> _severe_ disk load on the machine that hosts the working directory
> (particularly bad for NFS servers); in this case the code will generate
> and continuously rewrite 144 files... and the walltime to cputime ratio
> quickly rises (a factor of 5 in my case, so i stopped the job before
> the NFS server would die).
>
> in summary, it is obviously getting more complicated to define a "rule"
> for what gives the best performance. some experimentation is always
> required, and sometimes there will be surprises. i have not touched the
> issue of network speed (all tests were done across a 4xDDR infiniband
> network).
>
> i hope this little benchmark excursion was as interesting and
> thought-provoking for you as it was for me. thanks to everybody who
> gave their input to this discussion.
>
> cheers,
>   axel.
>
> p.s.: perhaps at some point it might be interesting to organize a
> workshop on "post-compilation optimization" for pw.x for different
> types of jobs and hardware.
>
>> Huiqun Zhou
>> @Nanjing University, China
>> _______________________________________________
>> Pw_forum mailing list
>> Pw_forum at pwscf.org
>> http://www.democritos.it/mailman/listinfo/pw_forum
>>
>>
>
>
>
> -- 
> =======================================================================
> Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
>  Center for Molecular Modeling   --   University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
> _______________________________________________
> Pw_forum mailing list
> Pw_forum at pwscf.org
> http://www.democritos.it/mailman/listinfo/pw_forum
> 


