[Pw_forum] why my pw.x run with low efficiency?

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Mon Sep 22 06:42:03 CEST 2008


On Mon, 22 Sep 2008, vega wrote:

VL> Dear sir,
VL> 
VL> Thank you so much for your responding. I do appreciate your help.
VL> 
VL> > than the wall time. here it looks as if the jobs is either
VL> > swapping like crazy or the communication is stalling. on that
VL> 
VL> I believe it. So do you think 10G infiniband is good enough for my job?

no. i think your machine is not used properly. this can be either
due to the setup of the machine or due to the way jobs are run. 
based on previous experience, it is a combination of both.

VL> By the way, there is also another two parallel job on the line. one is
VL> lammps, a classical MD code. the other is VASP. the lammps and vasp are
VL> using 32 cpus in total. my job is using 39 nodes and two cpu for each node.

as long as those jobs are running on _separate_ nodes, there should
be no problem. again, it is very hard to give _any_ useful advice
without having the proper information at hand, i.e. without knowing all
details about your machine (number of nodes, hardware) and how nodes
are assigned for use (batch system, how you submit jobs etc.).

VL> >vega mentioned that openmpi didn't work because
VL> > of "lack of memory". i suspect that this is due to incorrect
VL> > setup of the infiniband fabric and user limits. ulimit -a
VL> > should produce on the compute nodes something like this.
VL> > particularly the "max locked memory" entry is very important,
VL> > and not setting it high enough will result in severely
VL> > degraded performance (cf. ofed documentation).
VL> 
VL> I'm really admiring your experiences and sense about the parallel job.

this is not a very special skill, just a matter of actually _reading_
the output on the screen and _looking it up_ in the documentation.

VL> the max lock memory option on my machines is 4. Now I asked my
VL> system administrator to reset it to 1024000. My physical memory
VL> is 2G. Do you think set it to 2048000 will be better?

1GB should be ok. if you have quad-core cpus, you are
pretty much memory starved for large DFT jobs, and you should run
with -npernode 4, i.e. using only half the cores.
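
a minimal sketch for checking the locked-memory limit from inside a job,
assuming python is available on the compute nodes (nothing in this thread
confirms that). it uses the standard `resource` module. note that
`ulimit -l` reports kbytes while `getrlimit` returns bytes, hence the
conversion. the 1024000 kbytes threshold is just the value discussed
above, and the helper name `memlock_sufficient` is made up for
illustration.

```python
import resource

# value discussed in this thread (about 1GB), not a universal rule
REQUIRED_KB = 1024000


def memlock_sufficient(limit_bytes, required_kb=REQUIRED_KB):
    """Return True if a RLIMIT_MEMLOCK value (in bytes) is at least required_kb."""
    if limit_bytes == resource.RLIM_INFINITY:
        return True  # reported as "unlimited" by ulimit
    return limit_bytes >= required_kb * 1024


if __name__ == "__main__":
    # getrlimit returns the (soft, hard) limits of the current process
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    for name, value in (("soft", soft), ("hard", hard)):
        if value == resource.RLIM_INFINITY:
            shown = "unlimited"
        else:
            shown = f"{value // 1024} kbytes"
        status = "ok" if memlock_sufficient(value) else "too low"
        print(f"max locked memory ({name}): {shown} -> {status}")
```

it only tells you something if it is started the same way as the real
job (same batch system, same launcher), since the limits seen in an
interactive login shell can differ from those of processes started on
the compute nodes.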

VL> I'll try openmpi again, later, using flags --mca btl_openib_use_srq 1
VL> with mpirun.
VL> 
VL> By the way I'm using MPICH2 now, not MPICH.

same difference. i don't remember it supporting infiniband,
so it is quite possible that you have not been using the
infiniband at all. this can be _easily_ verified by running
a smaller test job with both MPICH and OpenMPI across a few nodes
and comparing the timings (not using -npools, to make the
difference more evident).
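
one possible way to script such a comparison, as a rough sketch only.
the helper names `time_command` and `compare` are hypothetical, and the
commands in the demo are placeholders that merely start python itself.
the real launch lines (mpirun/mpiexec, number of processes, input file)
depend on each MPI installation and are deliberately not spelled out
here.

```python
import subprocess
import sys
import time
from contextlib import ExitStack


def time_command(cmd, stdin_path=None, stdout_path=None):
    """Run cmd (a list of arguments) and return its wall-clock time in seconds."""
    with ExitStack() as stack:
        # optionally feed an input file on stdin and keep the output in a file
        stdin = stack.enter_context(open(stdin_path, "rb")) if stdin_path else None
        stdout = stack.enter_context(open(stdout_path, "wb")) if stdout_path else None
        start = time.perf_counter()
        subprocess.run(cmd, stdin=stdin, stdout=stdout, check=True)
        return time.perf_counter() - start


def compare(timings):
    """Given {label: wall seconds}, return (label, seconds, slowdown) rows, fastest first."""
    best = min(timings.values())
    rows = [(label, secs, secs / best) for label, secs in timings.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # placeholder commands: replace with the actual launch line of each MPI library
    commands = {
        "mpi library A": [sys.executable, "-c", "pass"],
        "mpi library B": [sys.executable, "-c", "pass"],
    }
    results = {label: time_command(cmd) for label, cmd in commands.items()}
    for label, secs, slowdown in compare(results):
        print(f"{label}: {secs:.2f} s wall time ({slowdown:.2f}x the fastest)")
```

use the same input, the same nodes and the same number of processes for
both runs, otherwise the numbers are not comparable.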

in summary, please don't jump to conclusions and guess what is
going wrong when running parallel jobs, but identify the performance
bottlenecks systematically and back up each claim you make with
hard data, so that people don't have to speculate about what could
have gone wrong, but can confirm your findings and give proper advice.
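
as an illustration of what "hard data" could look like: a small sketch
that turns measured wall times at different process counts into speedup
and parallel efficiency, relative to the smallest run. the function name
`scaling_table` is hypothetical and the numbers in the demo are invented
purely to show the arithmetic. they are not measurements from any
machine.

```python
def scaling_table(runs):
    """Given {number of processes: wall seconds}, return rows of
    (nprocs, seconds, speedup, efficiency) relative to the smallest run."""
    n_ref = min(runs)
    t_ref = runs[n_ref]
    rows = []
    for n in sorted(runs):
        speedup = t_ref / runs[n]
        # ideal speedup going from n_ref to n processes is n / n_ref
        efficiency = speedup * n_ref / n
        rows.append((n, runs[n], speedup, efficiency))
    return rows


if __name__ == "__main__":
    # invented wall times in seconds, for illustration only
    example = {2: 100.0, 4: 50.0, 8: 40.0}
    for n, secs, speedup, eff in scaling_table(example):
        print(f"{n:4d} procs  {secs:8.1f} s  speedup {speedup:5.2f}  efficiency {eff:5.2f}")
```

a table like this, together with the exact input, the node layout and
the way the job was launched, gives other people something they can
check instead of something they have to guess.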

you simply have to apply the same strategies and problem solving skills
that you need for proper scientific research. just doing some random
calculations and guessing conclusions from them doesn't give you
scientific insight either.

axel.


-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.

