[Pw_forum] openmp vs mpich performance with MKL 10.x

Tue May 6 21:21:49 CEST 2008

Dear Eduardo,

our own experiences are summarized here:
http://quasiamore.mit.edu/pmwiki/index.php?n=Main.CP90Timings

It would be great if you could contribute your own data, either for
pw.x or cp.x under the conditions you describe.
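
To get comparable wall-clock numbers for the two setups, something like the following could be used. It is only a rough sketch: time_run is a made-up helper name (not part of Q-E), the resolution is whole seconds, and the launch lines in the comments are simply the ones from the message quoted below with NCPU=4.

```shell
# time_run LABEL COMMAND [ARGS...]
# Runs COMMAND and prints its wall-clock time in whole seconds.
# time_run is a hypothetical helper name, not a Q-E tool.
time_run() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit status $status)"
}

# Stand-in command so the sketch runs anywhere:
time_run "demo" sleep 1

# For the real comparison one would wrap the two launch lines, e.g.
#   time_run "mpi"    sh -c 'mpiexec -n 4 pw.x <input >output'
#   time_run "openmp" sh -c 'pw.x <input >output'
```

The timings that pw.x prints at the end of its own output are of course the finer-grained source; the wrapper is just a quick cross-check.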

I noticed indeed, informally, a few of the things you mention:

1) no improvement with the Intel fftw2 wrapper over the fftw2 sources
shipped with Q-E, when using MPI. I also never managed to run
successfully with the Intel fftw3 wrapper (or with fftw3 - that probably
says something about me).

2) great improvements in a serial code (different from Q-E) when using
the automatic parallelism of MKL on quad-cores.

3) btw, MPICH has always been for us the slower MPI implementation,
compared with LAM/MPI or Open MPI.

I actually wonder if the best solution on a quad-core would be, say,
to use two cores for MPI, and the other two for the openmp threads.
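
As a rough sketch of what I mean (untested on my side): take the MPI build described in the quoted message, which is linked with -openmp, and give each MPI process two threads instead of one. NCORES, NPROCS and NTHREADS are illustrative names of mine; the launch line is only printed, not executed.

```shell
# Hybrid sketch for one quad-core node: 2 MPI processes x 2 MKL/OpenMP threads.
# NCORES, NPROCS and NTHREADS are illustrative names, not Q-E settings.
NCORES=4
NPROCS=2
NTHREADS=$((NCORES / NPROCS))   # threads per MPI process

export OMP_NUM_THREADS=$NTHREADS
export MKL_NUM_THREADS=$NTHREADS

# Build the launch line and print it; this sketch does not run pw.x itself.
CMD="mpiexec -n $NPROCS pw.x"
echo "$CMD <input >output   (with $NTHREADS threads per process)"
```

Whether this beats 4 pure MPI processes or 4 pure threads is exactly the open question, so it would have to be timed.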

I eagerly await Axel's opinion.

			nicola

Eduardo Ariel Menendez Proupin wrote:
> Hi,
> I have noted recently that I am able to obtain faster binaries of pw.x 
> using the OpenMP parallelism implemented in the Intel MKL libraries 
> of version 10.xxx, than using MPICH, on Intel CPUs. Previously I had 
> always gotten better performance using MPI. I would like to know of 
> others' experience on how to make the machines faster. Let me explain in 
> more detail.
> 
> Compiling using MPI means using mpif90 as linker and compiler, linking 
> against mkl_ia32 or mkl_em64t, and using link flags -i-static -openmp. 
> This is just what appears in the make.sys after running configure 
> in version 4cvs.
> 
> At runtime, I set
> export OMP_NUM_THREADS=1
> export MKL_NUM_THREADS=1
> and run using
> mpiexec -n $NCPU pw.x <input >output
> where NCPU is the number of cores available in the system.
> 
> The second choice is
> ./configure --disable-parallel
> 
> and at runtime
> export OMP_NUM_THREADS=$NCPU
> export MKL_NUM_THREADS=$NCPU
> and run using 
> pw.x <input >output
> 
> I have tested it on quad-cores (NCPU=4) and on an old dual Xeon B.C. 
> (before cores) (NCPU=2).
> 
> Before April 2007, the first choice had always worked faster. After 
> that, when I came to use the MKL 10.xxx, the second choice has been 
> working faster. I have found no significant difference between version 
> 3.2.3 and 4cvs.
> 
> A special comment is for the FFT library. The MKL has a wrapper to the 
> FFTW, that must be compiled after installation (it is very easy). This 
> creates additional libraries named like libfftw3xf_intel.a and 
> libfftw2xf_intel.a.
> This improves the performance in the second choice, especially 
> with libfftw3xf_intel.a.
> 
> Using MPI, libfftw2xf_intel.a is as fast as using the FFTW source 
> distributed with espresso, i.e., there is no gain in using 
> libfftw2xf_intel.a. With libfftw3xf_intel.a and MPI, I have never been 
> able to run pw.x successfully, it just aborts.
> 
> I would like to hear of your experiences.
>  
> Best regards
> Eduardo Menendez
> University of Chile
> 
> 

-- 
---------------------------------------------------------------------
Prof Nicola Marzari   Department of Materials Science and Engineering
13-5066   MIT   77 Massachusetts Avenue   Cambridge MA 02139-4307 USA
tel 617.4522758 fax 2586534 marzari at mit.edu http://quasiamore.mit.edu