Be careful if you run MPI code and use Anaconda
There’s a gotcha lurking if you work with MPI code and also use our current version of Anaconda! Find out how to deal with this.
We recently upgraded the version of Anaconda we provide on the School’s Ubuntu Linux hosts to version 2023.07. This new version of Anaconda has added support for running parallel code written with the popular Message Passing Interface (MPI), using the MPICH implementation of MPI to do all of the hard work.
While this is arguably a nice addition, it might cause you a few nasty surprises if you’re already running MPI code on our hosts!
This is because we currently use OpenMPI rather than MPICH as our default MPI implementation, as most of our users have always tended to use OpenMPI, with a few using Intel MPI and very few if any people using MPICH.
So, if you’ve previously compiled a C, C++ or Fortran MPI code on our hosts using mpicc, then it has almost certainly been compiled with OpenMPI, and will only work correctly when run using OpenMPI’s own mpirun (aka mpiexec, orterun) MPI code runner.
However, whenever you have Anaconda 2023.07 activated, the mpirun command will actually invoke MPICH’s code runner as included by Anaconda. Unfortunately MPICH & OpenMPI aren’t mutually compatible and trying to run OpenMPI code with MPICH’s mpirun (or vice versa) doesn’t work. Even worse, rather than giving you an error message, running OpenMPI code with MPICH’s mpirun ends up running multiple independent serial copies of the code, which really isn’t very useful and will probably run for a very long time before you notice there’s a problem.
So you’ll need to be careful if you use both MPI and Anaconda!
Here are some tips and examples to help you get things right.
Tip: Deactivate Anaconda before running your own MPI code
If you want to run some existing MPI code – which has most likely been compiled with OpenMPI – then make sure Anaconda is deactivated before you run it.
You can usually tell whether Anaconda is activated by checking your shell prompt. If it starts with (base) then Anaconda has probably been activated. If it has been activated, type conda deactivate to deactivate it.
Here’s an example. My first shell prompt shows me that Anaconda is active, so I deactivate it with conda deactivate and my prompt returns to normal:
(base) [dmckain@sausage ~]$ conda deactivate
[dmckain@sausage ~]$
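The prompt can be customised, so it isn't a fully reliable indicator. As a rough alternative, here is a minimal sketch that checks the environment instead. It assumes that conda exports the CONDA_PREFIX and CONDA_DEFAULT_ENV variables on activation and clears them again once nothing is left active; the function name conda_is_active is just made up for this example.

```shell
# Report whether a conda environment is active in this shell.
# Assumption: `conda activate` exports CONDA_PREFIX, and `conda deactivate`
# unsets it again once no environment is left active.
conda_is_active() {
    [ -n "${CONDA_PREFIX:-}" ]
}

if conda_is_active; then
    echo "Anaconda is active (${CONDA_DEFAULT_ENV:-unknown environment}): run 'conda deactivate' first"
else
    echo "Anaconda is not active"
fi
```

You could paste something like this at the top of a job script so that it complains before launching any MPI code.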
You can also check which mpirun you’re going to get with the infinitely useful which command:
[dmckain@sausage ~]$ which mpirun
/usr/bin/mpirun
[dmckain@sausage ~]$ readlink -f $(which mpirun)
/usr/bin/orterun
(The which command tells you which code actually gets run when you type mpirun. In this case, we’re getting /usr/bin/mpirun. Codes living in /usr/bin tend to be “core” codes rather than add-ons, so this usually indicates that you’re using the system default version of that code. mpirun tends to be a symbolic link (alias) for another code, and we can resolve the actual target code using the handy readlink command.)
If Anaconda were activated, we would instead get something like the following:
[dmckain@sausage ~]$ conda activate
(base) [dmckain@sausage ~]$ which mpirun
/opt/anaconda-2023.07/bin/mpirun
(base) [dmckain@sausage ~]$ readlink -f $(which mpirun)
/opt/anaconda-2023.07/bin/mpiexec.hydra
(Here we’re seeing that mpirun is resolving to something under /opt/anaconda-2023.07. The /opt directory typically contains additional/funky stuff, and that’s where we’ve installed our Anaconda.)
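If you'd rather not eyeball the paths each time, the two checks above can be rolled into one small helper. This is only a sketch: the function name mpi_flavour is invented for this example, and it assumes that the resolved names seen above (orterun for OpenMPI, mpiexec.hydra for MPICH) are enough to tell the two apart on our hosts. Anything else is reported as unknown rather than guessed at.

```shell
# Guess which MPI implementation a launcher belongs to from its resolved path.
# Assumption: the names seen above (orterun for OpenMPI, mpiexec.hydra for
# MPICH) are enough to tell the two apart on these hosts.
mpi_flavour() {
    case "$(basename "$1")" in
        orterun)        echo "OpenMPI" ;;
        mpiexec.hydra)  echo "MPICH" ;;
        *)              echo "unknown" ;;
    esac
}

# Resolve whatever mpirun is first on the PATH, if there is one.
if launcher=$(command -v mpirun); then
    mpi_flavour "$(readlink -f "$launcher")"
else
    echo "no mpirun found on PATH"
fi
```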
Recommendation: Don’t activate Anaconda by default
When you install Anaconda manually, it will add some stuff to your login files so that it gets activated whenever you launch a new terminal. This is often what you’d want on your own computer.
However the Anaconda that we provide for you doesn’t do this. We leave Anaconda deactivated by default, and you should activate it whenever you need it by typing conda activate. This is our recommended way of using Anaconda.
However it’s easy to accidentally override this, either by installing your own Anaconda or by typing conda init while using our Anaconda, which is sometimes recommended in error messages.
If you find that Anaconda is always activated by default, you can disable this by editing your ~/.bashrc login file to remove the auto-initialisation gubbins that Anaconda added in for you. This will be a bunch of lines as follows:
# >>> conda initialize >>>
...
# <<< conda initialize <<<
Remove all of these lines, close your editor and re-open your terminal. You should now find that Anaconda is no longer activated by default.
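If you'd prefer not to edit the file by hand, the same removal can be sketched with sed. This is an illustration rather than an official tool: the function name strip_conda_init and the ~/.bashrc.new file name are made up here, and it assumes the two marker lines appear exactly as shown above. It prints the cleaned-up copy rather than touching your real file, so you can check the result before swapping it in.

```shell
# Print a copy of a login file with the conda initialisation block removed.
# The sed range deletes every line from the opening marker to the closing one.
strip_conda_init() {
    sed '/^# >>> conda initialize >>>/,/^# <<< conda initialize <<</d' "$1"
}

# Hypothetical usage: write the result to a new file, compare it with the
# original, and only then replace ~/.bashrc yourself:
#   strip_conda_init ~/.bashrc > ~/.bashrc.new
#   diff ~/.bashrc ~/.bashrc.new
```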
Example of what goes wrong
[dmckain@sausage ~]$ which mpirun
/usr/bin/mpirun
[dmckain@sausage ~]$ readlink -f $(which mpirun)
/usr/bin/orterun
[dmckain@sausage ~]$ mpirun -np 4 xthi
Host=sausage MPI Rank=0 CPU=12 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=1 CPU= 0 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=2 CPU=13 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=3 CPU=18 NUMA Node=0 CPU Affinity=0-19
[dmckain@sausage ~]$ mpirun -np 4 xthi.mpich
Host=sausage MPI Rank=0 CPU=14 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=0 CPU=0 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=0 CPU=14 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=0 CPU=8 NUMA Node=0 CPU Affinity=0-19
Here we see that the mpirun command is giving us OpenMPI’s mpirun, which runs the xthi code as expected – note the 4 different MPI ranks being reported.
Running the MPICH code (xthi.mpich) with OpenMPI’s mpirun does give output, but note that all the ranks are reported as 0, meaning that it has actually run 4 independent serial copies of the code, which really isn’t what you want!
Here’s what happens if we do this while Anaconda is active, which I’m doing via an explicit use of conda activate:
[dmckain@sausage ~]$ conda activate
(base) [dmckain@sausage ~]$ which mpirun
/opt/anaconda-2023.07/bin/mpirun
(base) [dmckain@sausage ~]$ readlink -f $(which mpirun)
/opt/anaconda-2023.07/bin/mpiexec.hydra
(base) [dmckain@sausage ~]$ mpirun -np 4 xthi
Host=sausage MPI Rank=0 CPU=6 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=0 CPU=8 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=0 CPU=0 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=0 CPU=0 NUMA Node=0 CPU Affinity=0-19
(base) [dmckain@sausage ~]$ mpirun -np 4 xthi.mpich
Host=sausage MPI Rank=0 CPU=19 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=1 CPU= 2 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=2 CPU=10 NUMA Node=0 CPU Affinity=0-19
Host=sausage MPI Rank=3 CPU= 0 NUMA Node=0 CPU Affinity=0-19
This time, mpirun resolves to the MPICH mpirun provided by Anaconda, and we see that xthi.mpich now behaves correctly, whereas the OpenMPI-compiled xthi doesn’t. So the wrongness swaps around.
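Since the failure is silent, it can be worth adding a quick sanity check to your own workflow. The sketch below counts how many distinct ranks appear in xthi-style output, so the "everything is rank 0" symptom shows up as a count of 1. The function name distinct_ranks is invented for this example, and it assumes each process prints a line containing "MPI Rank=" followed by a number, as xthi does in the transcripts above. Your own code would need an equivalent line of output for this to work.

```shell
# Count the distinct MPI ranks reported in xthi-style output read from stdin.
# Assumption: each process prints one line containing "MPI Rank=<number>".
distinct_ranks() {
    grep -o 'MPI Rank=[0-9][0-9]*' | sort -u | wc -l | tr -d ' '
}

# Hypothetical usage with 4 processes:
#   n=$(mpirun -np 4 xthi | distinct_ranks)
#   [ "$n" -eq 4 ] || echo "expected 4 ranks, saw $n: wrong mpirun?" >&2
```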