Large-scale micromagnetics simulations with dipolar interaction using all-to-all communications

We implement low-complexity parallel fast-Fourier-transform (FFT) algorithms on our micromagnetics simulator, which reduce the number of all-to-all communications from six to two. Almost all of the computation time in a micromagnetics simulation is taken up by the calculation of the magnetostatic field, which can be calculated using the FFT method. The results show that the simulation time decreases with good scalability, even when the micromagnetics simulation is performed using 8192 physical cores. This high parallelization efficiency enables large-scale micromagnetics simulations using over one billion calculation cells to be performed. Because massively parallel computing is needed to simulate the magnetization dynamics of real permanent magnets composed of many micron-sized grains, we expect our simulator to reveal how magnetization dynamics influences the coercivity of permanent magnets.


I. INTRODUCTION
Magnetization dynamics in ferromagnetic materials is an important consideration in determining magnetic properties such as the remanent magnetization and coercivity. The importance of permanent magnets with high coercivity is evident from the need for high-power motors, which are used in many clean technologies [1]. Although permanent magnets have been investigated for a long time, how magnetization dynamics determines their coercivity remains unclear.
Micromagnetics simulation based on the Landau-Lifshitz-Gilbert (LLG) equation is one of the most useful methods for the theoretical investigation of magnetization dynamics in permanent magnets. With this method, we can not only observe the magnetization dynamics but also estimate the magnetic energy in the permanent magnet [2-4]. As computing technology develops rapidly, we are now able to perform simulations on supercomputing systems using about 0.1 billion calculation cells. However, this size is still small for micromagnetics simulations of permanent magnets. Such simulations use calculation cells of about 1 nm^3 because the exchange length of a high-coercivity magnet is only a few nanometers.
Hence, performing such simulations requires more massively parallel computation. We implemented a low-complexity fast-Fourier-transform (FFT) algorithm on our simulator, which runs on a supercomputing system [5], and measured the effect of parallelization in massively parallel computations. Because this FFT algorithm reduces the number of all-to-all communications from six to two, it offers advantages in such computations and enables the magnetization dynamics of enormous systems to be simulated.

II. COMPUTATION METHOD
Our simulator calculates the magnetization dynamics by solving the LLG equation, in which M is the magnetization, γ the gyromagnetic ratio, α the Gilbert damping constant, and H_eff the effective field, which accounts for the anisotropy, exchange, external, and magnetostatic (H_d) fields. The magnetostatic field is obtained from a convolution integral of the magnetization with the demagnetization tensor K(r−r′) [6]. Using FFT methods, we can not only calculate H_d but also reduce the computational complexity to O(n log n), with n being the total number of cells.
Our simulator employs hybrid parallel computation, combining symmetric multiprocessing (SMP) and the message passing interface (MPI). Because the magnetization data are stored in the memory of each MPI process, the stored information has to be exchanged over a torus network that extends in all directions. Figure 1(a) shows a schematic of the torus network, which extends in the x, y, and z directions. After the data arrays are exchanged by all-to-all communications over the torus network along the x direction, each MPI process holds all the magnetization data in the x direction. Hence we can perform a parallel FFT using the following algorithm [7,8]:
Step 1: All-to-all communications are performed in the communicator along the x direction to rearrange the data from (L_x, L_y, L_z) to (N_x, L_y, L_z/P_x).
Step 2: The FFT is executed along the x direction in each MPI process.
Step 3: All-to-all communications are performed again to rearrange the magnetization data from (N_x, L_y, L_z/P_x) back to (L_x, L_y, L_z), where P_i = N_i/L_i with i being x, y, or z.
Repeating these steps in all three directions gives the Fourier transform of the magnetization. In this algorithm, called the nine-step FFT, all-to-all communications are performed six times to rearrange the array of magnetization data. However, the rearrangement after the FFT (Step 3) is not always necessary in every direction when, as for the convolution integral, the information is needed only at each mesh point. In our simulator, we omit this rearrangement in the y and z directions to reduce the communication time. Therefore only two all-to-all communications are needed per FFT calculation, provided that MPI parallelization is not used in the x direction.
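The two ideas in this section, FFT-based evaluation of a convolution and the count of all-to-all communications, can be illustrated with a minimal single-process Python sketch. It is not the simulator's implementation: it uses a scalar kernel with periodic boundaries instead of the demagnetization tensor with zero padding, and the names fft_convolve, direct_convolve, and all_to_all_count are hypothetical. The counting function only encodes the arithmetic stated above (one all-to-all before each distributed FFT direction, plus one afterwards if the layout is restored).

```python
import numpy as np


def fft_convolve(kernel, m):
    """Circular convolution of two equally shaped 3D arrays in O(n log n).

    The product in Fourier space is pointwise, so only the value at each
    mesh point is needed; no neighbouring data enter the multiplication.
    """
    return np.real(np.fft.ifftn(np.fft.fftn(kernel) * np.fft.fftn(m)))


def direct_convolve(kernel, m):
    """O(n^2) reference: out[r] = sum over r' of kernel[r - r'] * m[r'] (periodic)."""
    out = np.zeros(m.shape, dtype=float)
    for idx in np.ndindex(m.shape):
        # np.roll(kernel, idx)[r] equals kernel[r - idx] with periodic wrap-around.
        out += m[idx] * np.roll(kernel, shift=idx, axis=(0, 1, 2))
    return out


def all_to_all_count(distributed_dirs, restore_layout):
    """Number of all-to-all communications for one 3D FFT.

    distributed_dirs: directions that are split across MPI processes.
    restore_layout:   True if Step 3 (rearranging back) is performed.
    """
    per_direction = 2 if restore_layout else 1
    return len(distributed_dirs) * per_direction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kernel = rng.standard_normal((4, 3, 5))
    m = rng.standard_normal((4, 3, 5))
    print(np.allclose(fft_convolve(kernel, m), direct_convolve(kernel, m)))
    print(all_to_all_count("xyz", restore_layout=True))   # nine-step FFT
    print(all_to_all_count("yz", restore_layout=False))   # scheme described above
```

Because the Fourier-space product is pointwise, the kernel can be stored once in whatever layout the forward transform leaves the data in, which is why the rearrangement of Step 3 can be skipped for the convolution.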
We performed large-scale micromagnetics simulations on the Blue Gene/Q at the High Energy Accelerator Research Organization and measured the computational performance of our simulator. This supercomputing system has 6144 nodes, and each node has 16 computing cores. We choose N_x = N_y = N_z, i.e., the simulation system is cubic, and divide it into rectangular parallelepiped regions as shown in Fig. 1(b). In the simulation, we fix the ratio of L_y to L_z at 1:8. We use one node for each MPI process and 32 SMP threads in each MPI process.
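As a rough illustration of this decomposition, the following Python sketch works out the process grid implied by a cubic system and a fixed 1:8 ratio of L_y to L_z. It assumes that there is no MPI parallelization in the x direction (P_x = 1) and that all divisions are exact. The function name decompose and these restrictions are illustrative assumptions, not details taken from the simulator.

```python
import math


def decompose(n, nodes, ratio=8):
    """Return (P_y, P_z, L_y, L_z) for an n^3 grid with L_y : L_z = 1 : ratio.

    Assumes P_x = 1 and one MPI process per node. With N_y = N_z = n,
    L_z = ratio * L_y implies P_y = ratio * P_z, so nodes = ratio * P_z**2.
    """
    p_z = math.isqrt(nodes // ratio)
    if ratio * p_z * p_z != nodes:
        raise ValueError("nodes must equal ratio * k**2 for an integer k")
    p_y = ratio * p_z
    if n % p_y or n % p_z:
        raise ValueError("grid size must be divisible by P_y and P_z")
    return p_y, p_z, n // p_y, n // p_z


if __name__ == "__main__":
    cores_per_node = 16  # stated in the text for this machine
    for nodes in (32, 128, 512):
        p_y, p_z, l_y, l_z = decompose(512, nodes)
        print(nodes, nodes * cores_per_node, (p_y, p_z), (l_y, l_z))
```

Under these assumptions, 512 nodes and a 512^3 system give P_y = 64, P_z = 8, L_y = 8, and L_z = 64, and 512 nodes with 16 cores each correspond to the 8192 physical cores quoted in the abstract.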

III. RESULTS
Figure 2(a) shows the computation times as a function of the number of nodes. In this calculation, the size of the simulation system is 512^3 cells. Because we confirmed that the simulation results agree with our previous work, in which we used a Hitachi SR16000/M1, we are confident that the simulator calculates the magnetization dynamics accurately. The results show good scalability with respect to the number of nodes. The calculation times for the LLG equation were inversely proportional to the number of nodes. In addition, the time taken for all-to-all communications decreased proportionately. Generally speaking, the share of the FFT calculation taken up by all-to-all communications cannot be ignored. In the present case, communication accounted for about 44.5% of the FFT calculation time. Hence reducing the number of all-to-all communications makes calculations involving large systems feasible.
One can see that our micromagnetics simulator is specialized for massively parallel computing. Figure 2(b) shows the calculation time as a function of the total number of calculation cells using 512 nodes. When the simulation system is as small as 64^3 or 128^3 cells, the benefit of parallelization for the FFT calculation is small because of the communication time. In contrast, good scalability is found in large simulations, where the calculation times are proportional to n log(n). This property is related to the node-number scalability discussed above and demonstrates that our simulator can be used effectively for large-scale micromagnetics simulations in massively parallel computing environments.
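To make the n log(n) scaling concrete, the short Python sketch below computes the cost ratio it predicts between two system sizes. This is pure arithmetic on the stated scaling law, not a measurement. The function name nlogn_ratio and the choice of 1024^3 as the larger system are illustrative assumptions.

```python
import math


def nlogn_ratio(cells_small, cells_large):
    """Ratio of n*log(n) costs between two cell counts (base of log cancels)."""
    return (cells_large * math.log(cells_large)) / (
        cells_small * math.log(cells_small)
    )


if __name__ == "__main__":
    small, large = 512**3, 1024**3
    print(large)                      # number of cells in a 1024^3 system
    print(nlogn_ratio(small, large))  # predicted cost ratio under n log(n) scaling
```

Under ideal n log(n) scaling, going from 512^3 to 1024^3 cells (about 1.07 billion) multiplies the cost by about 8.9, only slightly more than the factor of 8 in the cell count.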

IV. SUMMARY
We implemented an FFT algorithm on our micromagnetics simulator that reduces the number of all-to-all communications from six to two, and performed micromagnetics simulations to measure the effect of parallelization. Our simulation results show that the time taken for all-to-all communications was about 44.5% of the FFT calculation time. Thus, reducing the number of all-to-all communications is advantageous for parallel computing, because the communication time associated with the FFT calculation cannot be ignored. Although the effect of parallelization deviates from the theoretical value for small systems, we find good scalability of the calculation time for sufficiently large systems. These properties show that our simulator can realize large-scale micromagnetics simulations that far exceed present levels, for instance, up to one billion cells with massively parallel computing.

FIG. 1. (a) Schematic of the torus network in all three directions. Blue, green, and orange loops represent the network in the x, y, and z directions, respectively. (b) Schematic of the process array distribution for MPI parallelization. N_x, N_y, and N_z (L_x, L_y, and L_z) are the edge sizes of the simulation system (of the region held by each MPI process).