Issues due to different lengths (times) of particle trajectories:
1. OpenMP imbalance (between threads/CPUs within one MPI process)
2. MPI imbalance (between MPI processes, all threads/CPUs of different MPI processes affected)
Both types of imbalance become very pronounced for rare, extremely long particle trajectories.
Possible solutions (BSC team):
1. Reduce the cut-off time for long trajectories (a reasonable value would be 0.1 s, compared to the 1.0 s used so far)
2. Check the imbalance for much higher particle statistics: 500,000 particles in total, maxMpiChunkSize of ~250 particles, 64 nodes, 4 MPI tasks per node, 12 OpenMP threads per MPI task
3. For MPI imbalance: use the Dynamic Load Balancing (DLB) library to give the work of one MPI task to other, unoccupied MPI tasks
4. For OpenMP imbalance: advance communication: start new particles per thread without waiting for other threads
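Solutions 1 and 4 can be illustrated with a small cost model. This is a minimal sketch, not ERO2.0 code: all function names are hypothetical, trajectory times are plain numbers, and it is an assumption that the particle loop can be written as an OpenMP loop of the shape shown. It compares a static split of particles between threads with a scheme where each thread takes the next particle as soon as it is free, with and without a cut-off.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cap every trajectory time at the cut-off (solution 1).
std::vector<double> applyCutoff(std::vector<double> times, double cutoff) {
    for (double& t : times) t = std::min(t, cutoff);
    return times;
}

// Wall time if particles are split into equal contiguous blocks up front:
// a thread that finishes early idles until the slowest thread is done.
double staticMakespan(const std::vector<double>& times, std::size_t nThreads) {
    assert(nThreads > 0);
    const std::size_t block = (times.size() + nThreads - 1) / nThreads;
    double worst = 0.0;
    for (std::size_t th = 0; th < nThreads; ++th) {
        double busy = 0.0;
        const std::size_t begin = th * block;
        const std::size_t end = std::min(times.size(), begin + block);
        for (std::size_t i = begin; i < end; ++i) busy += times[i];
        worst = std::max(worst, busy);
    }
    return worst;
}

// Wall time if each thread starts the next particle as soon as it is free,
// without waiting for the other threads (solution 4).
double dynamicMakespan(const std::vector<double>& times, std::size_t nThreads) {
    assert(nThreads > 0);
    std::vector<double> freeAt(nThreads, 0.0);
    for (double t : times) {
        // The thread that becomes free first takes the next particle.
        *std::min_element(freeAt.begin(), freeAt.end()) += t;
    }
    return *std::max_element(freeAt.begin(), freeAt.end());
}

// The same idea as an OpenMP loop: schedule(dynamic, 1) hands out one
// particle at a time. The loop body is a placeholder for the real tracing.
double tracedTimeTotal(const std::vector<double>& times, double cutoff) {
    double total = 0.0;
    const long n = static_cast<long>(times.size());
    #pragma omp parallel for schedule(dynamic, 1) reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += std::min(times[i], cutoff);
    }
    return total;
}
```

In this model, eight particles (one of length 1.0, seven of length 0.01) on two threads give a wall time of 1.03 with the static split and 1.0 with the dynamic one. With a cut-off of 0.1 these drop to 0.13 and 0.1. Dynamic hand-out alone cannot hide a single very long trajectory, which is why the cut-off matters.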
Issues with serial parts (communication with the master process):
Under test conditions (40,000 particles, maxMpiChunkSize=50, 32 nodes), serial communication takes ~40% of the total time. This fraction can become even larger once the issue with long trajectories is solved, because the parallel part will then shrink while the serial part stays the same.
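A rough estimate of how the serial fraction grows, assuming the serial communication time stays constant and fixing the imbalance speeds up only the parallel part by a factor k (k is a hypothetical parameter, not a measured value), with f_s = 0.4 taken from the test above:

```latex
f_s' = \frac{f_s}{f_s + (1 - f_s)/k}, \qquad f_s = 0.4:\quad
k = 2 \;\Rightarrow\; f_s' = \frac{0.4}{0.4 + 0.3} \approx 0.57, \qquad
k = 4 \;\Rightarrow\; f_s' = \frac{0.4}{0.4 + 0.15} \approx 0.73
```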
Particular code parts involved:
SparseData2D/3D (.h)
ero2::DensityManager::gatherParticleDensities2D/3D
ero2::DensityManager::gatherEmissionDensities2D/3D
To do:
ERO team: assess the possibility of some parallelization of the serial communication part
BSC team: compare the serial communication time with cases where no volumetric data has to be collected
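One option the ERO team could assess is a pairwise (tree-shaped) gather, in which processes merge their sparse densities among themselves before anything reaches the master. The sketch below only simulates this in a single process. SparseDensity is a hypothetical stand-in (cell index to value), not the actual SparseData2D/3D interface, and mergeInto and treeGather are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a sparse density: cell index -> accumulated value.
using SparseDensity = std::map<std::size_t, double>;

// Add every entry of src into dst.
void mergeInto(SparseDensity& dst, const SparseDensity& src) {
    for (const auto& kv : src) dst[kv.first] += kv.second;
}

// Pairwise gather: in each round, "rank" r absorbs the data of rank r + stride.
// The merges within one round touch disjoint pairs, so on real MPI ranks they
// could proceed at the same time. The result ends up in parts[0].
// Returns the number of rounds, which grows like log2(number of ranks).
std::size_t treeGather(std::vector<SparseDensity>& parts) {
    std::size_t rounds = 0;
    for (std::size_t stride = 1; stride < parts.size(); stride *= 2) {
        for (std::size_t r = 0; r + stride < parts.size(); r += 2 * stride) {
            mergeInto(parts[r], parts[r + stride]);
            parts[r + stride].clear();
        }
        ++rounds;
    }
    return rounds;
}
```

With a scheme of this kind the master takes part in about log2(P) merge steps instead of receiving from all P-1 other processes one after another. Whether this pays off depends on how large the merged sparse data become, which would have to be measured.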
Further considerations regarding the serial part:
- Can parallelization via domain decomposition help to parallelize the serial communication?
- How much will it complicate the post-processing / visualization of results?
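To make the domain-decomposition question concrete, here is a minimal sketch under stated assumptions: the density grid is flattened to nCells cells, each process owns one contiguous slab of cells, and SparseDensity, ownerOf and splitByOwner are hypothetical names rather than existing ERO2.0 code. Each process sorts its local contributions by owning process, so that every owner would only merge its own slab instead of the master merging everything.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a sparse density: cell index -> accumulated value.
using SparseDensity = std::map<std::size_t, double>;

// Block decomposition: each rank owns one contiguous slab of cells.
std::size_t ownerOf(std::size_t cell, std::size_t nCells, std::size_t nRanks) {
    assert(nRanks > 0 && cell < nCells);
    const std::size_t slab = (nCells + nRanks - 1) / nRanks;  // cells per rank, rounded up
    return cell / slab;
}

// Sort one rank's local contributions into one bucket per owning rank.
// In an MPI code, bucket i would be sent to rank i (for example in an
// all-to-all exchange) and merged there.
std::vector<SparseDensity> splitByOwner(const SparseDensity& local,
                                        std::size_t nCells, std::size_t nRanks) {
    std::vector<SparseDensity> outgoing(nRanks);
    for (const auto& kv : local) {
        outgoing[ownerOf(kv.first, nCells, nRanks)][kv.first] += kv.second;
    }
    return outgoing;
}
```

The cost for post-processing is visible here: the final density no longer lives on one process, so output has to be either written per slab or reassembled in a separate step.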