09:11:26 From cristian sommariva : So I need to explicitly transfer data to the device, but the retrieval of the data from the device is implicit at the end of the target region, is this right?
09:12:15 From Thomas Hayward-Schneider : Q: is "omp target" blocking or implied "nowait"?
09:14:25 From Christian Terboven : @Thomas: it is blocking.
09:14:35 From Christian Terboven : Asynchronous offloading will be covered next time.
09:14:43 From Thomas Hayward-Schneider : Thanks.
09:14:45 From Thomas Hayward-Schneider : If the loop went from 0 to SZ/2 (instead of 0:SZ), would the compiler do an implicit "tofrom:y[0:SZ/2]"?
09:14:55 From Christian Terboven : But as a teaser: "nowait" is the thing you have to use. :-)
09:15:07 From Christian Terboven : No.
09:15:20 From Thomas Hayward-Schneider : I can "nowait" to hear it! :-)
09:15:29 From Christian Terboven : If you ask for y[0:SZ] to be transferred, this is what the compiler will transfer.
09:15:55 From Thomas Hayward-Schneider : And with USM, would you still get all of y[:]?
09:17:38 From Samuel Lazerson : Q: So it is possible to have one part of the code put the X array on the target device and then have a later part of the code work with that array, to avoid re-transferring massive arrays every time you want to compute something?
09:19:09 From Christian Terboven : @Samuel: yes. We will talk about unstructured data movement next time. Today is about the basics.
09:19:30 From Aleksander Dubas : Is a implicitly map(tofrom:a)?
09:19:48 From Christian Terboven : Implicit mapping: yes for scalars.
09:20:27 From Christian Terboven : @USM: it depends on the USM implementation. But you have to tell OpenMP about that. Again, this is not covered today, but next time.
09:23:02 From Jorge Gonzalez : Q: If no target option is given at compilation, are "target" statements ignored?
09:23:27 From Christian Terboven : @Jorge: I will cover that in a few minutes.
09:23:33 From Jorge Gonzalez : Oh, OK, thank you.
09:23:43 From Thomas Hayward-Schneider : (And by extension, is target "if"-able?)
09:24:36 From Gabriele Fatigati : So, it is not possible to overlap CPU and GPU computation?
09:24:58 From Christian Terboven : @Gabriele: it is possible, because asynchronous offloading is possible. We just start with the basics. :-)
09:25:05 From Gabriele Fatigati : OK, thanks.
09:33:31 From Thomas Hayward-Schneider : Q: Can you let us know when you discuss features which are not widely supported? (For example, support in the Intel oneAPI compilers, AOCC/AOMP, NV HPC-SDK, and/or GCC. I appreciate the Cray and IBM compilers are pretty good, but they're not so widely available.)
09:34:13 From cristian sommariva : So basically a contention group is a CUDA block?
09:34:17 From Michael Klemm : @Thomas: That sounds like a good question for the end of the webinar today.
09:35:10 From Michael Klemm : @cristian: Roughly speaking, yes. A contention group limits the effect of synchronization. So for CUDA, a thread block with its local hardware barrier is a natural contention group.
09:36:10 From cristian sommariva : Perfect, thanks.
09:38:22 From Simppa Äkäslompolo : num_teams … is it the maximum allowed, a request for that number, or a demand?
09:38:46 From Michael Klemm : If you ask for n teams, then you'll get n teams.
09:39:50 From Michael Klemm : The hardware and the OpenMP implementation will then map them to the available hardware parallelism. So, that means if you ask for 1 million teams, only some of them will execute in parallel; for instance, for AMD GPUs this is a few hundred.
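For reference, a minimal sketch of the daxpy-style offload behind the mapping questions above, assuming x and y are arrays of length SZ as in the webinar example; the wrapper function and the use_gpu flag are illustrative additions, not from the webinar.

    // Sketch only: x is read on the device, y is read and written, so y is
    // copied back implicitly when the target region ends (the "retrieval"
    // asked about above). Mapping y[0:SZ/2] instead would transfer exactly
    // that half and nothing more. The scalar a needs no explicit map clause,
    // and the if() clause answers the "is target if-able?" question: when the
    // condition is false, the region runs on the host instead.
    void daxpy(int SZ, double a, const double *x, double *y, int use_gpu)
    {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:SZ]) map(tofrom: y[0:SZ]) if(target: use_gpu)
        for (int i = 0; i < SZ; ++i)
            y[i] = a * x[i] + y[i];
    }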
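A small illustration of the num_teams discussion; the numbers below are arbitrary examples, not recommendations.

    // Each team is its own contention group (roughly a CUDA thread block):
    // threads can synchronize inside a team, but not across teams. Per the
    // answer above, num_teams(256) gives you 256 teams, and the runtime then
    // maps them onto the hardware parallelism that is actually available;
    // thread_limit(128) caps the number of threads per team.
    void scale_teams(double *y, double a, int n)
    {
        #pragma omp target teams distribute parallel for \
                map(tofrom: y[0:n]) num_teams(256) thread_limit(128)
        for (int i = 0; i < n; ++i)
            y[i] *= a;
    }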
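As a hedged preview of the "next time" topics referenced above (unstructured data movement and asynchronous offloading), roughly what keeping an array resident on the device looks like; the function names are made up for illustration.

    void setup(double *x, int n)
    {
        // Allocate x on the device and copy it over once; it stays resident
        // until the matching exit data, so later target regions reuse it.
        #pragma omp target enter data map(to: x[0:n])
    }

    void compute(double a, double *x, double *y, int n)
    {
        // x is already present, so only y is transferred here. nowait makes
        // the offload asynchronous; taskwait synchronizes with it afterwards.
        #pragma omp target teams distribute parallel for map(tofrom: y[0:n]) nowait
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        // ... independent host work can overlap with the device here ...

        #pragma omp taskwait
    }

    void teardown(double *x, int n)
    {
        // Copy x back, or use map(delete: x[0:n]) to just discard it.
        #pragma omp target exit data map(from: x[0:n])
    }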
09:43:11 From Gabriele Fatigati : Are AMD GPUs supported?
09:43:17 From Michael Klemm : The latest Clang compiler is 12.0.0.
09:43:49 From Michael Klemm : Yes, upstream clang supports AMD GPUs, but AOMP and the clang compiler of ROCm usually have better performance.
09:43:56 From Thomas Hayward-Schneider : How does AOCC (and/or AOMP?) compare to upstream Clang/12?
09:44:00 From Thomas Hayward-Schneider : Ah, thanks.
09:44:18 From Michael Klemm : Typically, the vendor compilers are better in terms of fixed bugs and performance.
09:45:06 From Simppa Äkäslompolo : Any experience on nvc vs. clang for NVIDIA? (nvc from the NVIDIA HPC toolkit)
09:45:49 From Michael Klemm : At GTC, NVIDIA was showing NVC performance that was almost on par with CUDA for the cases they showed.
09:46:00 From Michael Klemm : I guess clang will not be as good (yet).
09:47:24 From Kos, Leon : What about xeus-cling support for GPUs?
09:48:03 From Simppa Äkäslompolo : Is there any way to get a better error message?
09:49:23 From Thomas Hayward-Schneider : Re Fortran: I guess "old" flang doesn't have OpenMP offloading support? I guess the new flang will, but that's not really usable yet?
09:50:40 From Michael Klemm : @Thomas: That's where the real differences start. F18 (the new flang in the LLVM project) does not have much offload support yet (I think, but it may have changed). So most vendors use an old flang, which is pretty much the old PGI front-end for Fortran.
09:51:07 From Michael Klemm : @Leon: I cannot answer that question, as I do not know the tool you're referring to.
09:53:49 From Simppa Äkäslompolo : Isn't there a term "team group"? Is it the same as a block?
09:55:12 From Michael Klemm : I haven't seen that term yet. In what context have you seen it?
09:55:13 From Kos, Leon : xeus-cling is a Jupyter kernel for C++ based on the C++ interpreter cling, and it can use -fopenmp. I wonder if the GPU target is supported too?
09:56:12 From Samuel Lazerson : Does the order of the "target teams distribute parallel for" and "map" keywords matter?
09:56:13 From Michael Klemm : @Leon: I don't know. But what I can say is that Christian and others are working on Jupyter notebooks with OpenMP support. I recall that they are using cling, too.
09:56:33 From Michael Klemm : @Samuel: yes for the directive names, no for the clauses.
09:56:41 From Samuel Lazerson : Thanks.
09:56:53 From Michael Klemm : "#pragma omp target parallel teams" would be wrong.
09:57:20 From Simppa Äkäslompolo : My mistake, I was confusing teams and contention groups.
09:57:32 From Michael Klemm : "#pragma omp target teams distribute parallel for map(...) schedule(...)" and "#pragma omp target teams distribute parallel for schedule(...) map(...)" are the same.
09:57:46 From Michael Klemm : No problem. A "team" is a "contention group".
09:58:03 From Simppa Äkäslompolo : Right.
09:59:24 From Serhiy Mochalskyy : Could you measure the data transfer time and the calculation time of daxpy?
09:59:46 From Serhiy Mochalskyy : Is it the data transfer mainly?
10:01:12 From Serhiy Mochalskyy : Sorry: it is slow on the GPU mainly due to the data transfer time, not the calculation, hence the 0.8 s vs 0.12 s.
10:01:18 From Michael Klemm : With what we are showing today, not within the code. External profilers will of course show it. In the next webinar, we will show how to separate data transfers and control flow, so that you can then also measure the individual contributions.
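To make the clause-order point concrete, a compilable version of the spelling Michael quotes above; the loop body is just a placeholder.

    // Directive names must keep the order target -> teams -> distribute ->
    // parallel -> for ("target parallel teams" is invalid, as noted above);
    // the clauses after them may appear in any order, so swapping the
    // map(...) and schedule(...) clauses below changes nothing.
    void axpby(double a, double b, const double *x, double *y, int n)
    {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n]) schedule(static)
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + b * y[i];
    }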
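On separating transfer time from compute time: the answer above defers the details to the next webinar, but as a rough host-side sketch (illustrative only; a GPU profiler gives more reliable numbers), a target data region already lets you time the transfers and the kernel separately.

    #include <omp.h>
    #include <stdio.h>

    void timed_daxpy(double a, const double *x, double *y, int n)
    {
        double t0, t1, t2, t3;

        t0 = omp_get_wtime();
        #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
        {
            t1 = omp_get_wtime();    // host-to-device copies are done here

            // x and y are already present, so the target region itself
            // transfers nothing and the timer around it sees only compute.
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];

            t2 = omp_get_wtime();    // target is blocking, so the kernel is done
        }
        t3 = omp_get_wtime();        // y was copied back when the data region ended

        printf("to device %.3f s, compute %.3f s, back to host %.3f s\n",
               t1 - t0, t2 - t1, t3 - t2);
    }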
10:06:40 From Thomas Hayward-Schneider : Semantically, is Fortran's "do concurrent" close to "!$omp loop\n do"?
10:09:53 From Michael Klemm : Yes, it is.
10:12:01 From Thomas Hayward-Schneider : Q: How widely supported are today's various features? Most notably: loop?
10:14:12 From Tilman Dannert (MPCDF) : Is there a Fortran compiler supporting these features for NVIDIA, AMD, and future Intel devices? clang cannot be used for Fortran; does flang already support OpenMP 5.0?
10:14:18 From Simppa Äkäslompolo : Up to which version of OpenMP?
10:15:22 From Simppa Äkäslompolo : Is GCC worth looking at at all?
10:15:50 From Thomas Hayward-Schneider : Does there exist any usable Fortran compiler for AMD GPUs? I think GCC 10 has technically working but not performant support? I guess GCC 11 is the best hope?
10:18:37 From Simppa Äkäslompolo : Is nvc also Clang-backed?
10:25:11 From Thomas Hayward-Schneider : Apropos compilers, Cineca very recently installed NVIDIA's HPC-SDK v21.3 in the module hpc-sdk/2021--binary on marconi100.
10:26:09 From Dion Engels : What would be a good place to read up on the topics of the next webinar already? I would like to work ahead a bit.
10:26:17 From Jorge Gonzalez : Thank you very much.
10:26:17 From cristian sommariva : Thank you!
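Closing the loop on the "do concurrent" / "!$omp loop" exchange above, the C spelling of the loop construct looks roughly like this; as discussed, how well it is supported still varies from compiler to compiler. The function is an illustrative sketch, not from the webinar.

    // The loop construct only asserts that the iterations are independent and
    // leaves the teams/threads/SIMD mapping to the implementation, which is
    // what makes it the closest OpenMP relative of Fortran's do concurrent.
    void scale_loop(double *y, double a, int n)
    {
        #pragma omp target teams loop map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] *= a;
    }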