XGCa
Streamed Namespace Reference

Classes

struct  StreamView
 
struct  Task
 

Typedefs

typedef Kokkos::Cuda GPUStream
 

Enumerations

enum  StreamJob { Sender = 0, Runner, Returner, NStreams }
 
enum  Option { NoSend = 0, Normal, NoReturn }
 
enum  Tasks { ToPinned = 0, Send, Run, Return, FromPinned, NTasks }
 

Functions

int partition_size (int i_partition, int n_soa_on_device, int n_partitions_of_device_aosoa)
 
template<typename Function , typename HostAoSoA , typename DeviceAoSoA >
void parallel_for (const std::string name, int n_ptl, Function func, Option option, HostAoSoA aosoa_h, DeviceAoSoA aosoa_d)
 

Typedef Documentation

typedef Kokkos::Cuda Streamed::GPUStream

Enumeration Type Documentation

enum Streamed::Option

Enumerator:
NoSend
Normal
NoReturn

enum Streamed::StreamJob

Enumerator:
Sender
Runner
Returner
NStreams

enum Streamed::Tasks

Enumerator:
ToPinned
Send
Run
Return
FromPinned
NTasks

Function Documentation

template<typename Function , typename HostAoSoA , typename DeviceAoSoA >
void Streamed::parallel_for (const std::string name, int n_ptl, Function func, Option option, HostAoSoA aosoa_h, DeviceAoSoA aosoa_d)

The streamed parallel_for creates three GPU streams and uses them to transfer data to and from the GPU asynchronously while executing. The data is split into chunks, and the steps are blocked so that the three streams never operate on the same chunk at the same time. Use of pinned host memory is optional, but empirically it is required to achieve fully asynchronous data transfer. If pinned memory is enabled, two extra tasks are added per step: a preliminary task that copies data into pinned memory, and a final task that copies returning data back out of pinned memory.

The smaller the chunk size, the more overlap occurs. However, if the chunks are too small, the device will not be saturated and performance will degrade.

The streamed parallel_for should take the following amount of time to finish N chunks, given execution time E and one-way communication time C per chunk:

T = max(E, C)*N + 2*C

If the send or the return is absent, this reduces to:

T = max(E, C)*N + min(E, C)

If the device is saturated, so that E = E_ptl*ptl_per_chunk, then in the limit of large n_ptl:

T -> max(E_ptl, C_ptl)*n_ptl

Parameters
[in]  name     is the label given for the Kokkos parallel_for
[in]  n_ptl    is the number of particles
[in]  func     is the lambda function to be executed
[in]  option   controls whether the streamed parallel_for skips the send or the return
[in]  aosoa_h  is the host AoSoA where the data resides
[in]  aosoa_d  is the device AoSoA that the data will be streamed to

Returns
void


inline int Streamed::partition_size (int i_partition, int n_soa_on_device, int n_partitions_of_device_aosoa)
