Swan: A simple tool for porting CUDA to OpenCL

Download latest version

For the impatient: download the latest version of Swan (16 December 2010) [[attachment:swan-10-12-16.tgz|here]] (by downloading this file, you are accepting the GPL2 license http://www.gnu.org/licenses/gpl-2.0.txt). Earlier versions available via wiki history. To build, untar and follow the instructions in the README. the requirements are: * A modern(ish) Linux OS * CUDA toolkit and/or OpenCL runtime and a matching GPU * Perl (with the package Text::Balanced) * gcc/g++ 3.x or later

Changes in latest update (10 May 2010): * Support for Fedora 12+. (Changed from using varargs to an explicit array) * Support for SM_20/Fermi * Some support for OpenCL images (set $has_opencl_images=1 in swan.ocl)

What is it?

Swan is a small tool that aids the reversible conversion of existing CUDA codebases to OpenCL. It does several useful things: * Translates CUDA kernel source-code to OpenCL. * Provides a common API that abstracts both CUDA and OpenCL runtimes. * Preserves the convenience of the CUDA ‘’«< grid, block »>’’ kernel launch syntax by generating C source-code for kernel entry-point functions.

It can also be usefully used for compiling and managing kernels written directly for OpenCL.

Why might you want it?

Possible uses include: * Evaluating OpenCL performance of an existing CUDA code. * Maintaining a dual-target OpenCL and CUDA code. * Reducing dependence on NVCC when compiling host code. * Support multiple CUDA compute capabilities in a single binary * A runtime library for managing OpenCL kernels for new development

Limitations

It’s not a drop-in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls re-written.

Swan does not support a few things. In particular: * CUDA C++ templating in kernel code. * OpenCL Images/Samplers (analogous to Textures) – texture interpolation done in software * Multiple device management in a single process. * Compiling kernels for the CPU. * CUDA device-emulation mode.

Furthermore, it’s a work in progress. It works for our code but no promises it will for yours.

Use

The essential steps for converting a CUDA code:

  • Separate CUDA kernels into separate source files.
  • Compile the kernel source using swan. For a CUDA target: swan --cuda kernel.kh kernel.cu or for OpenCL swan --opencl kernel.kh kernel.cu The output file kernel.kh contains:
  • a data block which holds the compiled kernel source code (PTX or IL).
  • autogenerated source-code functions for running each of the kernels
  • #include "kernel.kh" into the host source.
  • Re-write all kernel invocations to use the corresponding entry-point functions. For example, the kernel with formal declaration: __global__ void vec_add( float *in, float *out, int N ); will have an entry point defined as: void k_vec_add( int3 grid, int3 block, int shmem, float *in, float *out, int N ); where grid, block and shmem correspond to the first three arguments of the «< »> syntax.

Entry points are named systematically. For any kernel kernel(), there will be two entry points: k_kernel() and k_kernel_async(). The former is a synchronous call (blocks until kernel completes execution).

  • Re-write all of the CUDA API calls with the Swan analogues (see swan_api.h)
  • Compile and link code against the appropriate Swan library (libswan_ocl or libswan_cuda).

Examples can be found in the package.

The API is documented [[swan/api here]].

Contact

For queries, problems and suggestions, contact Matt @ ‘‘M.J.Harvey (at) ic (dot) ac (dot) uk’’.

Publications

  • [[http://www.cse.scitech.ac.uk/disco/mew20/presentations/GPU_MattHarvey.pdf Experiences porting from CUDA to OpenCL]] Presentation at the Daresbury Machine Evaluation Workshop, 2009

If publishing work that uses Swan, please cite:

  • Swan: A tool for porting CUDA programs to OpenCL, M J Harvey and G De Fabritiis, Computer Physics Communications, 182 (4) 1093-1099 (2011) [[http://dx.doi.org/10.1016/j.cpc.2010.12.052 10.1016/j.cpc.2010.12.052]]

License

Swan is licensed under the GPL version 2.

For other type of licenses and for converting legacy code please contact info@acellera.com (web: http://www.acellera.com).

SWAN API =

== General ==

All API functions are prefixed ‘'’swan’’’. All errors are fatal.

== Device Management ==

  • ’'’int swanGetDeviceCount( void )’’’ Return the number of available GPU devices.
  • ’'’void swanSetDeviceNumber( int n )’’’ Set the GPU to use. Should be called before ‘'’swanInit()’’’ or any other swan function.
  • ’'’void swanInit( void )’’’ Initialise the runtime. Its is usually not necessary to call this explicitly.
  • ’'’void swanFinalize( void )’’’ Shut down the runtime. Only really needed to tidy up messy OpenCL runtimes.
  • ’'’int swanDeviceVersion( void )’’’ Returns one of:
  • SWAN_DEVICE_OPENCL_10
  • SWAN_DEVICE_CUDA_100
  • SWAN_DEVICE_CUDA_110
  • SWAN_DEVICE_CUDA_120
  • SWAN_DEVICE_CUDA_130
  • SWAN_DEVICE_CUDA_200

== Memory Management ==

  • ’'’void* swanMalloc( size_t len )’’’ Allocate ‘‘len’’ bytes of device memory.
  • ’'’void* swanMallocHost( size_t len )’’’ Allocate ‘‘len’’ bytes of pinned host memory.
  • ’'’void swanFree( void *ptr )’’’ Free device memory allocated with ‘'’swanMalloc()’’’
  • ’'’void swanFreeHost( void *ptr )’’’ Free device memory allocated with ‘'’swanMallocHost()’’’
  • ’'’void swanMemcpyDtoH( void *ptrd, void *ptrh, size_t len )’’’ Copy ‘‘len’’ bytes from device memory region ‘‘ptrd’’ to host memory region ‘‘ptrh’’.
  • ’'’void swanMemcpyHtoD( void *ptrh, void *ptrd, size_t len )’’’ Copy ‘‘len’’ bytes from host memory region ‘‘ptrh’’ to device memory region ‘‘ptrd’’.
  • ’'’void swanMemcpyDtoD( void *ptrd, void *ptrd, size_t len )’’’ Copy ‘‘len’’ bytes from device memory region ‘‘ptrd1’’ to device memory region ‘‘ptrd2’’.
  • ’'’void* swanMallocPitch( size_t *pitch_in_bytes, size_t width_in_bytes, size_t height)’’’ Allocate an aligned 2D region of ‘‘A’‘x’‘height’’, where ‘‘A’’ is ‘‘width_in_bytes’’ rounded up to something suitable for the hardware (typically 256).

  • ’'’void swanBindToGlobal( const char *varname, size_t len, void *ptrh )’’’ Copy ‘‘len’’ bytes from host pointer ‘‘ptrh’’ to the global device variable named ‘‘varname’’. This function has static scope.

== Texturing ==

  • ’'’void swanBindToTexture1D( const char *texname, size_t width, void *ptrd, size_t typesize, int flags )’’’ Bind a texture reference to the device pointer ‘‘ptrd’’. The allocation should be ‘‘width’‘x ‘‘typesize’’ bytes. The texture reference name is extracted from the source code. ‘‘typesize’’ gives the size of the tuple size in the texture, eg ‘‘sizeof(float4)’’. Flags should be a bitwise OR of:
  • TEXTURE_FLOAT - texture contains data of type float.
  • TEXTURE_INT - texture contains data of type int.
  • TEXTURE_UINT - texture contains data of type unsigned int.
  • TEXTURE_NORMALISE - texture addressing will be normalised to between 0 and 1.
  • TEXTURE_INTERPOLATE - linear interpolation will be used. Requires TEXTURE_NORMALISE. This function has static scope.

  • ’'’void swanMakeTexture1D( const char *texname, size_t width, void *ptrh, size_t typesize, int flags )’’’ As ‘'’swanBindToTexture1D’’’, but copies data from the host memory region ‘‘ptrh’’.

  • ’'’void swanMakeTexture2D( const char *texname, size_t width, size_t height, void *ptrh, size_t typesize, int flags )’’’ Create a 2D texture reference. Arguments as per ‘'’swanBindToTexture1D’’’.

'’Note: TEXTURE_NORMALISE and TEXTURE_INTERPOLATE not yet supported’’

== Kernel Execution ==

  • ’'’int swanMaxThreadCount( void )’’’ Return the maxium number of threads that may be used in a single block.
  • ’'’int swanGetNumberOfComputeElements( void )’’’ Return the number of compute elements/multiprocessors in the device
  • ’'’void swanDecompose( block_config_t *grid, block_config_t *block, int thread_count, int threads_per_block )’’’ Create a launch configuration based on the number of threads and threads per block. ‘‘grid’’ and ‘‘block’’ will be 1D.
  • ’'’void swanSynchronize( void )’’’ Block until all asynchronous operations are completed.

All kernel entry points have the following prototype format: * ‘'’void k_’’’'’kernel_name’’’’’ ( block_config_t grid, block_config_t grid, int shmem, ‘’’ '’args,…’’ ’’’ )’’’ Blocking launch of kernel ‘‘kernel_name’’. * ‘'’void k_’’’'’kernel_name’’’'’_async ( block_config_t grid, block_config_t grid, int shmem, ‘’’ '’args,…’’ ’’’ )’’’ Non-blocking launch of kernel ‘‘kernel_name’’.

'’args’’ are the formal arguments defined in the kernel source itself. All kernel entry points have static scope.