CUDA Fortran for Scientists and Engineers : Best Practices for Efficient CUDA Fortran Programming.

By: Fatica, Massimiliano
Contributor(s): Ruetsch, Gregory
Publisher: San Francisco : Elsevier Science & Technology, 2013
Copyright date: ©2014
Description: 1 online resource (339 pages)
Content type: text
Media type: computer
Carrier type: online resource
ISBN: 9780124169722
Subject(s): FORTRAN (Computer program language)
Genre/Form: Electronic books.
Additional physical formats: Print version: CUDA Fortran for Scientists and Engineers : Best Practices for Efficient CUDA Fortran Programming
DDC classification: 005.131
LOC classification: QA76.73.F25.R833 20
Contents:
Intro -- Half Title -- Title Page -- Copyright -- Dedication -- Contents -- Acknowledgments -- Preface -- PART I: CUDA Fortran Programming -- 1 Introduction -- 1.1 A Brief History of GPU Computing -- 1.2 Parallel computation -- 1.3 Basic Concepts -- 1.3.1 A first CUDA Fortran program -- 1.3.2 Extending to larger arrays -- 1.3.3 Multidimensional arrays -- 1.4 Determining CUDA Hardware Features and Limits -- 1.4.1 Single and double precision -- 1.4.1.1 Accommodating variable precision -- 1.5 Error Handling -- 1.6 Compiling CUDA Fortran Code -- 1.6.1 Separate compilation -- 2 Performance Measurement and Metrics -- 2.1 Measuring kernel execution time -- 2.1.1 Host-device synchronization and CPU timers -- 2.1.2 Timing via CUDA events -- 2.1.3 Command Line Profiler -- 2.1.4 The nvprof profiling tool -- 2.2 Instruction, bandwidth, and latency bound kernels -- 2.3 Memory bandwidth -- 2.3.1 Theoretical peak bandwidth -- 2.3.2 Effective bandwidth -- 2.3.3 Actual data throughput vs. effective bandwidth -- 3 Optimization -- 3.1 Transfers between host and device -- 3.1.1 Pinned memory -- 3.1.2 Batching small data transfers -- 3.1.2.1 Explicit transfers using cudaMemcpy() -- 3.1.3 Asynchronous data transfers (advanced topic) -- 3.1.3.1 Hyper-Q -- 3.1.3.2 Profiling asynchronous events -- 3.2 Device memory -- 3.2.1 Declaring data in device code -- 3.2.2 Coalesced access to global memory -- 3.2.2.1 Misaligned access -- 3.2.2.2 Strided access -- 3.2.3 Texture memory -- 3.2.4 Local memory -- 3.2.4.1 Detecting local memory use (advanced topic) -- 3.2.5 Constant memory -- 3.2.5.1 Detecting constant memory use (advanced topic) -- 3.3 On-chip memory -- 3.3.1 L1 cache -- 3.3.2 Registers -- 3.3.3 Shared memory -- 3.3.3.1 Detecting shared memory usage (advanced topic) -- 3.3.3.2 Shared memory bank conflicts -- 3.4 Memory optimization example: matrix transpose.
3.4.1 Partition camping (advanced topic) -- 3.4.1.1 Diagonal reordering -- 3.5 Execution configuration -- 3.5.1 Thread-level parallelism -- 3.5.1.1 Shared memory -- 3.5.2 Instruction-level parallelism -- 3.6 Instruction optimization -- 3.6.1 Device intrinsics -- 3.6.1.1 Directed rounding -- 3.6.1.2 C intrinsics -- 3.6.1.3 Fast math intrinsics -- 3.6.2 Compiler options -- 3.6.3 Divergent warps -- 3.7 Kernel loop directives -- 3.7.1 Reductions in CUF kernels -- 3.7.2 Streams in CUF kernels -- 3.7.3 Instruction-level parallelism in CUF kernels -- 4 Multi-GPU Programming -- 4.1 CUDA multi-GPU features -- 4.1.1 Peer-to-peer communication -- 4.1.1.1 Requirements for peer-to-peer communication -- 4.1.2 Peer-to-peer direct transfers -- 4.1.3 Peer-to-peer transpose -- 4.2 Multi-GPU Programming with MPI -- 4.2.1 Assigning devices to MPI ranks -- 4.2.2 MPI transpose -- 4.2.3 GPU-aware MPI transpose -- PART II: Case Studies -- 5 Monte Carlo Method -- 5.1 CURAND -- 5.2 Computing π with CUF kernels -- 5.2.1 IEEE-754 precision (advanced topic) -- 5.3 Computing π with reduction kernels -- 5.3.1 Reductions with atomic locks (advanced topic) -- 5.4 Accuracy of summation -- 5.5 Option pricing -- 6 Finite Difference Method -- 6.1 Nine-Point 1D finite difference stencil -- 6.1.1 Data reuse and shared memory -- 6.1.2 The x-derivative kernel -- 6.1.2.1 Performance of the x-derivative kernel -- 6.1.3 Derivatives in y and z -- 6.1.3.1 Leveraging transpose -- 6.1.4 Nonuniform grids -- 6.2 2D Laplace equation -- 7 Applications of Fast Fourier Transform -- 7.1 CUFFT -- 7.2 Spectral derivatives -- 7.3 Convolution -- 7.4 Poisson Solver -- PART III: Appendices -- Appendix A: Tesla Specifications -- Appendix B: System and Environment Management -- B.1 Environment variables -- B.1.1 General -- B.1.2 Command Line Profiler -- B.1.3 Just-in-time compilation.
B.2 nvidia-smi System Management Interface -- B.2.1 Enabling and disabling ECC -- B.2.2 Compute mode -- B.2.3 Persistence mode -- Appendix C: Calling CUDA C from CUDA Fortran -- C.1 Calling CUDA C libraries -- C.2 Calling User-Written CUDA C Code -- Appendix D: Source Code -- D.1 Texture memory -- D.2 Matrix transpose -- D.3 Thread- and instruction-level parallelism -- D.4 Multi-GPU programming -- D.4.1 Peer-to-peer transpose -- D.4.2 MPI transpose with host MPI transfers -- D.4.3 MPI transpose with device MPI transfers -- D.5 Finite difference code -- D.6 Spectral Poisson Solver -- References -- Index -- A -- B -- C -- D -- E -- F -- G -- H -- I -- J -- K -- L -- M -- N -- O -- P -- R -- S -- T -- U -- V -- W.
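For orientation while browsing the contents above, the following is a minimal sketch, not taken from the book, of the style of program that section 1.3.1 ("A first CUDA Fortran program") introduces. The module and routine names (simpleOps_m, increment) are illustrative only, and the code assumes a CUDA Fortran compiler such as the NVIDIA (formerly PGI) HPC Fortran compiler.

```fortran
! Minimal CUDA Fortran sketch: increment an array on the GPU.
module simpleOps_m
contains
  ! attributes(global) marks this subroutine as a kernel executed on the device
  attributes(global) subroutine increment(a, b)
    implicit none
    integer, intent(inout) :: a(:)
    integer, value :: b
    integer :: i
    ! Map each thread to one array element (thread indices are 1-based)
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    if (i <= size(a)) a(i) = a(i) + b
  end subroutine increment
end module simpleOps_m

program incrementTest
  use cudafor
  use simpleOps_m
  implicit none
  integer, parameter :: n = 1024*1024
  integer :: a(n), b
  integer, device :: a_d(n)   ! array resident in device memory
  a = 1
  b = 3
  a_d = a                     ! host-to-device copy via assignment
  ! Launch enough 256-thread blocks to cover all n elements
  call increment<<<ceiling(real(n)/256), 256>>>(a_d, b)
  a = a_d                     ! device-to-host copy
  if (all(a == 4)) then
    print *, 'Test Passed'
  else
    print *, 'Test Failed'
  end if
end program incrementTest
```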
Summary: CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience and cover the basics along with best practices for efficient GPU computing using CUDA Fortran. To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance. All of this is done in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code against them.
- Leverage the power of GPU computing with PGI's CUDA Fortran compiler
- Gain insights from members of the CUDA Fortran language development team
- Includes multi-GPU programming in CUDA Fortran, covering both peer-to-peer and message passing interface (MPI) approaches
- Includes full source code for all the examples and several case studies
- Download source code and slides from the book's companion website
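The summary's emphasis on immediately evaluating performance corresponds to the timing techniques of Chapter 2 (e.g., section 2.1.2, "Timing via CUDA events"). As an illustration, not code from the book, this is the general shape of event-based timing in CUDA Fortran; the device assignment is a stand-in for whatever kernel launch or transfer is being measured:

```fortran
program eventTiming
  use cudafor
  implicit none
  type(cudaEvent) :: startEvent, stopEvent
  real :: time_ms
  integer :: istat
  real, device :: a_d(1024)   ! illustrative device data for the timed region

  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  istat = cudaEventRecord(startEvent, 0)   ! record start in the default stream
  a_d = 0.0                                ! stand-in for the operation being timed
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)  ! block host until the stop event completes
  istat = cudaEventElapsedTime(time_ms, startEvent, stopEvent)
  print *, 'Elapsed time (ms):', time_ms

  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)
end program eventTiming
```

Because the events are recorded on the GPU itself, this approach avoids the host-device synchronization pitfalls of CPU timers that the book discusses in section 2.1.1.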
Holdings

Item type | Current library    | Call number | Status    | Date due | Barcode       | Item holds
Ebrary    | Ebrary Afghanistan |             | Available |          | EBKAF00084973 |
Ebrary    | Ebrary Algeria     |             | Available |          |               |
Ebrary    | Ebrary Cyprus      |             | Available |          |               |
Ebrary    | Ebrary Egypt       |             | Available |          |               |
Ebrary    | Ebrary Libya       |             | Available |          |               |
Ebrary    | Ebrary Morocco     |             | Available |          |               |
Ebrary    | Ebrary Nepal       |             | Available |          | EBKNP00084973 |
Ebrary    | Ebrary Sudan       |             | Available |          |               |
Ebrary    | Ebrary Tunisia     |             | Available |          |               |

Total holds: 0

Description based on publisher-supplied metadata and other sources.

Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2019. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
