CUDA Programming: A Developer's Guide to Parallel Computing with GPUs


Language: C/C++
File size: 16.57 MB
Downloads: 21
Views: 133
Published: 2022-03-29
Category: Android mobile app development
Uploaded by: 甜菜葫芦
File format: .pdf
Points required: 2

Tags: Programming develop GUIDE GUID dev


Description


[Summary] CUDA Programming: A Developer's Guide to Parallel Computing with GPUs

The original English-language edition of "CUDA Programming: A Developer's Guide to Parallel Computing with GPUs", in PDF format.


[Core code]

Contents

Preface ................................................................................................................................................xiii
CHAPTER 1 A Short History of Supercomputing................................................ 1
Introduction ................................................................................................................ 1
Von Neumann Architecture........................................................................................ 2
Cray............................................................................................................................. 5
Connection Machine................................................................................................... 6
Cell Processor............................................................................................................. 7
Multinode Computing ................................................................................................ 9
The Early Days of GPGPU Coding......................................................................... 11
The Death of the Single-Core Solution................................................................... 12
NVIDIA and CUDA................................................................................................. 13
GPU Hardware ......................................................................................................... 15
Alternatives to CUDA.............................................................................................. 16
OpenCL ............................................................................................................... 16
DirectCompute .................................................................................................... 17
CPU alternatives.................................................................................................. 17
Directives and libraries ....................................................................................... 18
Conclusion................................................................................................................ 19
CHAPTER 2 Understanding Parallelism with GPUs ......................................... 21
Introduction .............................................................................................................. 21
Traditional Serial Code ............................................................................................ 21
Serial/Parallel Problems........................................................................................... 23
Concurrency.............................................................................................................. 24
Locality................................................................................................................ 25
Types of Parallelism................................................................................................. 27
Task-based parallelism........................................................................................ 27
Data-based parallelism........................................................................................ 28
Flynn’s Taxonomy.................................................................................................... 30
Some Common Parallel Patterns.............................................................................. 31
Loop-based patterns ............................................................................................ 31
Fork/join pattern.................................................................................................. 33
Tiling/grids.......................................................................................................... 35
Divide and conquer............................................................................................. 35
Conclusion................................................................................................................ 36
CHAPTER 3 CUDA Hardware Overview........................................................... 37
PC Architecture........................................................................................................ 37
GPU Hardware ......................................................................................................... 42
CPUs and GPUs ....................................................................................................... 46
Compute Levels........................................................................................................ 46
Compute 1.0........................................................................................................ 47
Compute 1.1........................................................................................................ 47
Compute 1.2........................................................................................................ 49
Compute 1.3........................................................................................................ 49
Compute 2.0........................................................................................................ 49
Compute 2.1........................................................................................................ 51
CHAPTER 4 Setting Up CUDA........................................................................ 53
Introduction .............................................................................................................. 53
Installing the SDK under Windows......................................................................... 53
Visual Studio ............................................................................................................ 54
Projects................................................................................................................ 55
64-bit users.......................................................................................................... 55
Creating projects ................................................................................................. 57
Linux......................................................................................................................... 58
Kernel base driver installation (CentOS, Ubuntu 10.4) ..................................... 59
Mac........................................................................................................................... 62
Installing a Debugger............................................................................................... 62
Compilation Model................................................................................................... 66
Error Handling.......................................................................................................... 67
Conclusion................................................................................................................ 68
CHAPTER 5 Grids, Blocks, and Threads......................................................... 69
What it all Means..................................................................................................... 69
Threads ..................................................................................................................... 69
Problem decomposition....................................................................................... 69
How CPUs and GPUs are different.................................................................... 71
Task execution model.......................................................................................... 72
Threading on GPUs............................................................................................. 73
A peek at hardware............................................................................................. 74
CUDA kernels..................................................................................................... 77
Blocks....................................................................................................................... 78
Block arrangement.............................................................................................. 80
Grids ......................................................................................................................... 83
Stride and offset .................................................................................................. 84
X and Y thread indexes........................................................................................ 85
Warps........................................................................................................................ 91
Branching ............................................................................................................ 92
GPU utilization.................................................................................................... 93
Block Scheduling ..................................................................................................... 95
A Practical Example: Histograms.......................................................................... 97
Conclusion.............................................................................................................. 103
Questions........................................................................................................... 104
Answers............................................................................................................. 104
CHAPTER 6 Memory Handling with CUDA.................................................... 107
Introduction ............................................................................................................ 107
Caches..................................................................................................................... 108
Types of data storage ........................................................................................ 110
Register Usage........................................................................................................ 111
Shared Memory...................................................................................................... 120
Sorting using shared memory........................................................................... 121
Radix sort .......................................................................................................... 125
Merging lists...................................................................................................... 131
Parallel merging ................................................................................................ 137
Parallel reduction............................................................................................... 140
A hybrid approach............................................................................................. 144
Shared memory on different GPUs................................................................... 148
Shared memory summary ................................................................................. 148
Questions on shared memory............................................................................ 149
Answers for shared memory............................................................................. 149
Constant Memory................................................................................................... 150
Constant memory caching................................................................................. 150
Constant memory broadcast.............................................................................. 152
Constant memory updates at runtime............................................................... 162
Constant question.............................................................................................. 166
Constant answer ................................................................................................ 167
Global Memory ...................................................................................................... 167
Score boarding................................................................................................... 176
Global memory sorting ..................................................................................... 176
Sample sort........................................................................................................ 179
Questions on global memory............................................................................ 198
Answers on global memory.............................................................................. 199
Texture Memory..................................................................................................... 200
Texture caching................................................................................................. 200
Hardware manipulation of memory fetches..................................................... 200
Restrictions using textures................................................................................ 201
Conclusion.............................................................................................................. 202
CHAPTER 7 Using CUDA in Practice............................................................ 203
Introduction ............................................................................................................ 203
Serial and Parallel Code......................................................................................... 203
Design goals of CPUs and GPUs ..................................................................... 203
Algorithms that work best on the CPU versus the GPU.................................. 206
Processing Datasets................................................................................................ 209
Using ballot and other intrinsic operations....................................................... 211
Profiling .................................................................................................................. 219
An Example Using AES ........................................................................................ 231
The algorithm.................................................................................................... 232
Serial implementations of AES ........................................................................ 236
An initial kernel ................................................................................................ 239
Kernel performance........................................................................................... 244
Transfer performance........................................................................................ 248
A single streaming version ............................................................................... 249
How do we compare with the CPU.................................................................. 250
Considerations for running on other GPUs...................................................... 260
Using multiple streams...................................................................................... 263
AES summary ................................................................................................... 264
Conclusion.............................................................................................................. 265
Questions........................................................................................................... 265
Answers............................................................................................................. 265
References .............................................................................................................. 266
CHAPTER 8 Multi-CPU and Multi-GPU Solutions .......................................... 267
Introduction ............................................................................................................ 267
Locality................................................................................................................... 267
Multi-CPU Systems................................................................................................ 267
Multi-GPU Systems................................................................................................ 268
Algorithms on Multiple GPUs............................................................................... 269
Which GPU?........................................................................................................... 270
Single-Node Systems.............................................................................................. 274
Streams ................................................................................................................... 275
Multiple-Node Systems.......................................................................................... 290
Conclusion.............................................................................................................. 301
Questions........................................................................................................... 302
Answers............................................................................................................. 302
CHAPTER 9 Optimizing Your Application...................................................... 305
Strategy 1: Parallel/Serial GPU/CPU Problem Breakdown .................................. 305
Analyzing the problem...................................................................................... 305
Time................................................................................................................... 305
Problem decomposition..................................................................................... 307
Dependencies..................................................................................................... 308
Dataset size........................................................................................................ 311
Resolution.......................................................................................................... 312
Identifying the bottlenecks................................................................................ 313
Grouping the tasks for CPU and GPU.............................................................. 317
Section summary............................................................................................... 320
Strategy 2: Memory Considerations ...................................................................... 320
Memory bandwidth........................................................................................... 320
Source of limit................................................................................................... 321
Memory organization........................................................................................ 323
Memory accesses to computation ratio ............................................................ 325
Loop and kernel fusion..................................................................................... 331
Use of shared memory and cache..................................................................... 332
Section summary............................................................................................... 333
Strategy 3: Transfers .............................................................................................. 334
Pinned memory ................................................................................................. 334
Zero-copy memory............................................................................................ 338
Bandwidth limitations....................................................................................... 347
GPU timing ....................................................................................................... 351
Overlapping GPU transfers............................................................................... 356
Section summary............................................................................................... 360
Strategy 4: Thread Usage, Calculations, and Divergence..................................... 361
Thread memory patterns ................................................................................... 361
Inactive threads.................................................................................................. 364
Arithmetic density............................................................................................. 365
Some common compiler optimizations............................................................ 369
Divergence......................................................................................................... 374
Understanding the low-level assembly code .................................................... 379
Register usage ................................................................................................... 383
Section summary............................................................................................... 385
Strategy 5: Algorithms........................................................................................... 386
Sorting ............................................................................................................... 386
Reduction........................................................................................................... 392
Section summary............................................................................................... 414
Strategy 6: Resource Contentions.......................................................................... 414
Identifying bottlenecks...................................................................................... 414
Resolving bottlenecks ....................................................................................... 427
Section summary............................................................................................... 434
Strategy 7: Self-Tuning Applications..................................................................... 435
Identifying the hardware................................................................................... 436
Device utilization.............................................................................................. 437
Sampling performance...................................................................................... 438
Section summary............................................................................................... 439
Conclusion.............................................................................................................. 439
Questions on Optimization................................................................................ 439
Answers............................................................................................................. 440
CHAPTER 10 Libraries and SDK.................................................................. 441
Introduction.......................................................................................................... 441
Libraries............................................................................................................... 441
General library conventions ........................................................................... 442
NPP (Nvidia Performance Primitives)........................................................... 442
Thrust.............................................................................................................. 451
CuRAND......................................................................................................... 467
CuBLAS (CUDA basic linear algebra) library.............................................. 471
CUDA Computing SDK...................................................................................... 475
Device Query.................................................................................................. 476
Bandwidth test................................................................................................ 478
SimpleP2P....................................................................................................... 479
asyncAPI and cudaOpenMP........................................................................... 482
Aligned types.................................................................................................. 489
Directive-Based Programming ............................................................................ 491
OpenACC........................................................................................................ 492
Writing Your Own Kernels.................................................................................. 499
Conclusion ........................................................................................................... 502
CHAPTER 11 Designing GPU-Based Systems................................................ 503
Introduction.......................................................................................................... 503
CPU Processor..................................................................................................... 505
GPU Device......................................................................................................... 507
Large memory support ................................................................................... 507
ECC memory support..................................................................................... 508
Tesla compute cluster driver (TCC)............................................................... 508
Higher double-precision math........................................................................ 508
Larger memory bus width.............................................................................. 508
SMI ................................................................................................................. 509
Status LEDs.................................................................................................... 509
PCI-E Bus............................................................................................................ 509
GeForce cards...................................................................................................... 510
CPU Memory....................................................................................................... 510
Air Cooling.......................................................................................................... 512
Liquid Cooling..................................................................................................... 513
Desktop Cases and Motherboards....................................................................... 517
Mass Storage........................................................................................................ 518
Motherboard-based I/O................................................................................... 518
Dedicated RAID controllers........................................................................... 519
HDSL.............................................................................................................. 520
Mass storage requirements............................................................................. 521
Networking ..................................................................................................... 521
Power Considerations .......................................................................................... 522
Operating Systems............................................................................................... 525
Windows ......................................................................................................... 525
Linux............................................................................................................... 525
Conclusion ........................................................................................................... 526
CHAPTER 12 Common Problems, Causes, and Solutions............................... 527
Introduction.......................................................................................................... 527
Errors With CUDA Directives............................................................................. 527
CUDA error handling..................................................................................... 527
Kernel launching and bounds checking......................................................... 528
Invalid device handles.................................................................................... 529
Volatile qualifiers............................................................................................ 530
Compute level–dependent functions.............................................................. 532
Device, global, and host functions................................................................. 534
Kernels within streams................................................................................... 535
Parallel Programming Issues............................................................................... 536
Race hazards................................................................................................... 536
Synchronization.............................................................................................. 537
Atomic operations........................................................................................... 541
Algorithmic Issues............................................................................................... 544
Back-to-back testing....................................................................................... 544
Memory leaks................................................................................................. 546
Long kernels................................................................................................... 546
Finding and Avoiding Errors............................................................................... 547
How many errors does your GPU program have?......................................... 547
Divide and conquer......................................................................................... 548
Assertions and defensive programming......................................................... 549
Debug level and printing................................................................................ 551
Version control................................................................................................ 555
Developing for Future GPUs............................................................................... 555
Kepler.............................................................................................................. 555
What to think about........................................................................................ 558
Further Resources................................................................................................ 560
Introduction..................................................................................................... 560
Online courses ................................................................................................ 560
Taught courses................................................................................................ 561
Books.............................................................................................................. 562
NVIDIA CUDA certification.......................................................................... 562
Conclusion ........................................................................................................... 562
References............................................................................................................ 563
Index .................................................................................................................................................565
