
CUDA Programming: A Developer's Guide to Parallel Computing with GPUs

  • Language: C/C++
  • File size: 16.57 MB
  • Downloads: 21
  • Views: 79
  • Published: 2022-03-29
  • Category: Android mobile app development
  • Uploaded by: 甜菜葫芦
  • File format: .pdf
  • Points required: 2
 Related tags: Programming develop GUIDE GUID dev

Description

[Summary] CUDA Programming: A Developer's Guide to Parallel Computing with GPUs

The original English-language edition of the book "CUDA Programming: A Developer's Guide to Parallel Computing with GPUs", in PDF format.

[Core content]

Contents

Preface ................................................................................................................................................xiii
CHAPTER 1 A Short History of Supercomputing................................................ 1
Introduction ................................................................................................................ 1
Von Neumann Architecture........................................................................................ 2
Cray............................................................................................................................. 5
Connection Machine................................................................................................... 6
Cell Processor............................................................................................................. 7
Multinode Computing ................................................................................................ 9
The Early Days of GPGPU Coding......................................................................... 11
The Death of the Single-Core Solution................................................................... 12
NVIDIA and CUDA................................................................................................. 13
GPU Hardware ......................................................................................................... 15
Alternatives to CUDA.............................................................................................. 16
OpenCL ............................................................................................................... 16
DirectCompute .................................................................................................... 17
CPU alternatives.................................................................................................. 17
Directives and libraries ....................................................................................... 18
Conclusion................................................................................................................ 19
CHAPTER 2 Understanding Parallelism with GPUs ......................................... 21
Introduction .............................................................................................................. 21
Traditional Serial Code ............................................................................................ 21
Serial/Parallel Problems........................................................................................... 23
Concurrency.............................................................................................................. 24
Locality................................................................................................................ 25
Types of Parallelism................................................................................................. 27
Task-based parallelism........................................................................................ 27
Data-based parallelism........................................................................................ 28
Flynn’s Taxonomy.................................................................................................... 30
Some Common Parallel Patterns.............................................................................. 31
Loop-based patterns ............................................................................................ 31
Fork/join pattern.................................................................................................. 33
Tiling/grids.......................................................................................................... 35
Divide and conquer............................................................................................. 35
Conclusion................................................................................................................ 36
CHAPTER 3 CUDA Hardware Overview........................................................... 37
PC Architecture........................................................................................................ 37
GPU Hardware ......................................................................................................... 42
CPUs and GPUs ....................................................................................................... 46
Compute Levels........................................................................................................ 46
Compute 1.0........................................................................................................ 47
Compute 1.1........................................................................................................ 47
Compute 1.2........................................................................................................ 49
Compute 1.3........................................................................................................ 49
Compute 2.0........................................................................................................ 49
Compute 2.1........................................................................................................ 51
CHAPTER 4 Setting Up CUDA........................................................................ 53
Introduction .............................................................................................................. 53
Installing the SDK under Windows......................................................................... 53
Visual Studio ............................................................................................................ 54
Projects................................................................................................................ 55
64-bit users.......................................................................................................... 55
Creating projects ................................................................................................. 57
Linux......................................................................................................................... 58
Kernel base driver installation (CentOS, Ubuntu 10.4) ..................................... 59
Mac........................................................................................................................... 62
Installing a Debugger............................................................................................... 62
Compilation Model................................................................................................... 66
Error Handling.......................................................................................................... 67
Conclusion................................................................................................................ 68
CHAPTER 5 Grids, Blocks, and Threads......................................................... 69
What it all Means..................................................................................................... 69
Threads ..................................................................................................................... 69
Problem decomposition....................................................................................... 69
How CPUs and GPUs are different.................................................................... 71
Task execution model.......................................................................................... 72
Threading on GPUs............................................................................................. 73
A peek at hardware............................................................................................. 74
CUDA kernels..................................................................................................... 77
Blocks....................................................................................................................... 78
Block arrangement.............................................................................................. 80
Grids ......................................................................................................................... 83
Stride and offset .................................................................................................. 84
X and Y thread indexes........................................................................................ 85
Warps........................................................................................................................ 91
Branching ............................................................................................................ 92
GPU utilization.................................................................................................... 93
Block Scheduling ..................................................................................................... 95
A Practical Example: Histograms.......................................................................... 97
Conclusion.............................................................................................................. 103
Questions........................................................................................................... 104
Answers............................................................................................................. 104
CHAPTER 6 Memory Handling with CUDA.................................................... 107
Introduction ............................................................................................................ 107
Caches..................................................................................................................... 108
Types of data storage ........................................................................................ 110
Register Usage........................................................................................................ 111
Shared Memory...................................................................................................... 120
Sorting using shared memory........................................................................... 121
Radix sort .......................................................................................................... 125
Merging lists...................................................................................................... 131
Parallel merging ................................................................................................ 137
Parallel reduction............................................................................................... 140
A hybrid approach............................................................................................. 144
Shared memory on different GPUs................................................................... 148
Shared memory summary ................................................................................. 148
Questions on shared memory............................................................................ 149
Answers for shared memory............................................................................. 149
Constant Memory................................................................................................... 150
Constant memory caching................................................................................. 150
Constant memory broadcast.............................................................................. 152
Constant memory updates at runtime............................................................... 162
Constant question.............................................................................................. 166
Constant answer ................................................................................................ 167
Global Memory ...................................................................................................... 167
Score boarding................................................................................................... 176
Global memory sorting ..................................................................................... 176
Sample sort........................................................................................................ 179
Questions on global memory............................................................................ 198
Answers on global memory.............................................................................. 199
Texture Memory..................................................................................................... 200
Texture caching................................................................................................. 200
Hardware manipulation of memory fetches..................................................... 200
Restrictions using textures................................................................................ 201
Conclusion.............................................................................................................. 202
CHAPTER 7 Using CUDA in Practice............................................................ 203
Introduction ............................................................................................................ 203
Serial and Parallel Code......................................................................................... 203
Design goals of CPUs and GPUs ..................................................................... 203
Algorithms that work best on the CPU versus the GPU.................................. 206
Processing Datasets................................................................................................ 209
Using ballot and other intrinsic operations....................................................... 211
Profiling .................................................................................................................. 219
An Example Using AES ........................................................................................ 231
The algorithm.................................................................................................... 232
Serial implementations of AES ........................................................................ 236
An initial kernel ................................................................................................ 239
Kernel performance........................................................................................... 244
Transfer performance........................................................................................ 248
A single streaming version ............................................................................... 249
How do we compare with the CPU.................................................................. 250
Considerations for running on other GPUs...................................................... 260
Using multiple streams...................................................................................... 263
AES summary ................................................................................................... 264
Conclusion.............................................................................................................. 265
Questions........................................................................................................... 265
Answers............................................................................................................. 265
References .............................................................................................................. 266
CHAPTER 8 Multi-CPU and Multi-GPU Solutions .......................................... 267
Introduction ............................................................................................................ 267
Locality................................................................................................................... 267
Multi-CPU Systems................................................................................................ 267
Multi-GPU Systems................................................................................................ 268
Algorithms on Multiple GPUs............................................................................... 269
Which GPU?........................................................................................................... 270
Single-Node Systems.............................................................................................. 274
Streams ................................................................................................................... 275
Multiple-Node Systems.......................................................................................... 290
Conclusion.............................................................................................................. 301
Questions........................................................................................................... 302
Answers............................................................................................................. 302
CHAPTER 9 Optimizing Your Application...................................................... 305
Strategy 1: Parallel/Serial GPU/CPU Problem Breakdown .................................. 305
Analyzing the problem...................................................................................... 305
Time................................................................................................................... 305
Problem decomposition..................................................................................... 307
Dependencies..................................................................................................... 308
Dataset size........................................................................................................ 311
Resolution.......................................................................................................... 312
Identifying the bottlenecks................................................................................ 313
Grouping the tasks for CPU and GPU.............................................................. 317
Section summary............................................................................................... 320
Strategy 2: Memory Considerations ...................................................................... 320
Memory bandwidth........................................................................................... 320
Source of limit................................................................................................... 321
Memory organization........................................................................................ 323
Memory accesses to computation ratio ............................................................ 325
Loop and kernel fusion..................................................................................... 331
Use of shared memory and cache..................................................................... 332
Section summary............................................................................................... 333
Strategy 3: Transfers .............................................................................................. 334
Pinned memory ................................................................................................. 334
Zero-copy memory............................................................................................ 338
Bandwidth limitations....................................................................................... 347
GPU timing ....................................................................................................... 351
Overlapping GPU transfers............................................................................... 356
Section summary............................................................................................... 360
Strategy 4: Thread Usage, Calculations, and Divergence..................................... 361
Thread memory patterns ................................................................................... 361
Inactive threads.................................................................................................. 364
Arithmetic density............................................................................................. 365
Some common compiler optimizations............................................................ 369
Divergence......................................................................................................... 374
Understanding the low-level assembly code .................................................... 379
Register usage ................................................................................................... 383
Section summary............................................................................................... 385
Strategy 5: Algorithms........................................................................................... 386
Sorting ............................................................................................................... 386
Reduction........................................................................................................... 392
Section summary............................................................................................... 414
Strategy 6: Resource Contentions.......................................................................... 414
Identifying bottlenecks...................................................................................... 414
Resolving bottlenecks ....................................................................................... 427
Section summary............................................................................................... 434
Strategy 7: Self-Tuning Applications..................................................................... 435
Identifying the hardware................................................................................... 436
Device utilization.............................................................................................. 437
Sampling performance...................................................................................... 438
Section summary............................................................................................... 439
Conclusion.............................................................................................................. 439
Questions on Optimization................................................................................ 439
Answers............................................................................................................. 440
CHAPTER 10 Libraries and SDK.................................................................. 441
Introduction.......................................................................................................... 441
Libraries............................................................................................................... 441
General library conventions ........................................................................... 442
NPP (Nvidia Performance Primitives)........................................................... 442
Thrust.............................................................................................................. 451
CuRAND......................................................................................................... 467
CuBLAS (CUDA basic linear algebra) library.............................................. 471
CUDA Computing SDK...................................................................................... 475
Device Query.................................................................................................. 476
Bandwidth test................................................................................................ 478
SimpleP2P....................................................................................................... 479
asyncAPI and cudaOpenMP........................................................................... 482
Aligned types.................................................................................................. 489
Directive-Based Programming ............................................................................ 491
OpenACC........................................................................................................ 492
Writing Your Own Kernels.................................................................................. 499
Conclusion ........................................................................................................... 502
CHAPTER 11 Designing GPU-Based Systems................................................ 503
Introduction.......................................................................................................... 503
CPU Processor..................................................................................................... 505
GPU Device......................................................................................................... 507
Large memory support ................................................................................... 507
ECC memory support..................................................................................... 508
Tesla compute cluster driver (TCC)............................................................... 508
Higher double-precision math........................................................................ 508
Larger memory bus width.............................................................................. 508
SMI ................................................................................................................. 509
Status LEDs.................................................................................................... 509
PCI-E Bus............................................................................................................ 509
GeForce cards...................................................................................................... 510
CPU Memory....................................................................................................... 510
Air Cooling.......................................................................................................... 512
Liquid Cooling..................................................................................................... 513
Desktop Cases and Motherboards....................................................................... 517
Mass Storage........................................................................................................ 518
Motherboard-based I/O................................................................................... 518
Dedicated RAID controllers........................................................................... 519
HDSL.............................................................................................................. 520
Mass storage requirements............................................................................. 521
Networking ..................................................................................................... 521
Power Considerations .......................................................................................... 522
Operating Systems............................................................................................... 525
Windows ......................................................................................................... 525
Linux............................................................................................................... 525
Conclusion ........................................................................................................... 526
CHAPTER 12 Common Problems, Causes, and Solutions............................... 527
Introduction.......................................................................................................... 527
Errors With CUDA Directives............................................................................. 527
CUDA error handling..................................................................................... 527
Kernel launching and bounds checking......................................................... 528
Invalid device handles.................................................................................... 529
Volatile qualifiers............................................................................................ 530
Compute level–dependent functions.............................................................. 532
Device, global, and host functions................................................................. 534
Kernels within streams................................................................................... 535
Parallel Programming Issues............................................................................... 536
Race hazards................................................................................................... 536
Synchronization.............................................................................................. 537
Atomic operations........................................................................................... 541
Algorithmic Issues............................................................................................... 544
Back-to-back testing....................................................................................... 544
Memory leaks................................................................................................. 546
Long kernels................................................................................................... 546
Finding and Avoiding Errors............................................................................... 547
How many errors does your GPU program have?......................................... 547
Divide and conquer......................................................................................... 548
Assertions and defensive programming......................................................... 549
Debug level and printing................................................................................ 551
Version control................................................................................................ 555
Developing for Future GPUs............................................................................... 555
Kepler.............................................................................................................. 555
What to think about........................................................................................ 558
Further Resources................................................................................................ 560
Introduction..................................................................................................... 560
Online courses ................................................................................................ 560
Taught courses................................................................................................ 561
Books.............................................................................................................. 562
NVIDIA CUDA certification.......................................................................... 562
Conclusion ........................................................................................................... 562
References............................................................................................................ 563
Index .................................................................................................................................................565
