CUDA Programming 4.2 (CUDA_C_Programming_Guide_4.2.pdf)

Development language: C/C++
Example size: 3.06M
Downloads: 18
Views: 139
Published: 2020-02-26
Category: Graphics and image processing
Publisher: Jack2020
File format: .pdf
Points required: 2
Example description

[Core code]
Table of Contents
Chapter 1. Introduction
1.1 From Graphics Processing to General-Purpose Parallel Computing
1.2 CUDA™: a General-Purpose Parallel Computing Architecture
1.3 A Scalable Programming Model
1.4 Document’s Structure
Chapter 2. Programming Model
2.1 Kernels
2.2 Thread Hierarchy
2.3 Memory Hierarchy
2.4 Heterogeneous Programming
2.5 Compute Capability
Chapter 3. Programming Interface
3.1 Compilation with NVCC
3.1.1 Compilation Workflow
3.1.1.1 Offline Compilation
3.1.1.2 Just-in-Time Compilation
3.1.2 Binary Compatibility
3.1.3 PTX Compatibility
3.1.4 Application Compatibility
3.1.5 C/C++ Compatibility
3.1.6 64-Bit Compatibility
3.2 CUDA C Runtime
3.2.1 Initialization
3.2.2 Device Memory
3.2.3 Shared Memory
3.2.4 Page-Locked Host Memory
3.2.4.1 Portable Memory
3.2.4.2 Write-Combining Memory
iv CUDA C Programming Guide Version 4.2
3.2.4.3 Mapped Memory
3.2.5 Asynchronous Concurrent Execution
3.2.5.1 Concurrent Execution between Host and Device
3.2.5.2 Overlap of Data Transfer and Kernel Execution
3.2.5.3 Concurrent Kernel Execution
3.2.5.4 Concurrent Data Transfers
3.2.5.5 Streams
3.2.5.6 Events
3.2.5.7 Synchronous Calls
3.2.6 Multi-Device System
3.2.6.1 Device Enumeration
3.2.6.2 Device Selection
3.2.6.3 Stream and Event Behavior
3.2.6.4 Peer-to-Peer Memory Access
3.2.6.5 Peer-to-Peer Memory Copy
3.2.7 Unified Virtual Address Space
3.2.8 Error Checking
3.2.9 Call Stack
3.2.10 Texture and Surface Memory
3.2.10.1 Texture Memory
3.2.10.2 Surface Memory
3.2.10.3 CUDA Arrays
3.2.10.4 Read/Write Coherency
3.2.11 Graphics Interoperability
3.2.11.1 OpenGL Interoperability
3.2.11.2 Direct3D Interoperability
3.2.11.3 SLI Interoperability
3.3 Versioning and Compatibility
3.4 Compute Modes
3.5 Mode Switches
3.6 Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
4.1 SIMT Architecture
4.2 Hardware Multithreading
Chapter 5. Performance Guidelines
5.1 Overall Performance Optimization Strategies
5.2 Maximize Utilization
5.2.1 Application Level
5.2.2 Device Level
5.2.3 Multiprocessor Level
5.3 Maximize Memory Throughput
5.3.1 Data Transfer between Host and Device
5.3.2 Device Memory Accesses
5.3.2.1 Global Memory
5.3.2.2 Local Memory
5.3.2.3 Shared Memory
5.3.2.4 Constant Memory
5.3.2.5 Texture and Surface Memory
5.4 Maximize Instruction Throughput
5.4.1 Arithmetic Instructions
5.4.2 Control Flow Instructions
5.4.3 Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
B.1 Function Type Qualifiers
B.1.1 __device__
B.1.2 __global__
B.1.3 __host__
B.1.4 __noinline__ and __forceinline__
B.2 Variable Type Qualifiers
B.2.1 __device__
B.2.2 __constant__
B.2.3 __shared__
B.2.4 __restrict__
B.3 Built-in Vector Types
B.3.1 char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, longlong1, ulonglong1, longlong2, ulonglong2, float1, float2, float3, float4, double1, double2
B.3.2 dim3
B.4 Built-in Variables
B.4.1 gridDim
B.4.2 blockIdx
B.4.3 blockDim
B.4.4 threadIdx
B.4.5 warpSize
B.5 Memory Fence Functions
B.6 Synchronization Functions
B.7 Mathematical Functions
B.8 Texture Functions
B.8.1 tex1Dfetch()
B.8.2 tex1D()
B.8.3 tex2D()
B.8.4 tex3D()
B.8.5 tex1DLayered()
B.8.6 tex2DLayered()
B.8.7 texCubemap()
B.8.8 texCubemapLayered()
B.8.9 tex2Dgather()
B.9 Surface Functions
B.9.1 surf1Dread()
B.9.2 surf1Dwrite()
B.9.3 surf2Dread()
B.9.4 surf2Dwrite()
B.9.5 surf3Dread()
B.9.6 surf3Dwrite()
B.9.7 surf1DLayeredread()
B.9.8 surf1DLayeredwrite()
B.9.9 surf2DLayeredread()
B.9.10 surf2DLayeredwrite()
B.9.11 surfCubemapread()
B.9.12 surfCubemapwrite()
B.9.13 surfCubemapLayeredread()
B.9.14 surfCubemapLayeredwrite()
B.10 Time Function
B.11 Atomic Functions
B.11.1 Arithmetic Functions
B.11.1.1 atomicAdd()
B.11.1.2 atomicSub()
B.11.1.3 atomicExch()
B.11.1.4 atomicMin()
B.11.1.5 atomicMax()
B.11.1.6 atomicInc()
B.11.1.7 atomicDec()
B.11.1.8 atomicCAS()
B.11.2 Bitwise Functions
B.11.2.1 atomicAnd()
B.11.2.2 atomicOr()
B.11.2.3 atomicXor()
B.12 Warp Vote Functions
B.13 Warp Shuffle Functions
B.13.1 Synopsis
B.13.2 Description
B.13.3 Return Value
B.13.4 Notes
B.13.5 Examples
B.13.5.1 Broadcast of a single value across a warp
B.13.5.2 Inclusive plus-scan across sub-partitions of 8 threads
B.13.5.3 Reduction across a warp
B.14 Profiler Counter Function
B.15 Assertion
B.16 Formatted Output
B.16.1 Format Specifiers
B.16.2 Limitations
B.16.3 Associated Host-Side API
B.16.4 Examples
B.17 Dynamic Global Memory Allocation
B.17.1 Heap Memory Allocation
B.17.2 Interoperability with Host Memory API
B.17.3 Examples
B.17.3.1 Per Thread Allocation
B.17.3.2 Per Thread Block Allocation
B.17.3.3 Allocation Persisting Between Kernel Launches
B.18 Execution Configuration
B.19 Launch Bounds
B.20 #pragma unroll
Appendix C. Mathematical Functions
C.1 Standard Functions
C.1.1 Single-Precision Floating-Point Functions
C.1.2 Double-Precision Floating-Point Functions
C.2 Intrinsic Functions
C.2.1 Single-Precision Floating-Point Functions
C.2.2 Double-Precision Floating-Point Functions
Appendix D. C/C++ Language Support
D.1 Code Samples
D.1.1 Data Aggregation Class
D.1.2 Derived Class
D.1.3 Class Template
D.1.4 Function Template
D.1.5 Functor Class
D.2 Restrictions
D.2.1 Qualifiers
D.2.1.1 Device Memory Qualifiers
D.2.1.2 Volatile Qualifier
D.2.2 Pointers
D.2.3 Operators
D.2.3.1 Assignment Operator
D.2.3.2 Address Operator
D.2.4 Functions
D.2.4.1 Function Parameters
D.2.4.2 Static Variables within Function
D.2.4.3 Function Pointers
D.2.4.4 Function Recursion
D.2.5 Classes
D.2.5.1 Data Members
D.2.5.2 Function Members
D.2.5.3 Constructors and Destructors
D.2.5.4 Virtual Functions
D.2.5.5 Virtual Base Classes
D.2.5.6 Windows-Specific
D.2.6 Templates
Appendix E. Texture Fetching
E.1 Nearest-Point Sampling
E.2 Linear Filtering
E.3 Table Lookup
Appendix F. Compute Capabilities
F.1 Features and Technical Specifications
F.2 Floating-Point Standard
F.3 Compute Capability 1.x
F.3.1 Architecture
F.3.2 Global Memory
F.3.2.1 Devices of Compute Capability 1.0 and 1.1
F.3.2.2 Devices of Compute Capability 1.2 and 1.3
F.3.3 Shared Memory
F.3.3.1 32-Bit Strided Access
F.3.3.2 32-Bit Broadcast Access
F.3.3.3 8-Bit and 16-Bit Access
F.3.3.4 Larger Than 32-Bit Access
F.4 Compute Capability 2.x
F.4.1 Architecture
F.4.2 Global Memory
F.4.3 Shared Memory
F.4.3.1 32-Bit Strided Access
F.4.3.2 Larger Than 32-Bit Access
F.4.4 Constant Memory
F.5 Compute Capability 3.0
F.5.1 Architecture
F.5.2 Global Memory
F.5.3 Shared Memory
F.5.3.1 64-Bit Mode
F.5.3.2 32-Bit Mode
Appendix G. Driver API
G.1 Context
G.2 Module
G.3 Kernel Execution
G.4 Interoperability between Runtime and Driver APIs
Tags: programming