Example Introduction
[Example Screenshot]
[Core Code]
TABLE OF CONTENTS
Chapter 1. Introduction
1.1. From Graphics Processing to General Purpose Parallel Computing
1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
1.3. A Scalable Programming Model
1.4. Document Structure
Chapter 2. Programming Model
2.1. Kernels
2.2. Thread Hierarchy
2.3. Memory Hierarchy
2.4. Heterogeneous Programming
2.5. Compute Capability
Chapter 3. Programming Interface
3.1. Compilation with NVCC
3.1.1. Compilation Workflow
3.1.1.1. Offline Compilation
3.1.1.2. Just-in-Time Compilation
3.1.2. Binary Compatibility
3.1.3. PTX Compatibility
3.1.4. Application Compatibility
3.1.5. C++ Compatibility
3.1.6. 64-Bit Compatibility
3.2. CUDA Runtime
3.2.1. Initialization
3.2.2. Device Memory
3.2.3. Shared Memory
3.2.4. Page-Locked Host Memory
3.2.4.1. Portable Memory
3.2.4.2. Write-Combining Memory
3.2.4.3. Mapped Memory
3.2.5. Asynchronous Concurrent Execution
3.2.5.1. Concurrent Execution between Host and Device
3.2.5.2. Concurrent Kernel Execution
3.2.5.3. Overlap of Data Transfer and Kernel Execution
3.2.5.4. Concurrent Data Transfers
3.2.5.5. Streams
3.2.5.6. Graphs
3.2.5.7. Events
3.2.5.8. Synchronous Calls
3.2.6. Multi-Device System
3.2.6.1. Device Enumeration
www.nvidia.com CUDA C++ Programming Guide PG-02829-001_v10.2
3.2.6.2. Device Selection
3.2.6.3. Stream and Event Behavior
3.2.6.4. Peer-to-Peer Memory Access
3.2.6.5. Peer-to-Peer Memory Copy
3.2.7. Unified Virtual Address Space
3.2.8. Interprocess Communication
3.2.9. Error Checking
3.2.10. Call Stack
3.2.11. Texture and Surface Memory
3.2.11.1. Texture Memory
3.2.11.2. Surface Memory
3.2.11.3. CUDA Arrays
3.2.11.4. Read/Write Coherency
3.2.12. Graphics Interoperability
3.2.12.1. OpenGL Interoperability
3.2.12.2. Direct3D Interoperability
3.2.12.3. SLI Interoperability
3.3. External Resource Interoperability
3.3.1. Vulkan Interoperability
3.3.1.1. Matching device UUIDs
3.3.1.2. Importing memory objects
3.3.1.3. Mapping buffers onto imported memory objects
3.3.1.4. Mapping mipmapped arrays onto imported memory objects
3.3.1.5. Importing synchronization objects
3.3.1.6. Signaling/waiting on imported synchronization objects
3.3.2. OpenGL Interoperability
3.3.3. Direct3D 12 Interoperability
3.3.3.1. Matching device LUIDs
3.3.3.2. Importing memory objects
3.3.3.3. Mapping buffers onto imported memory objects
3.3.3.4. Mapping mipmapped arrays onto imported memory objects
3.3.3.5. Importing synchronization objects
3.3.3.6. Signaling/waiting on imported synchronization objects
3.3.4. Direct3D 11 Interoperability
3.3.4.1. Matching device LUIDs
3.3.4.2. Importing memory objects
3.3.4.3. Mapping buffers onto imported memory objects
3.3.4.4. Mapping mipmapped arrays onto imported memory objects
3.3.4.5. Importing synchronization objects
3.3.4.6. Signaling/waiting on imported synchronization objects
3.3.5. NVIDIA Software Communication Interface Interoperability (NVSCI)
3.3.5.1. Importing memory objects
3.3.5.2. Mapping buffers onto imported memory objects
3.3.5.3. Mapping mipmapped arrays onto imported memory objects
3.3.5.4. Importing synchronization objects
3.3.5.5. Signaling/waiting on imported synchronization objects
3.4. Versioning and Compatibility
3.5. Compute Modes
3.6. Mode Switches
3.7. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
4.1. SIMT Architecture
4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
5.1. Overall Performance Optimization Strategies
5.2. Maximize Utilization
5.2.1. Application Level
5.2.2. Device Level
5.2.3. Multiprocessor Level
5.2.3.1. Occupancy Calculator
5.3. Maximize Memory Throughput
5.3.1. Data Transfer between Host and Device
5.3.2. Device Memory Accesses
5.4. Maximize Instruction Throughput
5.4.1. Arithmetic Instructions
5.4.2. Control Flow Instructions
5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C++ Language Extensions
B.1. Function Execution Space Specifiers
B.1.1. __global__
B.1.2. __device__
B.1.3. __host__
B.1.4. __noinline__ and __forceinline__
B.2. Variable Memory Space Specifiers
B.2.1. __device__
B.2.2. __constant__
B.2.3. __shared__
B.2.4. __managed__
B.2.5. __restrict__
B.3. Built-in Vector Types
B.3.1. char, short, int, long, longlong, float, double
B.3.2. dim3
B.4. Built-in Variables
B.4.1. gridDim
B.4.2. blockIdx
B.4.3. blockDim
B.4.4. threadIdx
B.4.5. warpSize
B.5. Memory Fence Functions
B.6. Synchronization Functions
B.7. Mathematical Functions
B.8. Texture Functions
B.8.1. Texture Object API
B.8.1.1. tex1Dfetch()
B.8.1.2. tex1D()
B.8.1.3. tex1DLod()
B.8.1.4. tex1DGrad()
B.8.1.5. tex2D()
B.8.1.6. tex2DLod()
B.8.1.7. tex2DGrad()
B.8.1.8. tex3D()
B.8.1.9. tex3DLod()
B.8.1.10. tex3DGrad()
B.8.1.11. tex1DLayered()
B.8.1.12. tex1DLayeredLod()
B.8.1.13. tex1DLayeredGrad()
B.8.1.14. tex2DLayered()
B.8.1.15. tex2DLayeredLod()
B.8.1.16. tex2DLayeredGrad()
B.8.1.17. texCubemap()
B.8.1.18. texCubemapLod()
B.8.1.19. texCubemapLayered()
B.8.1.20. texCubemapLayeredLod()
B.8.1.21. tex2Dgather()
B.8.2. Texture Reference API
B.8.2.1. tex1Dfetch()
B.8.2.2. tex1D()
B.8.2.3. tex1DLod()
B.8.2.4. tex1DGrad()
B.8.2.5. tex2D()
B.8.2.6. tex2DLod()
B.8.2.7. tex2DGrad()
B.8.2.8. tex3D()
B.8.2.9. tex3DLod()
B.8.2.10. tex3DGrad()
B.8.2.11. tex1DLayered()
B.8.2.12. tex1DLayeredLod()
B.8.2.13. tex1DLayeredGrad()
B.8.2.14. tex2DLayered()
B.8.2.15. tex2DLayeredLod()
B.8.2.16. tex2DLayeredGrad()
B.8.2.17. texCubemap()
B.8.2.18. texCubemapLod()
B.8.2.19. texCubemapLayered()
B.8.2.20. texCubemapLayeredLod()
B.8.2.21. tex2Dgather()
B.9. Surface Functions
B.9.1. Surface Object API
B.9.1.1. surf1Dread()
B.9.1.2. surf1Dwrite
B.9.1.3. surf2Dread()
B.9.1.4. surf2Dwrite()
B.9.1.5. surf3Dread()
B.9.1.6. surf3Dwrite()
B.9.1.7. surf1DLayeredread()
B.9.1.8. surf1DLayeredwrite()
B.9.1.9. surf2DLayeredread()
B.9.1.10. surf2DLayeredwrite()
B.9.1.11. surfCubemapread()
B.9.1.12. surfCubemapwrite()
B.9.1.13. surfCubemapLayeredread()
B.9.1.14. surfCubemapLayeredwrite()
B.9.2. Surface Reference API
B.9.2.1. surf1Dread()
B.9.2.2. surf1Dwrite
B.9.2.3. surf2Dread()
B.9.2.4. surf2Dwrite()
B.9.2.5. surf3Dread()
B.9.2.6. surf3Dwrite()
B.9.2.7. surf1DLayeredread()
B.9.2.8. surf1DLayeredwrite()
B.9.2.9. surf2DLayeredread()
B.9.2.10. surf2DLayeredwrite()
B.9.2.11. surfCubemapread()
B.9.2.12. surfCubemapwrite()
B.9.2.13. surfCubemapLayeredread()
B.9.2.14. surfCubemapLayeredwrite()
B.10. Read-Only Data Cache Load Function
B.11. Time Function
B.12. Atomic Functions
B.12.1. Arithmetic Functions
B.12.1.1. atomicAdd()
B.12.1.2. atomicSub()
B.12.1.3. atomicExch()
B.12.1.4. atomicMin()
B.12.1.5. atomicMax()
B.12.1.6. atomicInc()
B.12.1.7. atomicDec()
B.12.1.8. atomicCAS()
B.12.2. Bitwise Functions
B.12.2.1. atomicAnd()
B.12.2.2. atomicOr()
B.12.2.3. atomicXor()
B.13. Address Space Predicate Functions
B.13.1. __isGlobal()
B.13.2. __isShared()
B.13.3. __isConstant()
B.13.4. __isLocal()
B.14. Warp Vote Functions
B.15. Warp Match Functions
B.15.1. Synopsis
B.15.2. Description
B.16. Warp Shuffle Functions
B.16.1. Synopsis
B.16.2. Description
B.16.3. Notes
B.16.4. Examples
B.16.4.1. Broadcast of a single value across a warp
B.16.4.2. Inclusive plus-scan across sub-partitions of 8 threads
B.16.4.3. Reduction across a warp
B.17. Warp matrix functions
B.17.1. Description
B.17.2. Sub-byte Operations
B.17.3. Restrictions
B.17.4. Element Types & Matrix Sizes
B.17.5. Example
B.18. Profiler Counter Function
B.19. Assertion
B.20. Formatted Output
B.20.1. Format Specifiers
B.20.2. Limitations
B.20.3. Associated Host-Side API
B.20.4. Examples
B.21. Dynamic Global Memory Allocation and Operations
B.21.1. Heap Memory Allocation
B.21.2. Interoperability with Host Memory API
B.21.3. Examples
B.21.3.1. Per Thread Allocation
B.21.3.2. Per Thread Block Allocation
B.21.3.3. Allocation Persisting Between Kernel Launches
B.22. Execution Configuration
B.23. Launch Bounds
B.24. #pragma unroll
B.25. SIMD Video Instructions
Appendix C. Cooperative Groups
C.1. Introduction
C.2. Intra-block Groups
C.2.1. Thread Groups and Thread Blocks
C.2.2. Tiled Partitions
C.2.3. Thread Block Tiles
C.2.4. Coalesced Groups
C.2.5. Uses of Intra-block Cooperative Groups
C.2.5.1. Discovery Pattern
C.2.5.2. Warp-Synchronous Code Pattern
C.2.5.3. Composition
C.3. Grid Synchronization
C.4. Multi-Device Synchronization
Appendix D. CUDA Dynamic Parallelism
D.1. Introduction
D.1.1. Overview
D.1.2. Glossary
D.2. Execution Environment and Memory Model
D.2.1. Execution Environment
D.2.1.1. Parent and Child Grids
D.2.1.2. Scope of CUDA Primitives
D.2.1.3. Synchronization
D.2.1.4. Streams and Events
D.2.1.5. Ordering and Concurrency
D.2.1.6. Device Management
D.2.2. Memory Model
D.2.2.1. Coherence and Consistency
D.3. Programming Interface
D.3.1. CUDA C++ Reference
D.3.1.1. Device-Side Kernel Launch
D.3.1.2. Streams
D.3.1.3. Events
D.3.1.4. Synchronization
D.3.1.5. Device Management
D.3.1.6. Memory Declarations
D.3.1.7. API Errors and Launch Failures
D.3.1.8. API Reference
D.3.2. Device-side Launch from PTX
D.3.2.1. Kernel Launch APIs
D.3.2.2. Parameter Buffer Layout
D.3.3. Toolkit Support for Dynamic Parallelism
D.3.3.1. Including Device Runtime API in CUDA Code
D.3.3.2. Compiling and Linking
D.4. Programming Guidelines
D.4.1. Basics
D.4.2. Performance
D.4.2.1. Synchronization
D.4.2.2. Dynamic-parallelism-enabled Kernel Overhead
D.4.3. Implementation Restrictions and Limitations
D.4.3.1. Runtime
Appendix E. Mathematical Functions
E.1. Standard Functions
E.2. Intrinsic Functions
Appendix F. C++ Language Support
F.1. C++11 Language Features
F.2. C++14 Language Features
F.3. Restrictions
F.3.1. Host Compiler Extensions
F.3.2. Preprocessor Symbols
F.3.2.1. __CUDA_ARCH__
F.3.3. Qualifiers
F.3.3.1. Device Memory Space Specifiers
F.3.3.2. __managed__ Memory Space Specifier
F.3.3.3. Volatile Qualifier
F.3.4. Pointers
F.3.5. Operators
F.3.5.1. Assignment Operator
F.3.5.2. Address Operator
F.3.6. Run Time Type Information (RTTI)
F.3.7. Exception Handling
F.3.8. Standard Library
F.3.9. Functions
F.3.9.1. External Linkage
F.3.9.2. Implicitly-declared and explicitly-defaulted functions
F.3.9.3. Function Parameters
F.3.9.4. Static Variables within Function
F.3.9.5. Function Pointers
F.3.9.6. Function Recursion
F.3.9.7. Friend Functions
F.3.9.8.
Operator Function........................................................................... 238 F.3.10. Classes............................................................................................. 238 F.3.10.1. Data Members...............................................................................238 F.3.10.2. Function Members..........................................................................238 F.3.10.3. Virtual Functions........................................................................... 238 F.3.10.4. Virtual Base Classes........................................................................239 F.3.10.5. Anonymous Unions......................................................................... 239 F.3.10.6. Windows-Specific........................................................................... 239 F.3.11. Templates......................................................................................... 240 F.3.12. Trigraphs and Digraphs..........................................................................240 F.3.13. Const-qualified variables....................................................................... 241 F.3.14. Long Double...................................................................................... 241 F.3.15. Deprecation Annotation........................................................................ 241 F.3.16. C 11 Features...................................................................................242 F.3.16.1. Lambda Expressions........................................................................242 F.3.16.2. std::initializer_list..........................................................................243 F.3.16.3. Rvalue references.......................................................................... 244 F.3.16.4. Constexpr functions and function templates.......................................... 244 F.3.16.5. 
Constexpr variables........................................................................ 244 F.3.16.6. Inline namespaces..........................................................................245 F.3.16.7. thread_local................................................................................. 246 F.3.16.8. __global__ functions and function templates......................................... 246 F.3.16.9. __device__/__constant__/__shared__ variables...................................... 248 F.3.16.10. Defaulted functions.......................................................................248 F.3.17. C 14 Features...................................................................................249 F.3.17.1. Functions with deduced return type....................................................249 F.3.17.2. Variable templates......................................................................... 250 F.4. Polymorphic Function Wrappers..................................................................... 250 F.5. Extended Lambdas..................................................................................... 254 F.5.1. Extended Lambda Type Traits...................................................................255 F.5.2. Extended Lambda Restrictions..................................................................256 F.5.3. Notes on __host__ __device__ lambdas.......................................................265 F.5.4. *this Capture By Value........................................................................... 265 F.5.5. Additional Notes...................................................................................268 F.6. Code Samples........................................................................................... 270 F.6.1. Data Aggregation Class...........................................................................270 F.6.2. 
Derived Class...................................................................................... 270 F.6.3. Class Template.....................................................................................271 F.6.4. Function Template................................................................................ 271 F.6.5. Functor Class...................................................................................... 272 www.nvidia.com CUDA C Programming Guide PG-02829-001_v10.2 | xii Appendix G. Texture Fetching..............................................................................273 G.1. Nearest-Point Sampling............................................................................... 273 G.2. Linear Filtering........................................................................................ 274 G.3. Table Lookup........................................................................................... 275 Appendix H. Compute Capabilities........................................................................ 277 H.1. Features and Technical Specifications............................................................. 277 H.2. Floating-Point Standard...............................................................................281 H.3. Compute Capability 3.x.............................................................................. 282 H.3.1. Architecture....................................................................................... 282 H.3.2. Global Memory....................................................................................283 H.3.3. Shared Memory................................................................................... 285 H.4. Compute Capability 5.x.............................................................................. 286 H.4.1. Architecture....................................................................................... 286 H.4.2. 
Global Memory....................................................................................287 H.4.3. Shared Memory................................................................................... 287 H.5. Compute Capability 6.x.............................................................................. 291 H.5.1. Architecture....................................................................................... 291 H.5.2. Global Memory....................................................................................291 H.5.3. Shared Memory................................................................................... 291 H.6. Compute Capability 7.x.............................................................................. 292 H.6.1. Architecture....................................................................................... 292 H.6.2. Independent Thread Scheduling............................................................... 292 H.6.3. Global Memory....................................................................................294 H.6.4. Shared Memory................................................................................... 295 Appendix I. Driver API....................................................................................... 297 I.1. Context................................................................................................... 300 I.2. Module....................................................................................................301 I.3. Kernel Execution........................................................................................302 I.4. Interoperability between Runtime and Driver APIs............................................... 304 Appendix J. CUDA Environment Variables...............................................................305 Appendix K. Unified Memory Programming..............................................................308 K.1. 
Unified Memory Introduction........................................................................ 308 K.1.1. System Requirements............................................................................ 309 K.1.2. Simplifying GPU Programming.................................................................. 309 K.1.3. Data Migration and Coherency................................................................. 311 K.1.4. GPU Memory Oversubscription................................................................. 311 K.1.5. Multi-GPU.......................................................................................... 312 K.1.6. System Allocator..................................................................................312 K.1.7. Hardware Coherency.............................................................................313 K.1.8. Access Counters.................................................................................. 314 K.2. Programming Model....................................................................................314 K.2.1. Managed Memory Opt In........................................................................ 314 K.2.1.1. Explicit Allocation Using cudaMallocManaged()........................................ 315 www.nvidia.com CUDA C Programming Guide PG-02829-001_v10.2 | xiii K.2.1.2. Global-Scope Managed Variables Using __managed__.................................316 K.2.2. Coherency and Concurrency.................................................................... 316 K.2.2.1. GPU Exclusive Access To Managed Memory............................................. 316 K.2.2.2. Explicit Synchronization and Logical GPU Activity.....................................317 K.2.2.3. Managing Data Visibility and Concurrent CPU GPU Access with Streams......... 318 K.2.2.4. Stream Association Examples............................................................. 319 K.2.2.5. 
Stream Attach With Multithreaded Host Programs.................................... 320 K.2.2.6. Advanced Topic: Modular Programs and Data Access Constraints................... 321 K.2.2.7. Memcpy()/Memset() Behavior With Managed Memory................................ 322 K.2.3. Language Integration............................................................................ 322 K.2.3.1. Host Program Errors with __managed__ Variables.....................................323 K.2.4. Querying Unified Memory Support.............................................................324 K.2.4.1. Device Properties........................................................................... 324 K.2.4.2. Pointer Attributes........................................................................... 324 K.2.5. Advanced Topics.................................................................................. 324 K.2.5.1. Managed Memory with Multi-GPU Programs on pre-6.x Architectures.............. 324 K.2.5.2. Using fork() with Managed Memory...................................................... 325 K.3. Performance Tuning................................................................................... 325 K.3.1. Data Prefetching..................................................................................326 K.3.2. Data Usage Hints................................................................................. 327 K.3.3. Querying Usage Attributes...................................................................... 328 www.nvidia.com CUDA C Programming Guide PG-02829-001_v10.2 | xiv LIST OF FIGURES Figure 1 Floating-Point Operations per Second for the CPU and GPU ...................................1 Figure 2 Memory Bandwidth for the CPU and GPU .........................................................2 Figure 3 The GPU Devotes More Transistors to Data Processing ......................................... 
2 Figure 4 GPU Computing Applications ........................................................................ 3 Figure 5 Automatic Scalability ................................................................................. 5 Figure 6 Grid of Thread Blocks ................................................................................ 9 Figure 7 Memory Hierarchy ................................................................................... 11 Figure 8 Heterogeneous Programming ...................................................................... 13 Figure 9 Matrix Multiplication without Shared Memory .................................................. 25 Figure 10 Matrix Multiplication with Shared Memory .....................................................28 Figure 11 Child Graph Example .............................................................................. 38 Figure 12 Creating a Graph Using Graph APIs Example .................................................. 39 Figure 13 The Driver API Is Backward but Not Forward Compatible ..................................101 Figure 14 Parent-Child Launch Nesting .................................................................... 196 Figure 15 Nearest-Point Sampling Filtering Mode ........................................................274 Figure 16 Linear Filtering Mode ............................................................................ 275 Figure 17 One-Dimensional Table Lookup Using Linear Filtering ...................................... 276 Figure 18 Examples of Global Memory Accesses ......................................................... 285 Figure 19 Strided Shared Memory Accesses ...............................................................289 Figure 20 Irregular Shared Memory Accesses .............................................................290 Figure 21 Library Context Management ................................................................... 
301 www.nvidia.com CUDA C Programming Guide PG-02829-001_v10.2 | xv LIST OF TABLES Table 1 Linear Memory Address Space ...................................................................... 20 Table 2 Cubemap Fetch ........................................................................................57 Table 3 Throughput of Native Arithmetic Instructions .................................................. 119 Table 4 Alignment Requirements ........................................................................... 131 Table 5 New Device-only Launch Implementation Functions .......................................... 205 Table 6 Supported API Functions ........................................................................... 205 Table 7 Single-Precision Mathematical Standard Library Functions with Maximum ULP Error .... 214 Table 8 Double-Precision Mathematical Standard Library Functions with Maximum ULP Error... 218 Table 9 Functions Affected by -use_fast_math .......................................................... 222 Table 10 Single-Precision Floating-Point Intrinsic Functions ........................................... 223 Table 11 Double-Precision Floating-Point Intrinsic Functions .......................................... 224 Table 12 C 11 Language Features ........................................................................ 225 Table 13 C 14 Language Features ........................................................................ 228 Table 14 Feature Support per Compute Capability ......................................................277 Table 15 Technical Specifications per Compute Capability ............................................ 278 Table 16 Objects Available in the CUDA Driver API ..................................................... 297 Table 17 CUDA Environment Variables .....................................................................305