Cuda_1
At first it was mostly just a review of stuff we generally knew. I've already passed it, so I'm not going to go too in depth, but now I'm doing the coding exercises, and Rishabh's right: once I get cooking I fucking COOK.
I just have to put myself into positions to cook.
Anyway so now we have the exercises.
And there was a cudaDeviceReset(), but once we eliminated it, the print output from the GPU stopped. Why is that? Well, the powerful thing about this setup is that there's asynchronous execution of host and device code (host -> CPU, device -> GPU, just in case I end up giving these notes to Rishabh or someone xd).
Soooo that means, I think, that the CPU instructions ended and then the program just exited, its lifetime completely determined by the CPU? Yeah, at least that's what GPT is saying.
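Here's a minimal sketch of the bug (the kernel name helloFromGpu is just my placeholder, not from the actual exercise):

```cuda
#include <cstdio>

// Placeholder kernel: prints from the device via the GPU's printf buffer.
__global__ void helloFromGpu() {
    printf("Hello from the GPU!\n");
}

int main() {
    helloFromGpu<<<1, 1>>>();  // launch is asynchronous: the CPU moves on immediately
    // With no cudaDeviceSynchronize() or cudaDeviceReset() here, main() can
    // return before the GPU flushes its printf buffer, so nothing gets printed.
    return 0;
}
```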
Also this jank colab setup is hella funny lol.
Ok, and this is actually, on a very simple scale, the same bug or issue which Edward Yang was talking about on the PyTorch podcast.
People run their code and the point at which they get their error is all fucked up because of async code. I'll have to hear the exact issue again (it's on one of the CUDA episodes), but it throws them off, of course, and really it's a fundamental disconnect from understanding how execution actually happens.
So there's a function called cudaDeviceSynchronize() which I guess just halts CPU execution until all the GPU calls issued before it catch up? Let me try it out.
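Sketching what that fix probably looks like on the same placeholder kernel from above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloFromGpu() {
    printf("Hello from the GPU!\n");
}

int main() {
    helloFromGpu<<<1, 1>>>();
    // Blocks the host until every previously issued device command finishes,
    // so the device printf output gets flushed before the program exits.
    cudaDeviceSynchronize();
    return 0;
}
```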
Ok yeah, very interesting stuff, the synchronize func worked great. Now onto the really interesting stuff! Which is going to be printing threads.
Each thread has its own thread index :0 So now we can do conditional stuff, which leads to interesting downstream implications.
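A tiny sketch of what I mean by conditional stuff (kernel name and launch shape are made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Branch on threadIdx.x so different threads in the same block do different work.
__global__ void branchOnThread() {
    if (threadIdx.x == 0) {
        printf("thread 0: doing the special-case work\n");
    } else {
        printf("thread %d: doing the default work\n", threadIdx.x);
    }
}

int main() {
    branchOnThread<<<1, 4>>>();  // one block of four threads
    cudaDeviceSynchronize();
    return 0;
}
```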
How would I get to threadIdx.y?
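As far as I can tell, threadIdx.y only becomes nonzero when the block itself is 2D, which you get by launching with a dim3 block shape. A sketch (print2d is my placeholder name):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void print2d() {
    // threadIdx.y stays 0 unless the block has a second dimension.
    printf("x=%d y=%d\n", threadIdx.x, threadIdx.y);
}

int main() {
    dim3 block(2, 3);          // 2 threads along x, 3 along y -> 6 threads total
    print2d<<<1, block>>>();
    cudaDeviceSynchronize();
    return 0;
}
```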
I experimented with blockIdx as well, and it is 0-indexed, which is interesting.
Well that was fucking awesome. And then I launched with 10 blocks and one thread each and got basically random numbers in the output.
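Sketch of that launch (my guess at why the output looks random: blocks are scheduled in no guaranteed order, so the indices come out shuffled):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void printBlock() {
    printf("block %d\n", blockIdx.x);
}

int main() {
    printBlock<<<10, 1>>>();   // 10 blocks, 1 thread each; execution order is not guaranteed
    cudaDeviceSynchronize();
    return 0;
}
```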
GPT output: It sounds like you've been deep-diving into CUDA programming and experiencing firsthand how asynchronous execution between the CPU (host) and GPU (device) affects your code's behavior. Here's a streamlined summary of your observations and experiments:

1. Asynchronous Execution: CUDA allows for asynchronous execution where the CPU can run ahead of the GPU. This means operations issued by the CPU to the GPU may not have completed by the time the CPU finishes its execution.
2. cudaDeviceReset(): Removing cudaDeviceReset() from your code stopped the GPU output from printing because this function cleans up all resources related to the current device in use. Its absence likely caused the program to terminate before the GPU could finish processing and outputting results.
3. cudaDeviceSynchronize(): You correctly identified and utilized cudaDeviceSynchronize(), which blocks the host execution until the GPU completes all preceding commands. This synchronization ensures that the GPU finishes its tasks, such as printing to the output, before the program terminates.
4. Experimenting with Thread and Block Indices: You've played around with threadIdx.y and blockIdx.x, which are used to identify the position of threads within blocks and blocks within grids, respectively. These indices are crucial for controlling what each thread executes and are zero-indexed.
5. Results from Varying the Kernel Launch Configuration: When launching kernels with different configurations (like 10 blocks and one thread per block), the output seemed random due to each thread having a unique index and potentially executing at different times or encountering different states of memory.
6. Practical Implications: This exploration helps in understanding more complex behaviors in multi-threaded environments, especially for conditional operations based on thread indices, which can lead to various outcomes depending on how threads are managed and executed.

Your journey through CUDA programming is uncovering the critical aspects of GPU programming, particularly how managing concurrency and synchronization can impact the results and performance of your applications. This foundational knowledge is vital as you delve into more complex GPU-based computations.
Hyped for chapter 2 tomorrow.
Ok, there was this previous iteration of Rohan, but we're not structured enough here and need to get a more accessible representation, because this is illegible.