TA here. Noted! I now have more resources to test more environments, and will do so whenever possible. I think freezing due to memory overuse is going to be a problem with anything you code yourself, but I do think we could be more rigorous with guiding people to achieve limited memory use for the tokenizer task.
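To make the memory point concrete, here is a minimal sketch of one way to keep peak memory bounded when counting word frequencies over a large corpus: read fixed-size blocks and only cut them at a delimiter, so no word is split across two pieces. This is not the assignment's reference solution; `iter_chunks`, `count_words`, the newline delimiter, and the whitespace split are all assumptions made for illustration.

```python
import io
from collections import Counter
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1 << 20,
                delimiter: bytes = b"\n") -> Iterator[bytes]:
    """Yield pieces of `stream` that end on `delimiter` (except possibly the last)."""
    leftover = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        block = leftover + block
        cut = block.rfind(delimiter)
        if cut == -1:
            # No delimiter seen yet: keep accumulating until one shows up.
            leftover = block
            continue
        end = cut + len(delimiter)
        yield block[:end]
        leftover = block[end:]
    if leftover:
        yield leftover


def count_words(stream: BinaryIO, chunk_size: int = 1 << 20) -> Counter:
    """Count whitespace-separated words without loading the whole stream at once."""
    counts: Counter = Counter()
    for chunk in iter_chunks(stream, chunk_size):
        counts.update(chunk.split())
    return counts


if __name__ == "__main__":
    demo = io.BytesIO(b"low lower\nlowest low\n")
    print(count_words(demo, chunk_size=4))
```

Peak memory here is roughly one block plus the counter, as long as delimiters occur regularly. If the input has no delimiter at all, `leftover` grows to the full input, so pick a delimiter that is actually present in your data.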
IMO the cost of renting GPUs is a bit overstated in these comments. Almost all of the development can be done locally and then run for a short period of time on on-demand GPUs. For assignment 1, you can run everything on your local machine, even if you don't have a GPU. For A1 and A2, you can do (most of) the tasks with only a few hours of renting. If you use rental GPUs throughout without being too careful, expect to spend around $200 on compute, but you can easily get this under $50 if you're willing to scale down many of the problems. I think we could work on making this clear and charting what these changes are.
If you have further feedback or encounter problems, feel free to open issues in the repos so we can resolve them! It's hard for us to fix issues we're not aware of.
Memory overuse: for context, it's about parallelism on the gloo backend with CPU. My observation is that on Linux, the same (bad) Python code gets the process killed quickly, saving the user the trouble of rebooting. Not sure if the macOS behavior is expected in the first place.
GPU cost: most of us will spend at least a few hours troubleshooting to get started on a leased GPU, including but not limited to figuring out how much storage is needed, whether the CUDA version works well, etc. Working without a GPU is definitely possible but difficult. Plus, one issue might be that most of us just don't have enough experience working with them, resulting in more time figuring things out.
GitHub issues -- noted, will open an issue for anything I can think of.
OOM on CUDA GPUs is relatively graceful (the process crashes). However, on macOS if torch MPS tries to allocate too much memory, the whole kernel will simply lock up and the only option is to reboot the computer. I have no idea why Apple doesn’t reserve memory for stuff like the OOM/kernel watchdog, but it seems they either don’t or there is a bug.
TA here. Biggest changes are in the second assignment (distributed) where we added a bunch of memory, profiling and distributed tasks, as well as in the fifth assignment (alignment), where most of the RL tasks are fresh this year. Assignment 3 (scaling laws) was also completely updated, but in a way that might be difficult to run without substantial resources. I'm working on a way for external students to be able to run simulated experiments for free!
Assignment 1 (basics) has the most hours of preparation invested in it, and only minor modernization/bug fixes were necessary this year.
We have autograding for code through tests written by hand, and additionally do manual code audits if we see suspicious behavior. We also do grading the old-fashioned way for writeups.
We do indeed catch students who don't follow the honor code. It's very obvious from how the code looks, as well as the rate of progress. Since we use Modal for class submissions, we have code deltas for every time they run something on B200s. The diffs often contain something like 300 lines in 5 minutes, in which case we review and report based on how egregious/provable it looks.
TA here. Definitely not! In fact we explicitly added sections in the first assignment to allow for scaling down to even local compute (M-series GPUs). For assignment 2 there are a few parts that require Triton support for your GPU, but everything can be adapted for much cheaper GPUs.
We were lucky enough to get Blackwell GPUs for Stanford students this year, which is why the writeups are written mostly around them.
I am only familiar with MLIR for accelerator-specific compilation, but my understanding is that by describing operations at a higher level, you don’t need the frontend to know what LLVM IR will lead to the best final performance. For instance you could say "perform tiled matrix multiplication" instead of "multiply and add while looping in this arbitrary indexing pattern", and an MLIR pass can reason about what pattern to use and take whatever hints you’ve given it. This is especially helpful when some implementations should be different depending on previous/next ops and what your target hardware is. I think there’s no reason Zig can’t do something like this internally, but MLIR is an existing way to build primitives at several different levels of abstraction. From what I’ve heard it’s far from ergonomic for compiler devs, though…
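To illustrate the difference between the two descriptions, here is a plain-Python sketch (not MLIR, and not how any particular compiler does it) of the same matrix multiplication written once as a straightforward triple loop and once with the loops tiled. The function names and the `tile` parameter are made up for the example; the point is only that both compute the same result, and choosing the tile size and loop order is the kind of decision a higher-level pass could make for you.

```python
def matmul_naive(a, b):
    """Triple loop: 'multiply and add' in one fixed indexing pattern."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


def matmul_tiled(a, b, tile=2):
    """Same computation, but iterating over tile x tile blocks of the index space."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work on one block; min() handles sizes not divisible by tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_naive(a, b) == matmul_tiled(a, b, tile=2))
```

In pure Python the tiled version is not faster; on real hardware the blocking matters because a tile can stay in cache or registers while it is reused.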