TDM 30200: Project 10 — 2023
Motivation: In this project, we will utilize SLURM for a couple of purposes. The first is to have the chance to utilize a GPU on the cluster for some pytorch work, and the second is to use resampling to get point estimates. We can then use those point estimates to make a confidence interval and gain a better understanding of the variability of our model.
Context: This is the fourth of a series of 4 projects focused on using SLURM. This project is also an interlude to a series of projects on pytorch and JAX. We will use pytorch for our calculations.
Scope: SLURM, unix, bash, pytorch, Python
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/sim/train.csv
- /anvil/projects/tdm/data/sim/test.csv
- /anvil/projects/tdm/data/sim/train100k.csv
- /anvil/projects/tdm/data/sim/train10m.csv
Questions
You do not want to wait until the end of the week to do part 1 of this project. Part 1 is pretty straightforward, and basically just requires running code that you’ve already written a variety of times. GPU access is limited, which is why you should attempt to run through part 1 earlier rather than later.
This project is broken into two parts. In part 1, we will use Part 2 is a continuation from the previous project. In the previous project, you used |
Part 1
You should all have been granted access to our GPU allocation. If you try to use the GPU allocation and run into issues, please create a post in Piazza and make sure you include your Anvil username. To find your Anvil username, you can run the following in a terminal inside your Jupyter Notebook:
|
This question should be completed using our GPU allocation, since our regular allocation does not have access to GPUs. To launch the Jupyter Lab instance using our GPU allocation, use the typical Jupyter Notebook option at ondemand.anvil.rcac.purdue.edu. However, instead of using the default options, use the following:
To confirm you have access to the GPU you can use the following code. Note that you only really need one of these, but I am showing them all because they may be interesting to you.
|
For this question you will use pytorch with cpus (like in the previous project) to build a model for train.csv, train100k.csv, and train10m.csv. Use the %%time Jupyter magic to time the calculation for each dataset.
The following is the code from the previous project that you can use to get started.
|
For |
Modify your code to use a gpu instead of cpus, and time how long it takes to train the model using train.csv, train100k.csv, and train10m.csv. What percentage faster are the GPU calculations for each dataset?
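A minimal sketch of what moving the calculation to a GPU can look like, assuming the model from the previous project has the same three parameters as the `Regression` class in the script used in part 2. The `device` variable and the toy `x_train` and `y_train` tensors are illustrative assumptions, not the project's actual code. The sketch falls back to the CPU when PyTorch cannot see a GPU.

```python
import torch


class Regression(torch.nn.Module):
    """Quadratic regression: y = beta0 + beta1*x + beta2*x^2."""

    def __init__(self):
        super().__init__()
        self.beta0 = torch.nn.Parameter(torch.tensor(5.0))
        self.beta1 = torch.nn.Parameter(torch.tensor(4.0))
        self.beta2 = torch.nn.Parameter(torch.tensor(3.0))

    def forward(self, x):
        return self.beta0 + self.beta1 * x + self.beta2 * x**2


# Use the GPU when one is visible to PyTorch, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in data (assumption): replace with tensors built from train.csv.
x_train = torch.linspace(-1.0, 1.0, 75).to(device)
y_train = (1.0 + 2.0 * x_train + 0.5 * x_train**2).to(device)

# The model and the data must live on the same device.
model = Regression().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.0003)
mseloss = torch.nn.MSELoss(reduction="mean")

for _ in range(100):
    loss = mseloss(model(x_train), y_train)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Move results back to the CPU before converting to numpy.
predictions = model(x_train).detach().cpu().numpy()
```

CUDA work is queued asynchronously, so a timed cell that never copies results back to the CPU may under-report the GPU time. Ending the timed cell with the `.cpu()` copy shown above, or with `torch.cuda.synchronize()`, is one way to keep the comparison fair.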
- Code used to solve this problem.
- Output from running the code.
- Time it took to build the model for the train.csv and train100k.csv using cpus. In addition, the estimated time it would take to build the model for train10m.csv, again, using cpus.
- Time it took to build the model for the train.csv, train100k.csv, and train10m.csv, using gpus.
- What percentage faster (or slower) the GPU version is vs the CPU version for each dataset.
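One way to compute the comparison, shown as a sketch. The timings below are made-up placeholders, not measurements; substitute the wall times reported by `%%time`. "Percentage faster" is taken here to mean the reduction in wall time relative to the CPU run, which is an assumption about the intended definition.

```python
def percent_faster(cpu_seconds: float, gpu_seconds: float) -> float:
    """Percent reduction in wall time of the GPU run relative to the CPU run.

    Positive means the GPU was faster; negative means it was slower.
    """
    return (cpu_seconds - gpu_seconds) / cpu_seconds * 100


# Placeholder timings in seconds as (cpu, gpu); hypothetical, not real results.
timings = {
    "train.csv": (4.0, 5.0),
    "train100k.csv": (20.0, 8.0),
}

for name, (cpu_s, gpu_s) in timings.items():
    print(f"{name}: {percent_faster(cpu_s, gpu_s):.1f}% faster on the GPU")
```

A negative result is a legitimate answer: on a small dataset the cost of moving data to the GPU can outweigh the speedup.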
Part 2
You can now save your notebook, and switch back to using the regular |
We’ve provided you with a Python script called bootstrap_samples.py that accepts a single value, for example 10, and runs the code you wrote in the previous project 10 times. This code has a few modifications. One major, but simple, modification is that rather than using our training data to build the model, we instead sample values from our x_train tensor, with replacement, until we have as many values as x_train contains. This means that if x_train contained 1,2,3, we could produce any of the following samples: 1,2,3 or 1,1,2 or 1,2,2 or 3,3,3, etc. We call these resampled values xr_train. We then proceed as normal, building the model using xr_train instead of x_train.
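The resampling step can be sketched with plain Python lists. The toy `y_train` values and the fixed seed are assumptions added for illustration. The key detail is that the same resampled indices are applied to both x and y, so each (x, y) pair stays together.

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable

x_train = [1, 2, 3]
y_train = [10, 20, 30]  # toy responses (assumption), here y = 10 * x

# Draw as many indices as there are observations, with replacement.
n = len(x_train)
resampled_idxs = random.choices(range(n), k=n)

# Index x and y with the same indices so each pair stays matched.
xr_train = [x_train[i] for i in resampled_idxs]
yr_train = [y_train[i] for i in resampled_idxs]

print(xr_train, yr_train)
```

Using `len(x_train)` rather than a hard-coded sample size keeps the sketch valid for any dataset; the provided script hard-codes 75, the size of its training set.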
In addition, at the end of the script, we use the model to get predictions for all of the values in x_test. The script saves these predictions to a parquet file, for example, 0cd68e5e-134d-4575-a31d-2060644f4caa.parquet, in a safe location, for example $SCRATCH/p10output/. Each file will contain a single set of point estimates for our predictions.
import sys
import argparse
import pandas as pd
import random
import torch
from pathlib import Path
import uuid


class Regression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.beta0 = torch.nn.Parameter(torch.tensor(5, requires_grad=True, dtype=torch.float))
        self.beta1 = torch.nn.Parameter(torch.tensor(4, requires_grad=True, dtype=torch.float))
        self.beta2 = torch.nn.Parameter(torch.tensor(3, requires_grad=True, dtype=torch.float))

    def forward(self, x):
        return self.beta0 + self.beta1*x + self.beta2*x**2


def get_point_estimates(x_train, y_train, x_test):
    model = Regression()
    learning_rate = .0003
    num_epochs = 10000
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    mseloss = torch.nn.MSELoss(reduction='mean')

    # resample data
    resampled_idxs = random.choices(range(75), k=75)
    xr_train = torch.tensor(x_train[resampled_idxs], requires_grad=True, dtype=torch.float).reshape(75)

    for _ in range(num_epochs):
        # set to training mode -- note this does not _train_ anything
        model.train()

        # calculate the predictions / forward pass
        y_predictions = model(xr_train)

        # calculate the MSE
        mse = mseloss(y_train[resampled_idxs], y_predictions)

        # calculate the partial derivatives / backwards step
        mse.backward()

        # update our parameters
        optimizer.step()

        # zero out the gradients
        optimizer.zero_grad()

    # get predictions
    predictions = pd.DataFrame(data={"predictions": model(x_test).detach().numpy()})
    return(predictions)


def main():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="possible commands", dest="command")
    bootstrap_parser = subparsers.add_parser("bootstrap", help="")
    bootstrap_parser.add_argument("n", type=int, help="number of set of point estimates for predictions to output")
    bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to")

    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    args = parser.parse_args()

    if args.command == "bootstrap":
        dat = pd.read_csv("/anvil/projects/tdm/data/sim/train.csv")
        x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)
        y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float)

        dat = pd.read_csv("/anvil/projects/tdm/data/sim/test.csv")
        x_test = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)

        for _ in range(args.n):
            estimates = get_point_estimates(x_train, y_train, x_test)
            estimates.to_parquet(f"{Path(args.output) / str(uuid.uuid4())}.parquet")


if __name__ == "__main__":
    main()
Make sure your
|
You can use the script like the following, in order to create 10 sets of point estimates:
Make sure the |
Next, create your job script. Let’s call this p10_job.sh. You can use the following code. We would highly recommend using 10 cores to generate a total of 2000 sets of point estimates. The total runtime will vary but should be anywhere from 5 to 15 minutes.
#!/bin/bash
#SBATCH --account=cis220051 # Allocation (account) to charge
#SBATCH --partition=shared # Partition (queue)
#SBATCH --job-name=kevinsjob # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=kamstut@purdue.edu # Where to send mail
#SBATCH --time=00:30:00
#SBATCH --ntasks=10 # Number of tasks (total)
#SBATCH -o /dev/null # Output to dev null
#SBATCH -e /dev/null # Error to dev null
for((i=0; i < 10; i+=1))
do
    srun -A cis220051 -p shared --exact -n 1 -c 1 singularity exec /anvil/projects/tdm/apps/containers/images/python:f2022-s2023.sif python3 $HOME/bootstrap_samples.py bootstrap 200 --output $SCRATCH/p10output/ &
done

wait
You won’t need any of that array stuff anymore since we don’t have to keep track of the files we’re working with.
Make sure both
|
Submit your job using sbatch p10_job.sh.
Make sure to clear out the SLURM environment variables if you choose to run the
|
Great! Now you have a directory $SCRATCH/p10output/ that contains 2000 sets of point estimates. Your job is now to process this data to create a graphic showing:
- The actual y_test values (in blue) as a set of points (using plt.scatter).
- The predictions as a line.
- The confidence intervals as a shaded region. (You can use plt.fill_between).
The 95% confidence interval for each prediction runs from the 2.5th percentile of that prediction’s point estimates (lower) to the 97.5th percentile of that prediction’s point estimates (upper).
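The percentile calculation can be sketched with NumPy. The `estimates` array below is randomly generated stand-in data (an assumption), laid out with one row per set of point estimates and one column per value in x_test; how the real parquet files get stacked into that shape is left to you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the bootstrap output (assumption): one row per set of point
# estimates, one column per value in x_test.
estimates = rng.normal(loc=10.0, scale=2.0, size=(1500, 50))

# Percentiles are taken down the rows, giving one bound per test point.
lower = np.percentile(estimates, 2.5, axis=0)
upper = np.percentile(estimates, 97.5, axis=0)

print(lower.shape, upper.shape)
```

The resulting `lower` and `upper` arrays have one entry per test point, so they can be passed as the two boundary arguments to `plt.fill_between`. The shaded region will likely only look right if the x values are sorted, with the bounds reordered to match.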
You can import via:
|
You will need to run the algorithm to get your predictions using the non-resampled training data; otherwise you won’t have the predictions to plot!
You will notice that some of your point estimates will be NaN. Resampling can cause your model to no longer converge unless you change the learning rate. Remove the NaN values. You should be left with around 1500 sets of point estimates that you can use.
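A sketch of the filtering step, assuming the sets of point estimates have already been stacked into a 2D NumPy array with one row per set. The toy values are made up for illustration.

```python
import numpy as np

# Toy stand-in (assumption): 4 sets of point estimates for 3 test values.
estimates = np.array([
    [1.0, 2.0, 3.0],
    [np.nan, np.nan, np.nan],  # a run that diverged
    [1.2, 2.1, 2.9],
    [0.9, np.nan, 3.1],        # partially invalid, also dropped here
])

# Keep only the rows in which every prediction is finite.
# np.isfinite rejects inf as well as NaN.
keep = np.isfinite(estimates).all(axis=1)
clean = estimates[keep]

print(f"kept {clean.shape[0]} of {estimates.shape[0]} sets")
```

Dropping whole rows, rather than individual values, keeps every remaining set complete, so each test point has the same number of estimates behind its percentiles.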
You can loop through the output files by doing something like:
|
- Code used to solve this problem.
- Output from running the code.
- 2-3 sentences explaining the "other" changes in the provided script.
- 1-2 sentences describing your opinion of the changes.
- p10_job.sh.
- Your resulting graphic. Make sure it renders properly when viewed in Gradescope.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.