A Record of Common PyTorch Errors

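1. cuDNN version mismatch
This problem cost me a whole morning.
First, $LD_LIBRARY_PATH
points to /usr/local/cuda-9.0/lib64.
After cd-ing into that directory, I found that the cuDNN version there is 7102 (i.e. 7.1.2),
but conda list shows that the cuDNN version in my Anaconda environment is 7301 (i.e. 7.3.1),
hence the version mismatch.
Solution: conda install cudnn=7.1.2
This changes the cuDNN version in the conda environment to 7102.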
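
The integer versions above (7102, 7301) and the dotted version passed to conda (7.1.2) are the same numbers in two notations. The sketch below shows the conversion, assuming the cuDNN 7.x encoding major*1000 + minor*100 + patch. The helper names decode_cudnn_version and versions_match are made up for this illustration and are not part of any library. In PyTorch, torch.backends.cudnn.version() reports the cuDNN version it was built against as an integer in the same style.

```python
def decode_cudnn_version(version):
    """Split a cuDNN 7.x integer version such as 7102 into (major, minor, patch).

    Assumes the 7.x encoding: major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def versions_match(system_version, conda_version):
    """Return True when both integer versions decode to the same release."""
    return decode_cudnn_version(system_version) == decode_cudnn_version(conda_version)


# The two versions from the notes above: the system library and the conda package.
system_cudnn = decode_cudnn_version(7102)
conda_cudnn = decode_cudnn_version(7301)
print("system:", ".".join(map(str, system_cudnn)))
print("conda: ", ".".join(map(str, conda_cudnn)))
print("match: ", versions_match(7102, 7301))
```

The dotted form of the system version is what goes after cudnn= in the conda install command.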

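2. Error when computing gradients
When PyTorch computes gradients (the backward pass), tensors whose values are needed for the gradient must not be modified by in-place operations.
In-place operations include:
1) the += and -= style operators; change them to x = x + 1
2) methods with a trailing underscore, such as squeeze_(); use the version without the underscore
3) modifying a tensor directly, for example by assigning into an index
One I just ran into today: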

# Each attention[i] = ... assignment writes into `attention` in place,
# which is what triggers the error during backward.
for i in range(b):
    if a[i] == self.tokenizer.yes_token:
        attention[i] = attention[i].clone() + self.alpha * att[i]
    elif a[i] == self.tokenizer.no_token:
        attention[i] = attention[i].clone() - self.beta * att[i]
    elif a[i] == self.tokenizer.non_applicable_token:
        attention[i] = attention[i].clone()

This raises the error. Rewriting it as follows fixes it:

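# Assumes `import torch as t` and `import torch.nn.functional as F`.
# Collect new row tensors in a list instead of writing into `attention`.
list_atten = []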
for i in range(b):
    if a[i] == self.tokenizer.yes_token:
        tmp = attention[i].clone() + self.alpha*att[i]
        tmp = F.softmax(tmp, dim=0)
        list_atten.append(tmp)
    elif a[i] == self.tokenizer.no_token:
        tmp = attention[i].clone() - self.beta * att[i]
        tmp = F.softmax(tmp, dim=0)
        list_atten.append(tmp)
    elif a[i] == self.tokenizer.non_applicable_token:
        tmp = attention[i].clone()
        list_atten.append(tmp)
attention = t.stack(list_atten, dim=0)
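
The in-place rule from this section can be reproduced in a few lines. This is a minimal sketch, not the code from my model: the function names are made up for illustration, and exp() is used only because its backward pass needs its own output, so modifying that output in place invalidates it.

```python
import torch


def inplace_error_message():
    """Modify a tensor that autograd saved for backward, and return the error text."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()   # exp() keeps its output around to compute the gradient
    y += 1        # in-place update of that saved output
    try:
        y.sum().backward()
    except RuntimeError as err:
        return str(err)
    return None


def out_of_place_grad():
    """Same computation, but `y = y + 1` creates a new tensor instead."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()
    y = y + 1     # out-of-place: the saved output of exp() is untouched
    y.sum().backward()
    return x.grad


print(inplace_error_message())
print(out_of_place_grad())
```

Building a list and calling stack, as in the rewritten loop above, follows the same idea: every step produces a new tensor, and nothing that autograd has saved gets overwritten.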

3. cublas runtime error
Not solved yet.

4. RuntimeError: rnn: hx is not contiguous
I suddenly ran into this when switching from a single-layer GRU to a multi-layer GRU.

q_feature = hidden.transpose(1, 0)

Solution:
Append contiguous() after the transpose call in the code

q_feature = hidden.transpose(1, 0).contiguous()  # b*1*520

It still does not run; to be resolved.
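
To see what contiguous() changes, here is a minimal sketch with assumed shapes (num_layers, batch, hidden), not the tensors from my model. With one layer, the transposed tensor should still count as contiguous, because a dimension of size 1 does not affect the layout check. With two layers it does not, which may be why the error only showed up after moving to a multi-layer GRU.

```python
import torch

# Assumed shapes for illustration: (num_layers, batch, hidden).
single_layer = torch.zeros(1, 4, 8)
multi_layer = torch.zeros(2, 4, 8)

# transpose() only swaps strides; it does not move any data.
single_t = single_layer.transpose(1, 0)
multi_t = multi_layer.transpose(1, 0)

single_ok = single_t.is_contiguous()
multi_ok = multi_t.is_contiguous()

# contiguous() copies the data into a fresh, densely laid out tensor.
multi_fixed = multi_t.contiguous()
fixed_ok = multi_fixed.is_contiguous()

print("single-layer transposed contiguous:", single_ok)
print("multi-layer transposed contiguous: ", multi_ok)
print("after .contiguous():               ", fixed_ok)
```

Note that contiguous() only fixes the memory layout. It does not change the shape, so a hidden state that has been transposed to batch-first order still has its dimensions in a different order than the (num_layers, batch, hidden) layout a GRU expects for hx.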

5. RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Solution:
sudo rm -rf ~/.nv, then reboot.
This method does not always work, either.
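On a lab server with several GPUs in one machine, people usually use only a single GPU. The TensorFlow-era way of doing this was

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In PyTorch this only takes effect if it runs before CUDA is initialized, so it is easy to get wrong. A more direct way is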

import torch
torch.cuda.set_device(0)

Alternatively, the GPU can be specified directly on the command line when launching the program:

CUDA_VISIBLE_DEVICES=1 python my_script.py
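
Another common pattern is to build a torch.device once and move tensors and models to it explicitly. The sketch below assumes nothing about the server: the helper name pick_device is made up for illustration, and it falls back to the CPU when the requested GPU index is not available. When the script is launched with CUDA_VISIBLE_DEVICES=1, the one visible GPU appears inside the process as index 0.

```python
import torch


def pick_device(index=0):
    """Return the GPU with the given index if it exists, otherwise the CPU."""
    if torch.cuda.is_available() and index < torch.cuda.device_count():
        return torch.device("cuda", index)
    return torch.device("cpu")


device = pick_device(0)

# Tensors (and models, via model.to(device)) are placed explicitly.
x = torch.ones(2, 3).to(device)
print(device, x.device)
```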
