Notes on Loop and Matrix Implementations of the SVM Loss
The SVM Loss Formula
$$
L_i = \sum_{j \neq y_i} \max\bigl(0,\ s_j - s_{y_i} + \Delta\bigr)
$$

If we negate $s_{y_i} - s_j - \Delta$, we get $s_j - s_{y_i} + \Delta$, which is exactly the term inside the max. Clearly, the correct class score should be higher than every misclassified score, and higher by at least a margin $\Delta$. From the optimization point of view, we ultimately want to minimize the loss, so the negative sign is placed in front, and this is what forms the hinge loss.
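As a quick sanity check of the formula, here is a minimal sketch with made-up scores for a single sample over three classes, using $\Delta = 1$ as in the code below:

```python
import torch

# Made-up scores for one sample over C = 3 classes; class 0 is the correct one.
scores = torch.tensor([3.2, 5.1, -1.7])
y_i = 0
delta = 1.0

# L_i = sum over j != y_i of max(0, s_j - s_{y_i} + delta)
loss_i = sum(max(0.0, (scores[j] - scores[y_i] + delta).item())
             for j in range(len(scores)) if j != y_i)
print(loss_i)  # max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) ≈ 2.9
```

Only class 1 violates the margin here, so it alone contributes to $L_i$.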
Naive SVM Loss
Each iteration processes one sample of shape (1, F), where F is the number of features, and computes its scores over all classes, shape (1, C). For every class, plug that class's score and the correct class's score into the formula; whenever the result is > 0, it contributes to the loss.
For the gradient, the margin $s_j - s_{y_i} + 1$ involves the weight columns of two classes, and in both terms the coefficient is the input sample itself ($s_j = W_{:,j}^\top x_i$, $s_{y_i} = W_{:,y_i}^\top x_i$). Following the signs, $dW_{:,y_i}$ subtracts $x_i$ and $dW_{:,j}$ adds $x_i$.
Finally, add the contribution of the regularization term (its gradient is $2 \cdot \text{reg} \cdot W$).
The code is as follows:
```python
import torch


def svm_loss_naive(W: torch.Tensor, X: torch.Tensor, y: torch.Tensor, reg: float):
    """
    Structured SVM loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples. When you implement the regularization over W, please DO NOT
    multiply the regularization term by 1/2 (no coefficient).

    Inputs:
    - W: A PyTorch tensor of shape (D, C) containing weights.
    - X: A PyTorch tensor of shape (N, D) containing a minibatch of data.
    - y: A PyTorch tensor of shape (N,) containing training labels; y[i] = c
      means that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as torch scalar
    - gradient of loss with respect to weights W; a tensor of same shape as W
    """
    dW = torch.zeros_like(W)  # initialize the gradient as zero

    # Compute the loss and the gradient.
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = W.t().mv(X[i])
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            # -(correct_class_score - scores[j] - 1)
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                # Compute the gradient at the same time as the loss: in the
                # margin, the coefficient of both weight columns is X[i].
                dW[:, j] += X[i]
                dW[:, y[i]] -= X[i]

    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead so we divide by num_train.
    loss /= num_train

    # Add regularization to the loss.
    loss += reg * torch.sum(W * W)

    # Average the data gradient over the batch; the gradient of the
    # regularization term reg * sum(W * W) is 2 * reg * W.
    dW /= num_train
    dW += reg * 2 * W

    return loss, dW
```
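To verify that the gradient accumulated inside the loop is correct, the analytic dW can be compared against a centered finite-difference approximation. This is only a sketch: the shapes, the seed, and the step size h are arbitrary choices, and the hinge loss is not differentiable exactly at a kink, so an occasional small mismatch there is expected.

```python
import torch

torch.manual_seed(0)
N, D, C = 5, 4, 3
X = torch.randn(N, D, dtype=torch.float64)
y = torch.randint(C, (N,))
W = torch.randn(D, C, dtype=torch.float64)
reg = 0.1

loss, dW = svm_loss_naive(W, X, y, reg)

h = 1e-6
for idx in [(0, 0), (1, 2), (3, 1)]:   # spot-check a few entries of W
    W_plus, W_minus = W.clone(), W.clone()
    W_plus[idx] += h
    W_minus[idx] -= h
    num_grad = (svm_loss_naive(W_plus, X, y, reg)[0]
                - svm_loss_naive(W_minus, X, y, reg)[0]) / (2 * h)
    print(idx, float(dW[idx]), float(num_grad))  # the two numbers should agree
```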
Vectorized SVM Loss
We can compute all the scores directly with (N, F) @ (F, C), index out the correct-class scores as an (N, 1) column, and apply the formula to obtain the margins, shape (N, C).
The classes that contribute to the loss are exactly those whose margin is greater than 0, so we can build an (N, C) mask. Note that the correct class also takes part: its mask entry is the negative of the number of margin-violating classes for that sample.
Then input (N, F).T @ mask (N, C) gives an (F, C) matrix; adding the regularization term yields the final gradient.
One way to understand this: after transposing the input, every sample is a column vector, and each column of the mask describes a linear combination of samples (namely, the combination of margin-violating samples that contribute to the loss), so the product is exactly the gradient we want. This kind of "it works out exactly" structure is very common in deep learning.
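To make the mask construction concrete, here is a small sketch on a made-up margin matrix with 2 samples and 3 classes; it only illustrates that each correct-class entry ends up as minus the number of margin-violating classes for that sample:

```python
import torch

margin = torch.tensor([[0.0, 2.3, 0.0],    # sample 0: only class 1 violates the margin
                       [1.1, 0.0, 0.7]])   # sample 1: classes 0 and 2 violate it
y = torch.tensor([0, 1])                   # correct class of each sample
N = margin.shape[0]

mask = (margin > 0).to(margin.dtype)       # 1 wherever a class contributes to the loss
mask_sum = torch.sum(mask, dim=1)          # number of violating classes per sample
mask[torch.arange(N), y] -= mask_sum       # correct-class column gets -count
print(mask)
# tensor([[-1.,  1.,  0.],
#         [ 1., -2.,  1.]])
```

Multiplying X.t() of shape (F, N) with this (N, C) mask then produces, column by column, exactly the signed sums of samples that the naive double loop accumulates into dW.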
The code is as follows:
```python
def svm_loss_vectorized(W: torch.Tensor, X: torch.Tensor, y: torch.Tensor, reg: float):
    """
    Structured SVM loss function, vectorized implementation.

    When you implement the regularization over W, please DO NOT multiply the
    regularization term by 1/2 (no coefficient). The inputs and outputs are
    the same as svm_loss_naive.

    Inputs:
    - W: A PyTorch tensor of shape (D, C) containing weights.
    - X: A PyTorch tensor of shape (N, D) containing a minibatch of data.
    - y: A PyTorch tensor of shape (N,) containing training labels; y[i] = c
      means that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as torch scalar
    - gradient of loss with respect to weights W; a tensor of same shape as W
    """
    loss = 0.0
    dW = torch.zeros_like(W)  # initialize the gradient as zero

    # Vectorized loss.
    N = X.shape[0]
    scores = X @ W                                              # (N, C)
    y_correct_pred = scores[torch.arange(N), y].reshape(-1, 1)  # (N, 1)
    margin = -(y_correct_pred - scores - 1)                     # (N, C)
    margin[torch.arange(N), y] = 0                              # correct class contributes no loss
    margin = torch.maximum(margin, torch.tensor(0.0))           # (N, C)
    loss = torch.sum(margin) / N + reg * torch.sum(W * W)

    # Vectorized gradient, reusing the margins computed above.
    mask = (margin > 0).to(X.dtype)        # 1 for each margin-violating class
    mask_sum = torch.sum(mask, dim=1)      # number of violating classes per sample
    mask[torch.arange(N), y] -= mask_sum   # correct-class column gets -count
    dW = X.t() @ mask                      # (D, C)

    # Normalize and add the regularization term.
    dW /= N
    dW += 2 * reg * W

    return loss, dW
```
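Finally, assuming both functions above are defined in the same scope, the two implementations can be cross-checked on random data; the shapes, seed, and regularization strength here are arbitrary:

```python
import torch

torch.manual_seed(0)
N, D, C = 64, 128, 10
X = torch.randn(N, D, dtype=torch.float64)
y = torch.randint(C, (N,))
W = 0.001 * torch.randn(D, C, dtype=torch.float64)
reg = 0.05

loss_naive, dW_naive = svm_loss_naive(W, X, y, reg)
loss_vec, dW_vec = svm_loss_vectorized(W, X, y, reg)

print(float(loss_naive - loss_vec))            # should be ~0
print(float((dW_naive - dW_vec).abs().max()))  # should be ~0
```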