Symposium on Machine Learning in Speech and Language Processing (MLSLP)
Bellevue, WA, USA
This paper investigates the relationship between the loss function, the type of regularization, and the resulting model sparsity of discriminatively trained multiclass linear models. The effects on sparsity of optimizing log loss are straightforward: L2 regularization produces very dense models, while L1 regularization produces much sparser models. However, optimizing hinge loss yields more nuanced behavior. We give experimental evidence and theoretical arguments that, for a class of problems that arises frequently in natural-language processing, both L1- and L2-regularized hinge loss lead to sparser models than L2-regularized log loss, but less sparse models than L1-regularized log loss. Furthermore, we give evidence and arguments that for models with only indicator features, there is a critical threshold on the weight of the regularizer below which L1- and L2-regularized hinge loss tend to produce models of similar sparsity.
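The "straightforward" log-loss case described above can be illustrated with a minimal sketch: train a logistic model twice on the same synthetic data, once with L2 regularization (plain gradient descent with weight decay) and once with L1 regularization (gradient descent with a proximal soft-threshold step), then count exact-zero weights. All data, hyperparameters, and function names here are illustrative assumptions, not taken from the paper; it uses a binary rather than multiclass model for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 50 features, only the first 5 informative.
n, d, k = 200, 50, 5
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:k] = 3.0 * rng.normal(size=k)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    # Clip to avoid overflow warnings in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_logistic(reg="l2", lam=0.1, lr=0.1, steps=2000):
    """Minimize average log loss + lam * regularizer by (proximal) gradient descent."""
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        if reg == "l2":
            # L2 regularization contributes a smooth lam * w term.
            w -= lr * (grad + lam * w)
        else:
            # L1 regularization via a proximal soft-threshold step,
            # which drives small weights to exactly zero.
            w -= lr * grad
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

def n_zeros(w, tol=1e-8):
    return int(np.sum(np.abs(w) < tol))

w_l2 = train_logistic("l2")
w_l1 = train_logistic("l1")
print("L2-regularized log loss, zero weights:", n_zeros(w_l2))
print("L1-regularized log loss, zero weights:", n_zeros(w_l1))
```

On data like this, the L2-regularized model keeps essentially all weights nonzero (dense), while the L1 proximal step zeroes out most of the uninformative features (sparse), matching the abstract's characterization of the log-loss case.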
Index Terms: regularization, hinge loss, support vector machines, SVMs, sparsity
Bibliographic reference. Moore, Robert / DeNero, John (2011): "L1 and L2 regularization for multiclass hinge loss models", In MLSLP-2011, 1-5.