Summary essentially based on the remarkable ‘GloVe: Global Vectors for Word Representation’
\( X \): co-occurrence matrix
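A minimal sketch (not from the paper) of how such a matrix can be accumulated from a tokenized corpus; the `corpus` and `window` values are illustrative, and the \( 1/d \) distance weighting follows the paper's counting scheme, where a pair of words \( d \) tokens apart contributes \( 1/d \) to the count.

```python
# Sketch: build a sparse co-occurrence matrix X from a tokenized corpus.
# Pairs that are d words apart contribute 1/d; most entries of X stay zero.
from collections import defaultdict

def cooccurrence_counts(corpus, window=10):
    vocab = {w: i for i, w in enumerate(sorted({w for sent in corpus for w in sent}))}
    X = defaultdict(float)  # sparse {(i, j): weighted count}
    for sent in corpus:
        ids = [vocab[w] for w in sent]
        for pos, i in enumerate(ids):
            for d, j in enumerate(ids[pos + 1 : pos + 1 + window], start=1):
                X[(i, j)] += 1.0 / d  # symmetric context window
                X[(j, i)] += 1.0 / d
    return X, vocab

X, vocab = cooccurrence_counts([["the", "cat", "sat", "on", "the", "mat"]], window=4)
```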
In the beginning was LSA: a ‘global matrix factorization’ / ‘count-based’ method
\( X = U \Sigma V^T \)
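As a reference point, a minimal LSA-style baseline, assuming a dense count matrix and an illustrative truncation rank \( k \):

```python
# LSA-style factorization: truncated SVD of the (optionally log-transformed)
# count matrix; rows of U_k * Sigma_k serve as word vectors.
import numpy as np

def lsa_embeddings(X, k=50):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X ≈ U diag(s) V^T
    return U[:, :k] * s[:k]                            # rank-k word vectors

# usage (illustrative): W = lsa_embeddings(np.log1p(X_dense), k=100)
```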
‘While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure’
Next move: prediction-based ~ probabilistic methods
‘The starting point for the skip-gram or ivLBL methods is a model \( Q_{ij} \) for the probability that word \(j \) appears in the context of word \(i \).’
Softmax model:
\( Q_{ij} = \frac{e^{w_i^T \tilde{w}_j}}{\sum_{k=1}^{V} e^{w_i^T \tilde{w}_k}} \)
‘Training proceeds in an on-line, stochastic fashion, but the implied global objective function can be written as’: \( J = -\sum_{i \in \text{corpus},\, j \in \text{context}(i)} \log Q_{ij} \)
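A sketch of this softmax model and the implied global objective, assuming word vectors and context vectors stored as rows of two matrices `W` and `W_tilde` (names and shapes are illustrative); the full-vocabulary denominator is the costly normalization discussed next.

```python
import numpy as np

def softmax_Q(W, W_tilde, i):
    """Q_{ij} for all j: probability that word j appears in the context of word i."""
    scores = W_tilde @ W[i]        # w_i^T w~_j for every j in the vocabulary
    scores -= scores.max()         # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()             # normalization over all V words (the expensive part)

def global_objective(W, W_tilde, pairs):
    """J = -sum of log Q_{ij} over (i, j) pairs observed in the corpus."""
    return -sum(np.log(softmax_Q(W, W_tilde, i)[j]) for i, j in pairs)
```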
‘Evaluating the normalization factor of the softmax for each term in this sum is costly. To allow for efficient training, the skip-gram and ivLBL models introduce approximations to \( Q_{ij} \) ’
But… (see p. 5 of the paper):
softmax \( \rightarrow \) (implicit distance: cross-entropy) \( \Rightarrow \) replaced by a least-squares objective on the logs of the counts
‘The idea of factorizing the log of the co-occurrence matrix is closely related to LSA and we will use the resulting model as a baseline in our experiments. A main drawback to this model is that it weighs all co-occurrences equally, even those that happen rarely or never. Such rare co-occurrences are noisy and carry less information than the more frequent ones — yet even just the zero entries account for 75–95% of the data in \( X \), depending on the vocabulary size and corpus.’
Hence: \( J = \sum_{i,j=1}^{V} f(X_{ij}) \, (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2 \)
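A sketch of this weighted least-squares cost; `f` is the paper's weighting function with \( x_{max} = 100 \) and \( \alpha = 3/4 \), and only nonzero \( X_{ij} \) enter the sum since \( f(0) = 0 \). The parameter layout (vectors as matrix rows, a sparse dict of counts) is an assumption for illustration.

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: damps rare co-occurrences, caps frequent ones, f(0) = 0."""
    return (min(x, x_max) / x_max) ** alpha

def glove_cost(W, W_tilde, b, b_tilde, X_pairs):
    """J = sum of f(X_ij) * (w_i^T w~_j + b_i + b~_j - log X_ij)^2 over nonzero X_ij."""
    J = 0.0
    for (i, j), x in X_pairs.items():
        err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x)
        J += f(x) * err ** 2
    return J
```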
Complexity: \( \text{GloVe} \sim |C|^{0.8} \) vs. \( \text{w2v} \sim |C| \)