Three More Word Embeddings Papers

Ok, one last round of word embeddings papers, then I’m on to research that’s a lot more relevant to my current work. Here I’ll look at three papers, all of which are again related to skip-gram. Two of them look at giving more or different information to neural embedding models, and one looks a little more deeply at the objective function optimized by skip-gram.


Read more...

Tensor Decompositions and Applications; Kolda and Bader; SIREV 2009


Reactions to the skip-gram model (three papers)

Finishing up (for now) my reading about skip-gram, I’ll summarize three papers that offer different reactions or follow-ons to the skip-gram papers. All of them deal with the difference between traditional distributional models, which produce representations from word-count statistics in a corpus, and the newer neural-network-based models, which directly train representations to make predictions about the corpus.
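
To make that distinction concrete, here’s a toy sketch (mine, not from any of the papers) of the first kind of model: a count-based representation just tallies which words show up near each target word in a small window, whereas skip-gram trains a dense vector to predict those same context words.

```python
# Toy sketch (mine, not from the papers) of a count-based distributional
# representation: each word is represented by the counts of the words that
# co-occur with it inside a small window. Corpus and window size are
# placeholders.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

cooccurrence = defaultdict(Counter)
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            cooccurrence[target][corpus[j]] += 1

# The count vector for "sat" is its (sparse) distributional representation;
# skip-gram instead learns a dense vector trained to predict these contexts.
print(cooccurrence["sat"])
```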


Read more...

Distributed Representations of Words and Phrases and their Compositionality; Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean; NIPS 2013

The second word embeddings paper I’ll discuss is the second main skip-gram paper, a follow-on to the original ICLR paper that basically drops the CBOW model and focuses on scaling up the skip-gram model to larger datasets. This paper makes three main contributions, I would say. First, they provide a slightly modified objective function and a few other sampling heuristics that result in a more computationally efficient model. Second, they show that their model works with phrases, too, though they just do this by replacing the individual tokens in a multiword expression with a single symbol representing the phrase - pretty simple, but it works. And lastly, they show what to me was a very surprising additional feature of the learned vector space: some relationships are encoded compositionally, meaning that you can just add the vectors for two words like “Russian” and “capital” to get a vector that is very close to “Moscow”. They didn’t do any kind of thorough evaluation of this, but the fact that it works at all surprised me. They did give a reasonable explanation, however, and I’ve put it into math below.
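
If you want to poke at the compositionality claim yourself, here is a minimal sketch (mine, not from the paper), assuming you have gensim installed and a set of pretrained word2vec vectors on disk; the file name is just a placeholder, and the exact nearest neighbors will depend on the vectors you load.

```python
# Minimal sketch of the additive-compositionality check, assuming pretrained
# word2vec vectors in the standard binary format (file name is a placeholder).
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# most_similar averages the unit-normalized vectors of the "positive" words
# (equivalent, up to scale, to adding them) and returns the nearest neighbors
# of that combined vector by cosine similarity.
for word, similarity in kv.most_similar(positive=["Russian", "capital"], topn=5):
    print(f"{word}\t{similarity:.3f}")
```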


Read more...

Efficient Estimation of Word Representations in Vector Space; Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean; ICLR 2013

I’m a bit late to the word embeddings party, but I just read a series of papers related to the skip-gram model proposed in 2013 by Mikolov and others at Google.


Read more...