I have been working on a sentiment analysis project recently, and have been looking for ways to improve the performance of my model.
Today I set myself the challenge of learning how to make use of pre-trained GloVe word embeddings.
On the official website for GloVe word embeddings, there are several zip files containing embeddings trained on different data. One is trained on Wikipedia, another on Twitter, and two on web-crawl data.
Within each zip file, there are several text files that contain the actual word vectors. There is a separate file for each word embedding size trained on the same data. For example, here is a list of the files in the glove.6B.zip archive trained on Wikipedia:
glove.6B.50d.txt
glove.6B.100d.txt
glove.6B.200d.txt
glove.6B.300d.txt
6B specifies the number of tokens the data was trained on. 50d, 100d, etc. specify the size of the word vectors. The format inside each file is as follows, shown in this simplified example:
when 0.27062 -0.36596 0.097193 -0.50708 0.37375
year -0.098793 0.26983 0.35304 -0.10727 -0.015183
there 0.68491 0.32385 -0.11592 -0.35925 0.49889
...
Each line starts with the token/word, followed by all the values of the corresponding vector for that token, each separated by a space. The number of values matches the word vector size. So, for example, if you opened the glove.6B.300d.txt file, each line would contain 300 numbers after the token.
The number of lines corresponds to the size of the vocabulary.
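If you ever want to double-check these numbers for a particular file, a quick (and entirely optional) check like the one below reads the first line to get the vector size and counts the remaining lines to get the vocabulary size. The path here is just a placeholder for wherever you unzipped the file.

from io import open

file = "/path/to/glove.6B.100d.txt"
with open(file, encoding="utf-8", mode="r") as textFile:
    first_line = textFile.readline().split()
    print("Vector size:", len(first_line) - 1)           # first element is the token itself
    print("Vocab size:", 1 + sum(1 for _ in textFile))   # remaining lines, plus the one already read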
In my case, I already had an established vocabulary for the data I was using in my project. It is a vocabulary that will not match up exactly with the GloVe vocabulary, so the challenge for today was to load the GloVe vectors in a format that would be useful given my established vocabulary.
We can start with a list that maps the ids of your own established vocabulary to the actual tokens, and a dictionary that maps token strings to ids (as in this simple example):
id2word = ["PAD", "UNKNOWN", "the", "there", "year", "when"]
word2id = {word: id for id, word in enumerate(id2word)}
We will need to initialize the array that will store the embeddings. Since we cannot assume that all of the tokens in our pre-defined vocabulary will exist in the pretrained GloVe vectors, we will initialize the array to random values.
import numpy as np

# INITIALIZE EMBEDDINGS TO RANDOM VALUES
embed_size = 100
vocab_size = len(id2word)
sd = 1/np.sqrt(embed_size)  # Standard deviation to use
weights = np.random.normal(0, scale=sd, size=[vocab_size, embed_size])
weights = weights.astype(np.float32)
I initialized the weights using a variant of Xavier initialization. You could play around with different initialization strategies to see if you get better results.
Note: the embed_size variable should match the word vector size of the file you will use.
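If you do want to play around with other initialization strategies, here is a rough sketch of a couple of alternatives. The choice of a uniform distribution (or plain zeros) is just an assumption for illustration, not something the GloVe project prescribes.

# ALTERNATIVE INITIALIZATIONS (optional experiments)

# Uniform distribution over [-sd, sd], reusing the same sd as above
weights = np.random.uniform(low=-sd, high=sd, size=[vocab_size, embed_size]).astype(np.float32)

# Or simply zeros for any token not found in GloVe
# weights = np.zeros([vocab_size, embed_size], dtype=np.float32)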
To override these random values with the pretrained word vectors from the GloVe text file, we can run the following.
from io import open

file = "/path/to/glove.6B.100d.txt"

# EXTRACT DESIRED GLOVE WORD VECTORS FROM TEXT FILE
with open(file, encoding="utf-8", mode="r") as textFile:
    for line in textFile:
        # Separate the values from the word
        line = line.split()
        word = line[0]

        # If word is in our vocab, then update the corresponding weights
        id = word2id.get(word, None)
        if id is not None:
            weights[id] = np.array(line[1:], dtype=np.float32)
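It can also be useful to know how much of your vocabulary was actually covered by the GloVe file. A small second pass like the one below (with a hypothetical found_ids set that is not part of the code above) collects the ids that appear in the file and reports the coverage.

# OPTIONAL: check how much of the vocabulary was covered by GloVe
found_ids = set()
with open(file, encoding="utf-8", mode="r") as textFile:
    for line in textFile:
        word = line.split()[0]
        id = word2id.get(word, None)
        if id is not None:
            found_ids.add(id)

print("{} of {} vocab tokens found in GloVe".format(len(found_ids), vocab_size))

Any token that is not covered simply keeps its random initialization.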
In my case, I am using PyTorch, and I have a model that is created as a
subclass of torch.nn.Module
as follows:
class Model(nn.Module):
    def __init__(self, n_vocab, embed_size, ...):
        super(Model, self).__init__()
        ...
        self.embeddings = nn.Embedding(n_vocab, embed_size)
        ...

...
model = Model(n_vocab, embed_size, ...)
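For concreteness, a minimal stand-in for such a model might look like the sketch below. The LSTM-plus-linear architecture, and the hidden_size and n_classes values, are just illustrative assumptions, not the actual model from my project.

import torch
import torch.nn as nn

class SentimentModel(nn.Module):
    def __init__(self, n_vocab, embed_size, hidden_size=128, n_classes=2):
        super(SentimentModel, self).__init__()
        self.embeddings = nn.Embedding(n_vocab, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x is a [batch, sequence_length] tensor of token ids
        embedded = self.embeddings(x)       # [batch, seq_len, embed_size]
        _, (h_n, _) = self.lstm(embedded)   # h_n is [1, batch, hidden_size]
        return self.classifier(h_n[-1])     # [batch, n_classes]

model = SentimentModel(n_vocab=vocab_size, embed_size=embed_size)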
In order to update the embeddings in my model, I run the following:
# UPDATING PYTORCH EMBEDDINGS
model.embeddings.weight.data = torch.Tensor(weights)
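As a quick check that the pretrained vectors actually made it into the model, you can compare one row of the embedding layer against the corresponding row of weights. I am using "when" here only because it appears in the toy vocabulary above.

# VERIFY THAT THE EMBEDDINGS WERE UPDATED
token_id = word2id["when"]
print(np.allclose(model.embeddings.weight.data[token_id].numpy(), weights[token_id]))  # should print True

Another option you sometimes see is model.embeddings.weight.data.copy_(torch.from_numpy(weights)), which copies the values in place instead of replacing the tensor.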