Overview

A package for working with files containing word embeddings (aka word vectors). Written for:

  1. providing a common interface for different file formats;

  2. providing a flexible function for building “embedding matrices” that you can use for initializing the Embedding layer of your deep learning model;

  3. using as little RAM as possible: there is no need to load 3M vectors, as with gensim.load_word2vec_format, when you only need 20K;

  4. satisfying my (inexplicable) urge to write a Python package.
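To illustrate the memory argument in point 3, here is a minimal sketch of scanning a textual embedding file and keeping only the requested words. It is not embfile's implementation: the function name load_selected and the sample data are hypothetical, and the sketch assumes the common word2vec text layout (a "vocab_size vector_size" header followed by one "word v1 v2 ..." line per word).

```python
import io


def load_selected(lines, wanted):
    """Scan a word2vec-style text stream, keeping only the wanted words.

    Hypothetical helper for illustration; not part of embfile.
    """
    wanted = set(wanted)
    found = {}
    # Assumed header line: "<vocab_size> <vector_size>"
    header = next(lines).split()
    vector_size = int(header[1])
    for line in lines:
        word, *values = line.rstrip("\n").split(" ")
        if word not in wanted:
            continue  # skipped vectors are never parsed or stored
        if len(values) != vector_size:
            raise ValueError(f"unexpected vector length for {word!r}")
        found[word] = [float(v) for v in values]
        if len(found) == len(wanted):
            break  # every requested word was found: stop reading early
    return found, wanted - set(found)


# Tiny in-memory stand-in for a real embedding file (made-up values).
sample = io.StringIO(
    "4 3\n"
    "hello 0.1 0.2 0.3\n"
    "ciao 0.4 0.5 0.6\n"
    "world 0.7 0.8 0.9\n"
    "mondo 1.0 1.1 1.2\n"
)
vectors, missing = load_selected(sample, ["ciao", "hello", "someMissingWord"])
```

Only the vectors of the requested words are ever held in memory, so the cost scales with the size of your vocabulary rather than with the size of the file.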

Features

  • Supports the textual format and Google’s binary format, plus a convenient custom format (.vvm) that allows constant-time access to word vectors (by word).

  • Makes it easy to implement, test and integrate new file formats.

  • Supports virtually any text encoding and vector data type (though you should probably stick to UTF-8 as the encoding).

  • Well-documented and type-annotated (meaning great IDE support).

  • Extensively tested.

  • Progress bars (by default) for every time-consuming operation.

Installation

pip install embfile

Quick start

import embfile

with embfile.open("path/to/file.bin") as f:     # infer file format from file extension

    print(f.vocab_size, f.vector_size)

    # Load some word vectors into a dictionary (raises KeyError if any word is missing)
    word2vec = f.load(['ciao', 'hello'])

    # Like f.load() but allows missing words (and returns them in a set)
    word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])

    # Build a matrix for initializing an Embedding layer, either from
    # a list of words or from a dictionary {word: index}. Handles the
    # initialization of any missing word vectors (see "oov_initializer")
    words = ['ciao', 'hello', 'someMissingWord']  # example word list
    matrix, word2index, missing_words = embfile.build_matrix(f, words)
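As a rough picture of what building an embedding matrix involves, the sketch below assembles one row per word and fills the rows of missing words with random values. It is a simplified illustration, not embfile's build_matrix: the name build_matrix_sketch, its parameters, and the Gaussian fallback are assumptions made for this example, and it uses plain lists where a real implementation would likely use an array library.

```python
import random


def build_matrix_sketch(word2vec, words, vector_size, seed=0):
    """Build a row-per-word matrix; hypothetical, not embfile's API.

    Assumes the words are unique. Out-of-vocabulary rows are drawn from
    a small Gaussian, which is an assumption of this sketch only.
    """
    rng = random.Random(seed)
    word2index = {word: index for index, word in enumerate(words)}
    matrix = []
    missing = set()
    for word in words:
        if word in word2vec:
            matrix.append(list(word2vec[word]))
        else:
            missing.add(word)
            matrix.append([rng.gauss(0.0, 0.1) for _ in range(vector_size)])
    return matrix, word2index, missing


# Made-up vectors standing in for those read from an embedding file.
known = {"ciao": [0.4, 0.5], "hello": [0.1, 0.2]}
matrix, word2index, missing = build_matrix_sketch(
    known, ["ciao", "hello", "someMissingWord"], vector_size=2
)
```

Row i of the matrix holds the vector of the word whose index is i, which is the layout an Embedding layer expects for its initial weights.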

Examples

The examples show how to use embfile to initialize the Embedding layer of a deep learning model. They are only illustrative, so don’t skip the documentation.