med ∩ ml

Language identification with fastText

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

import fasttext

lid_model = fasttext.load_model('lid.176.bin')

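def detector(text):
    # return an empty string if the tweet is missing, empty, or only whitespace
    if not isinstance(text, str) or not text.strip():
        return ""
    # fastText predicts one line at a time, so replace newlines with spaces
    text = text.replace("\n", " ")
    # take the top label (e.g. "__label__en") and keep only the language code
    return lid_model.predict(text)[0][0].split("__label__")[1]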

df['language'] = df['Tweet'].apply(detector)
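
Tweets are short and noisy, so the top label is sometimes little more than a guess. fastText's predict also returns a probability next to each label, and the sketch below keeps it and treats low-confidence predictions as unknown. The helper name detect_with_confidence and the min_prob cutoff of 0.5 are my own illustrative assumptions, not part of fastText; the model is passed in as an argument so the function can be tried without the downloaded file.

```python
def detect_with_confidence(model, text, min_prob=0.5):
    """Return (language_code, probability), or ("", prob) when unsure.

    `model` is anything with a fastText-style predict(text, k=1) method,
    such as the lid_model loaded earlier. `min_prob` is an arbitrary
    cutoff chosen for illustration; tune it on your own data.
    """
    # skip missing, empty, or whitespace-only input
    if not isinstance(text, str) or not text.strip():
        return "", 0.0
    # fastText predicts one line at a time, so flatten newlines first
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    # guard against an empty prediction
    if len(labels) == 0:
        return "", 0.0
    code = labels[0].replace("__label__", "")
    prob = float(probs[0])
    # treat low-confidence guesses as "unknown" but still report the score
    if prob < min_prob:
        return "", prob
    return code, prob
```

To use it on the data frame, call detect_with_confidence(lid_model, tweet) for each value of df['Tweet'] and store the two items of the returned tuple in separate columns, for example 'language' and 'language_prob'.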
