Language identification with fastText

When working with a multilingual dataset, language identification is an important first step of the analysis. Here I’ll show a fast ⚡️ and reliable ✨ way to identify languages with fastText.

TL;DR

import fasttext

lid_model = fasttext.load_model('lid.176.bin')

def detector(text):
    # return empty string for blank tweets (nothing to predict on)
    if not text.strip():
        return ""
    # take the top label and strip the "__label__" prefix to get the code
    return lid_model.predict(text)[0][0].split("__label__")[1]

df['language'] = df['Tweet'].apply(detector)

First we need a dataset to play with. We’ll use the Kaggle CLI tool; run the following command after installing it, or download the dataset directly from Kaggle.

kaggle datasets download -d rtatman/the-umass-global-english-on-twitter-dataset
import pandas as pd

df = pd.read_csv('the-umass-global-english-on-twitter-dataset.zip', sep='\t')
df.head()

print(f"""There are: {df.Tweet.shape[0]} tweets.
    
First 10 tweets:
    
{df['Tweet'].head(10)}""")
There are: 10502 tweets.
    
First 10 tweets:
    
0                            Bugün bulusmami lazimdiii
1       Volkan konak adami tribe sokar yemin ederim :D
2                                                  Bed
3    I felt my first flash of violence at some fool...
4              Ladies drink and get in free till 10:30
5                     @Melanynijholtxo ahhahahahah dm!
6                                                 Fuck
7    Watching #Miranda On bbc1!!! @mermhart u r HIL...
8                                       @StizZsti fino
9            Shopping! (@ Kohl's) http://t.co/I8ZkQHT9
Name: Tweet, dtype: object

Looking at the first 10 tweets, we can already see multiple languages being used. Let’s now use fastText to classify them. You first need to download the language identification model; run the following shell command if you need to do so.

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Alternatively, you can use a slightly less accurate but a lot smaller model:

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

And now we can use it.

import fasttext

lid_model = fasttext.load_model('lid.176.bin')

Create a list in which each item is a tweet.

corpus = df['Tweet'].to_list()

Predict the language of the first tweet.

tweet = corpus[0]
prediction = lid_model.predict(corpus[0])
print(f"Text:\n \t {tweet} \n \n Prediction:\n \t {prediction}")
Text:
     Bugün bulusmami lazimdiii 
 
 Prediction:
     (('__label__tr',), array([0.77383512]))

The output of the prediction is a tuple of language labels and confidence scores. Each label is a string with “__label__” followed by the ISO 639 code of the language. You can check the language codes here and the languages supported by fastText at the end of their blog post.
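The label string can be turned back into a bare language code with a small helper; `predict` also takes a `k` argument to return the `k` most likely languages. A sketch, assuming the model from above is loaded as `lid_model`:

```python
def label_to_code(label):
    """Strip the '__label__' prefix fastText puts on its class names."""
    return label.replace("__label__", "")

# Hypothetical usage (requires lid.176.bin loaded as lid_model):
# labels, probs = lid_model.predict(tweet, k=3)  # top-3 candidate languages
# for label, prob in zip(labels, probs):
#     print(label_to_code(label), round(float(prob), 3))
```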

Let’s try with a different one.

print(f"Text:\n \t {corpus[4]} \n \n Prediction:\n \t {lid_model.predict(corpus[4])}")
Text:
     Ladies drink and get in free till 10:30 
 
 Prediction:
     (('__label__en',), array([0.65553886]))

Now we can create a new column with the language of each tweet.

def detector(text):
    # return empty string for blank tweets (nothing to predict on)
    if not text.strip():
        return ""
    # take the top label and strip the "__label__" prefix to get the code
    return lid_model.predict(text)[0][0].split("__label__")[1]

%%time
df['language'] = df['Tweet'].apply(detector)
CPU times: user 258 ms, sys: 4.17 ms, total: 262 ms
Wall time: 266 ms

I added the %%time IPython magic to show one of the main reasons to use fastText: the speed ⚡️. In about 260 ms I processed 10,502 tweets on a 2016 laptop with a dual-core 2 GHz Intel Core i5 processor.

The speed of fastText can also be compared with langid.py in this table from their blog.
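If speed matters even more, you can skip the row-by-row `.apply` entirely: fastText’s `predict` also accepts a list of strings and does the loop in C++, returning a list of label tuples and a list of score arrays. A sketch, assuming `df` and `lid_model` from above (newlines have to be removed first, since `predict` works one line at a time):

```python
def labels_to_codes(label_lists):
    # each item is a tuple like ('__label__en',); keep the top guess
    return [labels[0].split("__label__")[1] for labels in label_lists]

# Hypothetical batch usage:
# texts = df['Tweet'].str.replace('\n', ' ').to_list()
# labels, probs = lid_model.predict(texts)
# df['language'] = labels_to_codes(labels)
```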

Finally, if we want to work with only a subset of the languages and convert the rest to “unk” (and maybe use a multilingual model when working with them later), we can use the following.

# option 1: keep the 4 most common languages according to the predictions
languages_keep = df.language.value_counts().index[:4].to_list()

# option 2: use a custom list
languages_keep = ["en", "es", "pt", "ja"]

df.loc[df.language.apply(lambda x: x not in languages_keep), 'language'] = "unk"
df.language.value_counts()
en     5735
unk    2247
es     1139
pt      876
ja      505
Name: language, dtype: int64
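The same idea extends to confidence: besides keeping only a fixed set of languages, you can also send low-confidence predictions to “unk”. A sketch of such a mapping; the 0.5 threshold is an arbitrary choice of mine, not something prescribed by fastText:

```python
def normalize_language(code, prob, keep=("en", "es", "pt", "ja"), min_conf=0.5):
    """Return the code if it is in `keep` and confident enough, else 'unk'."""
    return code if code in keep and prob >= min_conf else "unk"

# Hypothetical usage with the full prediction tuple:
# labels, probs = lid_model.predict(tweet)
# language = normalize_language(labels[0].split("__label__")[1], float(probs[0]))
```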

And from here we can keep working with our text data!