
Feature engineering #4

Open
gabrielpreda opened this issue Aug 24, 2018 · 5 comments

@gabrielpreda
Owner

Continue the exploratory data analysis, perform feature engineering, add sentiment analysis-based features
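
For illustration, a minimal sketch of what sentiment-based features could look like (using VADER via nltk is only one option, and the column name is an assumption):

import pandas as pd
# Requires: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def add_sentiment_features(df, text_column="body"):
    # Score each text with VADER and append the four polarity scores
    # (neg, neu, pos, compound) as numeric feature columns.
    sia = SentimentIntensityAnalyzer()
    scores = df[text_column].fillna("").map(sia.polarity_scores)
    return df.join(pd.DataFrame(list(scores), index=df.index).add_prefix("sentiment_"))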


vitalie-cracan commented Sep 12, 2018

fyi

I tried a few approaches that use GloVe word representations (https://nlp.stanford.edu/projects/glove/, glove.6B.300d), but none achieved a higher score than the current ones, and some scored significantly lower (e.g. business_service).

Approach 1:

Use the mean GloVe representation of the subject and the mean representation of the body (i.e. treat each as a bag of words). Concatenate the two vectors and use the result as features.
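
For reference, a rough sketch of Approach 1 (the original code was not kept; glove_vectors here is assumed to be a dict mapping words to 300-d numpy arrays):

import numpy as np

WORD_VECTOR_DIMENSION = 300

def mean_vector(text, glove_vectors):
    # Bag-of-words mean: average the GloVe vectors of the known words.
    words = str(text).strip().split()
    vectors = [glove_vectors[w] for w in words if w in glove_vectors]
    if not vectors:
        return np.zeros(WORD_VECTOR_DIMENSION)
    return np.mean(vectors, axis=0)

def subject_body_features(subject, body, glove_vectors):
    # Concatenate the two mean vectors into one 600-d feature vector.
    return np.concatenate([mean_vector(subject, glove_vectors),
                           mean_vector(body, glove_vectors)])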

Approach 2:

Use the tf-idf score of each word (from TfidfVectorizer) as a weight when summing up the vector representations of the words in the body. Use the resulting body vectors as features.

This was a surprise to me; I was expecting GloVe representations to carry more information. Searching the net, it looks like others have tried similar approaches (even training GloVe on the training corpus), only to discover the same thing: plain tf-idf scores for words produce the best results.

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

Across these approaches LogisticRegression had the highest score, but it is slow. SVM is the next best (quite close) and faster. LGB (LightGBM) produced much lower scores.


vitalie-cracan commented Sep 12, 2018

Code for the second approach (I did not keep the code for the first, but I can restore it if needed):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

class Glove():
    DEFAULT_FILE_PATH = "datasets/glove.6B.300d.txt"
    WORD_VECTOR_DIMENSION = 300

    # Common/greeting words whose vectors are zeroed out so they do not
    # dominate the summed representation.
    frequent_words = ['the', 'a', 'be', 'and', 'of', 'in', 'to', 'have', 'i', 'that', 'for', 'you', 'he', 'with', 'on',
                      'dear', 'hi', 'hello', 'best', 'regards', 'thanks', 'thank', 'please']

    def __init__(self):
        print("Loading Glove vectors")
        self.glove_vectors = {}
        self.loadWordVectors()

    def loadWordVectors(self):
        # Each line of the GloVe file is "<word> <300 floats>".
        with open(self.DEFAULT_FILE_PATH, 'r', encoding='utf-8') as file:
            for line in file:
                row = line.split()
                self.glove_vectors[row[0].strip()] = np.array(row[1:]).astype(float)

    def wordToVector(self, word):
        # Frequent words and out-of-vocabulary words map to the zero vector.
        if word in self.frequent_words:
            return np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)

        word_vector = self.glove_vectors.get(word)
        if word_vector is not None:
            return word_vector

        return np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)

    def textToVector(self, text):
        # Simple (unweighted) sum of the word vectors in the text.
        vector_sum = np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)

        if isinstance(text, float):  # NaN from pandas
            return vector_sum

        if not isinstance(text, np.ndarray):
            text = text.strip().split()

        for word in text:
            vector_sum += self.wordToVector(word)

        return vector_sum

    def subjBodyToVector(self, subject, body):
        subject_vector = self.textToVector(subject)
        body_vector = self.textToVector(body)
        return np.concatenate([subject_vector, body_vector])


glove = Glove()

class GloveVectorizer(TfidfVectorizer):

    def fit_transform(self, X, y=None):
        return self.transform(X, y)

    def toGlove(self, pair):
        # Weighted sum: each word's GloVe vector is scaled by its tf-idf score.
        (i, words) = pair
        result = np.zeros(glove.WORD_VECTOR_DIMENSION, dtype=float)
        for word in words:
            j = self.vocabulary_[word]
            result += self.tfidf[i, j] * glove.wordToVector(word)
        return result

    def transform(self, X, y=None):
        # Fit tf-idf on X, then map each document to the tf-idf-weighted
        # sum of the GloVe vectors of its words.
        self.tfidf = super().fit_transform(X, y)
        newX = self.inverse_transform(self.tfidf)
        return pd.DataFrame(data=list(map(self.toGlove, enumerate(newX))))

To use it, simply replace CountVectorizer + TfidfTransformer (= TfidfVectorizer) with GloveVectorizer.

  • The helper functions in Glove do not use averages but a simple sum; GloveVectorizer uses a tf-idf-weighted sum.
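
For illustration, a hedged usage sketch (the dataframe column names and the train/valid split are assumptions; LogisticRegression is used since it scored highest in these trials):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# GloveVectorizer sits where the TfidfVectorizer used to sit in the pipeline.
pipeline = Pipeline([
    ("glove", GloveVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(train["body"], train["category"])
predictions = pipeline.predict(valid["body"])

Note that GloveVectorizer as written re-fits the tf-idf weights inside transform, so train and validation texts are weighted independently.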

@gabrielpreda
Owner Author

It is a surprise for me to hear that LGB scored lowest. You might want to push this work, adding GloVe as one of the options in the pre-processing part of the ML pipeline. It is worth keeping these trials; we might be able to improve on them later on (or even start building an ensemble approach).


vitalie-cracan commented Oct 19, 2018

Today I tried FastText: https://fasttext.cc/docs/en/supervised-tutorial.html

I used the ticket category as the label and the body as the text.

./fasttext supervised -input examples/stc/tickets.train -output examples/stc/model -lr 0.1 -epoch 25 -wordNgrams 2

tickets.train contains 40,000 items; the remaining 8,538 went into tickets.valid.
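
For reference, a rough sketch of how the input files could be produced in the format fasttext supervised expects, one example per line with the label prefixed by __label__ (the dataframe column names are assumptions):

def write_fasttext_file(df, path):
    # Each line: "__label__<category> <body text on one line>".
    with open(path, "w", encoding="utf-8") as f:
        for category, body in zip(df["category"], df["body"]):
            text = " ".join(str(body).split())  # collapse newlines and extra whitespace
            f.write(f"__label__{category} {text}\n")

write_fasttext_file(train_df, "examples/stc/tickets.train")
write_fasttext_file(valid_df, "examples/stc/tickets.valid")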

./fasttext test examples/stc/model.bin examples/stc/tickets.valid
N       8538
P@1     0.821
R@1     0.821
Number of examples: 8538

Looks like a good score, but not much better than what we have already. The advantage is that fasttext is indeed very fast to train.

Note: I will clean things up and push the Glove trials, hopefully some time next week.

@vitalie-cracan

@gabrielpreda I do not have permission to push a new branch; maybe you could restrict only the master branch but allow me to create new branches?
