
Feature engineering #4

Open
gabrielpreda opened this issue Aug 24, 2018 · 5 comments

@gabrielpreda
Owner

Continue the exploratory data analysis, perform feature engineering, add sentiment analysis-based features
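
For illustration, a minimal sketch of what sentiment-based features could look like (using VADER via nltk is only one option, and the column name is an assumption):

import pandas as pd
# Requires: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def add_sentiment_features(df, text_column="body"):
    # Score each text with VADER and append the four polarity scores
    # (neg, neu, pos, compound) as numeric feature columns.
    sia = SentimentIntensityAnalyzer()
    scores = df[text_column].fillna("").map(sia.polarity_scores)
    return df.join(pd.DataFrame(list(scores), index=df.index).add_prefix("sentiment_"))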


vitalie-cracan commented Sep 12, 2018

fyi

I tried a few approaches that use GloVe word representations (https://nlp.stanford.edu/projects/glove/, glove.6B.300d), but none achieved a higher score than the current ones, and some scored significantly lower (e.g. business_service).

Approach 1:

Use the mean GloVe representation of the subject and the mean representation of the body (i.e. treat each as a bag of words). Concatenate the two vectors and use the result as features.
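
For reference, a rough sketch of Approach 1 (the original code was not kept; glove_vectors here is assumed to be a dict mapping words to 300-d numpy arrays):

import numpy as np

WORD_VECTOR_DIMENSION = 300

def mean_vector(text, glove_vectors):
    # Bag-of-words mean: average the GloVe vectors of the known words.
    words = str(text).strip().split()
    vectors = [glove_vectors[w] for w in words if w in glove_vectors]
    if not vectors:
        return np.zeros(WORD_VECTOR_DIMENSION)
    return np.mean(vectors, axis=0)

def subject_body_features(subject, body, glove_vectors):
    # Concatenate the two mean vectors into one 600-d feature vector.
    return np.concatenate([mean_vector(subject, glove_vectors),
                           mean_vector(body, glove_vectors)])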

Approach 2:

Use the tf-idf score of each word (from TfidfVectorizer) as a weight when summing up the vector representations of the words in the body. Use the resulting body vectors as features.

This was a surprise to me; I was expecting GloVe representations to carry more information. Searching the net, it looks like others have tried similar approaches (even training GloVe on the training corpus), only to discover the same thing: plain tf-idf scores for words produce the best results.

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

Across these approaches LogisticRegression had the highest score, but it is slow. SVM is the next best (quite close) and faster. LGB (LightGBM) produced much lower scores.


vitalie-cracan commented Sep 12, 2018

Code for the second approach (I did not keep the code for the first, but I can restore it if needed):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

class Glove():
    DEFAULT_FILE_PATH = "datasets/glove.6B.300d.txt"
    WORD_VECTOR_DIMENSION = 300

    # Common/greeting words whose vectors are zeroed out so they do not
    # dominate the summed representation.
    frequent_words = ['the', 'a', 'be', 'and', 'of', 'in', 'to', 'have', 'i', 'that', 'for', 'you', 'he', 'with', 'on',
                      'dear', 'hi', 'hello', 'best', 'regards', 'thanks', 'thank', 'please']

    def __init__(self):
        print("Loading Glove vectors")
        self.glove_vectors = {}
        self.loadWordVectors()

    def loadWordVectors(self):
        # Each line of the GloVe file is "<word> <300 floats>".
        with open(self.DEFAULT_FILE_PATH, 'r', encoding='utf-8') as file:
            for line in file:
                row = line.split()
                self.glove_vectors[row[0].strip()] = np.array(row[1:]).astype(float)

    def wordToVector(self, word):
        # Frequent words and out-of-vocabulary words map to the zero vector.
        if word in self.frequent_words:
            return np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)

        word_vector = self.glove_vectors.get(word)
        if word_vector is not None:
            return word_vector

        return np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)

    def textToVector(self, text):
        # Simple (unweighted) sum of the word vectors in the text.
        vector_sum = np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)

        if isinstance(text, float):  # NaN from pandas
            return vector_sum

        if not isinstance(text, np.ndarray):
            text = text.strip().split()

        for word in text:
            vector_sum += self.wordToVector(word)

        return vector_sum

    def subjBodyToVector(self, subject, body):
        subject_vector = self.textToVector(subject)
        body_vector = self.textToVector(body)
        return np.concatenate([subject_vector, body_vector])


glove = Glove()

class GloveVectorizer(TfidfVectorizer):

    def fit_transform(self, X, y=None):
        return self.transform(X, y)

    def toGlove(self, pair):
        # Weighted sum: each word's GloVe vector is scaled by its tf-idf score.
        (i, words) = pair
        result = np.zeros(glove.WORD_VECTOR_DIMENSION, dtype=float)
        for word in words:
            j = self.vocabulary_[word]
            result += self.tfidf[i, j] * glove.wordToVector(word)
        return result

    def transform(self, X, y=None):
        # Fit tf-idf on X, then map each document to the tf-idf-weighted
        # sum of the GloVe vectors of its words.
        self.tfidf = super().fit_transform(X, y)
        newX = self.inverse_transform(self.tfidf)
        return pd.DataFrame(data=list(map(self.toGlove, enumerate(newX))))

To use it, simply replace CountVectorizer + TfidfTransformer (= TfidfVectorizer) with GloveVectorizer.

  • The helper functions in Glove do not use averages but a simple sum; GloveVectorizer uses a tf-idf-weighted sum.
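
For illustration, a hedged usage sketch (the dataframe column names and the train/valid split are assumptions; LogisticRegression is used since it scored highest in these trials):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# GloveVectorizer sits where the TfidfVectorizer used to sit in the pipeline.
pipeline = Pipeline([
    ("glove", GloveVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(train["body"], train["category"])
predictions = pipeline.predict(valid["body"])

Note that GloveVectorizer as written re-fits the tf-idf weights inside transform, so train and validation texts are weighted independently.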

@gabrielpreda
Owner Author

It is a surprise for me to hear that LGB scored lowest. You might want to push this work, adding GloVe as one of the options in the pre-processing part of the ML pipeline. It is worth keeping these trials; we might be able to improve on them later on (or even start building an ensemble approach).


vitalie-cracan commented Oct 19, 2018

Today I tried FastText: https://fasttext.cc/docs/en/supervised-tutorial.html

I used the ticket category as the label and the body as the text.

./fasttext supervised -input examples/stc/tickets.train -output examples/stc/model -lr 0.1 -epoch 25 -wordNgrams 2

tickets.train contains 40,000 items; the remaining 8,538 went into tickets.valid.
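
For reference, a rough sketch of how the input files could be produced in the format fasttext supervised expects, one example per line with the label prefixed by __label__ (the dataframe column names are assumptions):

def write_fasttext_file(df, path):
    # Each line: "__label__<category> <body text on one line>".
    with open(path, "w", encoding="utf-8") as f:
        for category, body in zip(df["category"], df["body"]):
            text = " ".join(str(body).split())  # collapse newlines and extra whitespace
            f.write(f"__label__{category} {text}\n")

write_fasttext_file(train_df, "examples/stc/tickets.train")
write_fasttext_file(valid_df, "examples/stc/tickets.valid")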

./fasttext test examples/stc/model.bin examples/stc/tickets.valid
N       8538
P@1     0.821
R@1     0.821
Number of examples: 8538

Looks like a good score, but not much better than what we have already. The advantage is that fasttext is indeed very fast to train.

Note: I will clean things up and push the Glove trials, hopefully some time next week.

@vitalie-cracan

@gabrielpreda I do not have permission to push a new branch; maybe you could restrict only the master branch but allow me to create new branches?
