Feature engineering #4
Comments
fyi, I tried a few approaches that use GloVe word representations (https://nlp.stanford.edu/projects/glove/, glove.6B.300d), but none achieved a higher score than the current ones, and some scored significantly lower (e.g. business_service).
Approach 1: Use the mean representation for the subject and the mean representation for the body (so treat them as BOW). Concatenate the two vectors and use them as features.
Approach 2: Use the tf-idf scores of words from TfidfVectorizer as weights when summing up the vector representations of the words in the body. Use the resulting vector representation of the body as features.
It was a surprise for me; I was expecting GloVe representations to carry more information. Searching the net, it looks like others have tried similar approaches (even training GloVe on the training corpus), only to discover the same thing: tf-idf scores for words produce the best results. http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
In these approaches LogisticRegression had the highest score, but is slow. SVM is the next best (quite close), and faster. LGB produced much lower scores.
|
Code for the second approach (I did not keep the one for the first, but I can restore it if needed):
import numpy as np
import pandas as pd
#from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class Glove():
    DEFAULT_FILE_PATH = "datasets/glove.6B.300d.txt"
    WORD_VECTOR_DIMENSION = 300
    glove_vectors = {}
    not_found_words = []
    # Very common words and e-mail pleasantries are mapped to the zero vector.
    frequent_words = ['the', 'a', 'be', 'and', 'of', 'in', 'to', 'have', 'i', 'that', 'for', 'you', 'he', 'with', 'on',
                      'dear', 'hi', 'hello', 'best', 'regards', 'thanks', 'thank', 'please']

    def __init__(self):
        print("Loading Glove vectors")
        self.loadWordVectors()

    def loadWordVectors(self):
        # Each line of the GloVe file is: word v1 v2 ... v300
        with open(self.DEFAULT_FILE_PATH, 'r', encoding='utf-8') as file:
            for line in file:
                row = line.split()
                self.glove_vectors[row[0].strip()] = np.array(row[1:]).astype(float)

    def wordToVector(self, word):
        # Frequent and unknown words both map to the zero vector.
        zero = np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)
        if word in self.frequent_words:
            return zero
        word_vector = self.glove_vectors.get(word)
        if word_vector is not None:
            return word_vector
        return zero

    def textToVector(self, text):
        # Unweighted sum of the word vectors in the text.
        vector_sum = np.zeros(self.WORD_VECTOR_DIMENSION, dtype=float)
        if isinstance(text, float):  # nan
            return vector_sum
        if type(text) != np.ndarray:
            text = text.strip().split()
        for word in text:
            vector_sum += self.wordToVector(word)
        return vector_sum

    def subjBodyToVector(self, subject, body):
        subject_vector = self.textToVector(subject)
        body_vector = self.textToVector(body)
        return np.concatenate([subject_vector, body_vector])


glove = Glove()


class GloveVectorizer(TfidfVectorizer):
    def fit_transform(self, X, y = None):
        return self.transform(X, y)

    def toGlove(self, pair):
        # pair is (document index, words present in that document)
        (i, words) = pair
        result = np.zeros(glove.WORD_VECTOR_DIMENSION, dtype=float)
        for word in words:
            j = self.vocabulary_[word]
            # Weight each word vector by the word's tf-idf score in document i.
            result += self.tfidf[i, j] * glove.wordToVector(word)
        return result

    def transform(self, X, y = None):
        # Note: this refits tf-idf on every call, so test data gets its own idf weights.
        self.tfidf = super().fit_transform(X, y)
        newX = self.inverse_transform(self.tfidf)
        return pd.DataFrame(data = list(map(self.toGlove, enumerate(newX))))

To use it, simply replace CountVectorizer + TfidfTransformer (=TfidfVectorizer) with GloveVectorizer.
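For reference, the first approach (mean vector of the subject concatenated with the mean vector of the body) could be sketched roughly as follows. This is not the original code, which was not kept. The tiny `toy_vectors` table, the 3-dimensional vectors and the function names `mean_vector` and `subject_body_features` are made up for illustration, and plain Python lists stand in for NumPy arrays so the snippet runs without the GloVe file. It also assumes that unknown words are skipped when averaging, which may differ from what was actually tried.

```python
# Illustrative sketch of approach 1, not the original code.
# toy_vectors is a made-up stand-in for the glove.6B.300d table.
DIM = 3
toy_vectors = {
    "server": [1.0, 0.0, 2.0],
    "down": [3.0, 2.0, 0.0],
    "password": [0.0, 4.0, 4.0],
}


def mean_vector(text, vectors=toy_vectors, dim=DIM):
    """Average the vectors of the known words in text (bag of words)."""
    if not isinstance(text, str):  # e.g. NaN for a missing field
        return [0.0] * dim
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:  # no known words: fall back to the zero vector
        return [0.0] * dim
    # zip(*found) iterates over the vector components (columns)
    return [sum(column) / len(found) for column in zip(*found)]


def subject_body_features(subject, body):
    """Concatenate the mean subject vector and the mean body vector."""
    return mean_vector(subject) + mean_vector(body)


features = subject_body_features("Server down", "password reset for the server")
print(features)
```

With the real embeddings each half would have 300 dimensions, so every ticket would be described by 600 features.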
|
It is a surprise to me to hear about the lowest score with LGB. You might want to push this work, adding GloVe as one of the options in the pre-processing part of the ML pipeline. It is worth keeping these approaches; we might be able to improve on them later on (or even start building an ensemble approach).
|
Today I tried FastText: https://fasttext.cc/docs/en/supervised-tutorial.html
I used category as the label and body as the text. tickets.train contains 40000 items and tickets.valid the rest, 8538.
./fasttext test examples/stc/model.bin examples/stc/tickets.valid
N 8538
P@1 0.821
R@1 0.821
Number of examples: 8538
Looks like a good score, but not much better than what we have already. The advantage is that fastText is indeed very fast to train.
Note: I will clean things up and push the GloVe trials, hopefully some time next week.
|
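fastText's supervised mode expects one example per line, with the label prefixed by `__label__`. A minimal sketch of how the tickets could be written out in that format and split into a training and a validation part is below. The preprocessing actually used was not posted, so the sample `tickets` records, the integer category values and the helper names `to_fasttext_line` and `split_lines` are assumptions. Since each ticket carries exactly one label, P@1 and R@1 are necessarily equal, as in the output above.

```python
# Illustrative sketch: write (category, body) pairs in fastText's
# supervised format, "__label__<category> <text>", one example per line.
# The sample records and helper names are hypothetical.


def to_fasttext_line(category, body):
    """Format one ticket as a single fastText training line."""
    text = " ".join(str(body).split())  # collapse newlines and extra spaces
    return f"__label__{category} {text}"


def split_lines(lines, n_train):
    """Use the first n_train lines for training, the rest for validation."""
    return lines[:n_train], lines[n_train:]


tickets = [
    {"category": 4, "body": "Cannot log in\nto the portal"},
    {"category": 6, "body": "Please order a new laptop"},
    {"category": 4, "body": "Password expired"},
]

lines = [to_fasttext_line(t["category"], t["body"]) for t in tickets]
train, valid = split_lines(lines, n_train=2)  # the thread used 40000 / 8538

# Writing the files would then be, for example:
# with open("tickets.train", "w", encoding="utf-8") as f:
#     f.write("\n".join(train) + "\n")
print(train[0])
```

Collapsing whitespace matters because a newline inside a ticket body would otherwise be read as the start of a new example.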
@gabrielpreda I do not have permissions to push a new branch, maybe you could restrict the master branch but allow me to create new branches? |
Continue the exploratory data analysis, perform feature engineering, add sentiment analysis-based features