Here is my winning strategy for carrying out a multi-class text classification task.
Data Source : https://catalog.data.gov/dataset/consumer-complaint-database
- Word Frequency Plot: Compare frequencies across different texts and quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between text1 and text2, and between text1 and text3?
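A minimal sketch of that comparison in Python (the CSV path, the `Product` / `Consumer complaint narrative` column names and the two example categories are assumptions about the data.gov export):

```python
# Sketch: correlate word frequencies between two complaint categories.
import pandas as pd
from collections import Counter

df = pd.read_csv("complaints.csv")  # assumed local copy of the Consumer Complaint Database

def word_freqs(texts):
    """Relative word frequencies over a collection of documents."""
    counts = Counter()
    for text in texts.dropna():
        counts.update(text.lower().split())
    total = sum(counts.values())
    return pd.Series({word: n / total for word, n in counts.items()})

f1 = word_freqs(df.loc[df["Product"] == "Mortgage", "Consumer complaint narrative"])
f2 = word_freqs(df.loc[df["Product"] == "Student loan", "Consumer complaint narrative"])

# Correlate the two frequency vectors over their shared vocabulary.
common = f1.index.intersection(f2.index)
print(f1[common].corr(f2[common]))                      # Pearson
print(f1[common].corr(f2[common], method="spearman"))   # rank-based
```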
- Most discriminant and important words per category
- Relationships between words & pairwise correlations: examining which words tend to follow others immediately, or which tend to co-occur within the same documents.
Which word is associated with another word? Note that this is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. In this case, a random generator following this model might spit out “collect”, then “agency”, then “report/credit/score”, by following each word to the most common words that follow it. To make the visualization interpretable, we chose to show only the most common word to word connections, but one could imagine an enormous graph representing all connections that occur in the text.
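A rough sketch of the underlying bigram counts (which word most often follows which), assuming `docs` is an iterable of complaint narratives; the regex tokenizer is an arbitrary choice:

```python
# Sketch: count immediate word-to-word transitions (bigrams) for the word network above.
import re
from collections import Counter, defaultdict

def bigram_counts(docs):
    follows = defaultdict(Counter)
    for text in docs:
        tokens = re.findall(r"[a-z']+", text.lower())
        for w1, w2 in zip(tokens, tokens[1:]):
            follows[w1][w2] += 1
    return follows

follows = bigram_counts(docs)
for word in ("collect", "credit"):
    print(word, "->", follows[word].most_common(3))   # most frequent followers
```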
- Distribution of words: show that all texts have similar distributions, with many words that occur rarely and few words that occur frequently. This is where Zipf's Law comes in (extended with the harmonic mean): Zipf's Law is a statistical distribution found in certain data sets, such as words in a linguistic corpus, in which the frequency of a word is inversely proportional to its rank (frequency ∝ 1/rank).
- How to find spelling variants of a given word
- Chi-square test to see which words are associated with each category: find the terms most correlated with each of the categories.
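A minimal scikit-learn sketch (here `texts` and `labels` are assumed to hold the complaint narratives and their product categories):

```python
# Sketch: chi-squared association between n-grams and each category.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5,
                             ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(texts)
features = np.array(vectorizer.get_feature_names_out())
labels = np.array(labels)

for category in np.unique(labels):
    scores, _ = chi2(X, labels == category)     # one-vs-rest chi-squared scores
    top = features[np.argsort(scores)[-5:]]
    print(category, "->", ", ".join(top))
```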
- Part-of-Speech Tags and frequency distribution of POS tags: Noun Count, Verb Count, Adjective Count, Adverb Count and Pronoun Count
- Metrics of words: Word Count of the documents (total number of words), Character Count of the documents (total number of characters), Average Word Density (average length of the words used in the documents), Punctuation Count (total number of punctuation marks in the documents), Upper Case Count (total number of upper-case words in the documents), and Title Word Count (total number of title-case words in the documents).
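All of these metrics can be derived in a few lines of pandas (a sketch; the file path and narrative column name are assumptions):

```python
# Sketch: basic per-document text metrics.
import string
import pandas as pd

df = pd.read_csv("complaints.csv")
text = df["Consumer complaint narrative"].fillna("")

df["word_count"] = text.str.split().str.len()
df["char_count"] = text.str.len()
df["avg_word_length"] = df["char_count"] / df["word_count"].clip(lower=1)
df["punct_count"] = text.apply(lambda s: sum(ch in string.punctuation for ch in s))
df["upper_case_count"] = text.apply(lambda s: sum(w.isupper() for w in s.split()))
df["title_word_count"] = text.apply(lambda s: sum(w.istitle() for w in s.split()))
```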
- Count Vector
- TF IDF
- Co-Occurrence Matrix with a fixed context window (SVD)
- TF-ICF
- Function Aware Components
- CBOW (word2vec)
- Skip-Grams (word2vec)
- GloVe
- FastText (at the character level)
- Topic Model as features // LDA features
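As a minimal sketch of the first two representations in the list above (scikit-learn; `texts` is the list of narratives, parameters are illustrative rather than tuned):

```python
# Sketch: count and TF-IDF document-term matrices.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_vec = CountVectorizer(ngram_range=(1, 2), max_features=50000)
X_counts = count_vec.fit_transform(texts)            # sparse term counts

tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=50000,
                            sublinear_tf=True, stop_words="english")
X_tfidf = tfidf_vec.fit_transform(texts)             # sparse TF-IDF weights
```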
The visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing a deep inspection of the terms most highly associated with each individual topic. It also relies on a novel method for choosing which terms to present to a user to aid topic interpretation, based on defining the relevance of a term to a topic.
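For reference, that relevance measure is (if I recall the LDAvis definition correctly) a weighted combination of a term's topic-specific probability and its lift, with λ ∈ [0, 1] controlling the trade-off:

relevance(term w, topic k | λ) = λ · log p(w | k) + (1 − λ) · log( p(w | k) / p(w) )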
The main innovation of Poincaré embeddings (links below) is that the embeddings are learnt in hyperbolic space, as opposed to the commonly used Euclidean space. The reason is that hyperbolic space is more suitable for capturing any hierarchical information inherently present in the graph. Embedding nodes into a Euclidean space while preserving the distances between nodes usually requires a very high number of dimensions.
https://arxiv.org/pdf/1705.08039.pdf https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb
Learning representations of symbolic data such as text, graphs and multi-relational data has become a central paradigm in machine learning and artificial intelligence. For instance, word embeddings such as word2vec, GloVe and fastText are widely used for tasks ranging from machine translation to sentiment analysis.
Typically, the objective of embedding methods is to organize symbolic objects (e.g., words, entities, concepts) in a way such that their similarity in the embedding space reflects their semantic or functional similarity. For this purpose, the similarity of objects is usually measured either by their distance or by their inner product in the embedding space. For instance, Mikolov et al. embed words in R^d such that their inner product is maximized when words co-occur within similar contexts in text corpora. This is motivated by the distributional hypothesis, i.e., that the meaning of words can be derived from the contexts in which they appear.
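A minimal sketch with gensim's PoincareModel (the toy relations below are made-up hypernym pairs; in practice they would come from an actual hierarchy, e.g. product → sub-product):

```python
# Sketch: learning hyperbolic (Poincaré) embeddings of a small hierarchy with gensim.
from gensim.models.poincare import PoincareModel

relations = [                       # (child, parent) edges of a toy hierarchy
    ("mortgage", "loan"), ("student_loan", "loan"),
    ("loan", "financial_product"), ("credit_card", "financial_product"),
]
model = PoincareModel(relations, size=2, negative=2)
model.train(epochs=50)

print(model.kv.distance("mortgage", "loan"))   # hyperbolic distance between nodes
```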
- CountVectorizer + Logistic
- CountVectorizer + NB
- CountVectorizer + LightGBM
- HashingTF + IDF + Logistic Regression
- TFIDF + NB
- TFIDF + LightGBM
- TF-IDF + SVM
- Hashing Vectorizer + Logistic
- Hashing Vectorizer + NB
- Hashing Vectorizer + LightGBM
- Bagging / Boosting
- Word2Vec + Logistic
- Word2Vec + LightGBM
- Word2Vec + XGBoost
- LSA + SVM
- GRU + Attention Mechanism
- CNN + RNN + Attention Mechanism
- CNN + LSTM/GRU + Attention Mechanism
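As a baseline among the combinations listed above, a TF-IDF + Logistic Regression pipeline could look like this (a sketch; `texts`/`labels` hold the narratives and product categories, hyperparameters untuned):

```python
# Sketch: one of the simpler pipelines listed above (TF-IDF + Logistic Regression).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5, sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```

Swapping LogisticRegression for MultinomialNB, LightGBM, an SVM, etc. leaves the rest of the pipeline unchanged, which makes it easy to benchmark the combinations above.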
Goal: explain predictions of arbitrary classifiers, including text classifiers (when it is hard to get an exact mapping between model coefficients and text features, e.g. if there is dimensionality reduction involved)
- Lime
- Skater
- Shap
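A minimal LIME sketch for such a pipeline (reusing the fitted `pipeline` and held-out `X_test` from the TF-IDF + Logistic Regression sketch above; LIME works on the raw text through `predict_proba`, so no mapping back from reduced dimensions is needed):

```python
# Sketch: explaining one prediction of the fitted text pipeline with LIME.
import numpy as np
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=list(pipeline.classes_))

sample = list(X_test)[0]                                   # one raw complaint narrative
pred = int(np.argmax(pipeline.predict_proba([sample])[0]))
exp = explainer.explain_instance(sample, pipeline.predict_proba,
                                 num_features=10, labels=(pred,))
print(exp.as_list(label=pred))   # (word, weight) pairs pushing towards/away from `pred`
```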
- All models: https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
- CNN Text Classification: https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb
- CNN Multichannel Text Classification + Hierarchical attention + …: https://github.com/gaurav104/TextClassification/blob/master/CNN%20Multichannel%20Text%20Classification.ipynb
- Notes for Deep Learning: https://arxiv.org/pdf/1808.09772.pdf
- Doc classification with NLP: https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb
- Paragraph Topic Classification: http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf
- 1D convolutional neural networks for NLP: https://github.com/Tixierae/deep_learning_NLP/blob/master/cnn_imdb.ipynb
- Hierarchical Attention for text classification: https://github.com/Tixierae/deep_learning_NLP/blob/master/HAN/HAN_final.ipynb
- Multi-class classification with scikit-learn (Random forest, SVM, logistic regression): https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f and https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb
- Text feature extraction, TF-IDF mathematics: https://dzone.com/articles/machine-learning-text-feature-0
- Classification of Yelp Reviews (AWS): http://www.developintelligence.com/blog/2017/06/practical-neural-networks-keras-classifying-yelp-reviews/
- Convolutional Neural Networks for Text Classification (wow): http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ and https://github.com/davidsbatista/ConvNets-for-sentence-classification
- 3 ways to interpretate your NLP model [Lime, ELI5, Skater]: https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb and https://towardsdatascience.com/3-ways-to-interpretate-your-nlp-model-to-management-and-customer-5428bc07ce15 and https://medium.freecodecamp.org/how-to-improve-your-machine-learning-models-by-explaining-predictions-with-lime-7493e1d78375
- Deep Learning for text made easy with AllenNLP: https://medium.com/swlh/deep-learning-for-text-made-easy-with-allennlp-62bc79d41f31
- Ensemble Classifiers: https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/
- **Classification Algorithms** [tfidf, count features, logistic regression, naive bayes, svm, xgboost, grid search, word vectors, LSTM, GRU, Ensembling]: https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle/notebook
- Deep learning architectures [TextCNN, BiDirectional RNN (LSTM/GRU), Attention Models]: https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/ and https://www.kaggle.com/mlwhiz/attention-pytorch-and-keras
- CNN + Word2Vec and LSTM + Word2Vec: https://www.kaggle.com/kakiac/deep-learning-4-text-classification-cnn-bi-lstm
- Comparison of models [Bag of Words - CountVectorizer Features, TFIDF Features, Hashing Features, Word2vec Features]: https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/
- Embed, encode, attend, predict: https://explosion.ai/blog/deep-learning-formula-nlp
- Nice visualization for understanding CNNs: http://www.thushv.com/natural_language_processing/make-cnns-for-nlp-great-again-classifying-sentences-with-cnns-in-tensorflow/
- Yelp comments classification [LSTM, LSTM + CNN]: https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb
- RNN text classification: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- CNN for Sentence Classification & DCNN for Modelling Sentences & VDNN for Text Classification & Multi Channel Variable size CNN & Multi Group Norm Constraint CNN & RACNN Neural Networks for Text Classification: https://bicepjai.github.io/machine-learning/2017/11/10/text-class-part1.html
- Transformers: https://towardsdatascience.com/transformers-141e32e69591
- Seq2Seq: https://guillaumegenthial.github.io/sequence-to-sequence.html
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning): https://jalammar.github.io/
- LSTM & GRU explanation: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
- Text classification using attention mechanism in Keras: http://androidkt.com/text-classification-using-attention-mechanism-in-keras/
- Bernoulli Naive Bayes & Multinomial Naive Bayes & Random Forests & Linear SVM & SVM with non-linear kernel: https://github.com/irfanelahi-ds/document-classification-python/blob/master/document_classification_python_sklearn_nltk.ipynb and https://richliao.github.io/
- DL text classification: https://gitlab.com/the_insighters/data-university/nuggets/document-classification-with-deep-learning
- 1-D Convolutions over text: http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ and https://github.com/davidsbatista/ConvNets-for-sentence-classification/blob/master/Convolutional-Neural-Networks-for-Sentence-Classification.ipynb
- [Bonus] Sentiment Analysis in PySpark: https://github.com/tthustla/setiment_analysis_pyspark/blob/master/Sentiment%20Analysis%20with%20PySpark.ipynb
- RNN Text Generation: https://github.com/priya-dwivedi/Deep-Learning/blob/master/RNN_text_generation/RNN_project.ipynb
- Finding similar documents with Word2Vec and Soft Cosine Measure: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
- [!! ESSENTIAL !!] Text Classification with Hierarchical Attention Networks: https://humboldt-wi.github.io/blog/research/information_systems_1819/group5_han/
- [ESSENTIAL for any NLP Project]: https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
- Doc2Vec + Logistic Regression: https://github.com/susanli2016/NLP-with-Python/blob/master/Doc2Vec%20Consumer%20Complaint_3.ipynb
- Doc2Vec -> just embedding: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
- New way of embedding -> Poincaré Embeddings: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb
- Doc2Vec + Text similarity: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
- Graph Link predictions + Part-of-Speech tagging tutorial with Keras: https://github.com/Cdiscount/IT-Blog/tree/master/scripts/link-prediction & https://techblog.cdiscount.com/link-prediction-in-large-scale-networks/
- Other Topics - Text Similarity [Word Mover's Distance]
- Finding similar documents with Word2Vec and WMD: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
- Introduction to Wasserstein metric (earth mover's distance): https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/
- Earthmover Distance: https://jeremykun.com/2018/03/05/earthmover-distance/ (the problem: compute the distance between points with uncertain locations, given by samples, differing observations, or clusters)
- Word Mover's Distance calculation between word pairs of two documents: https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents
- Word Mover's Distance (WMD) for Python: https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb
- [LECTURES] Computational Optimal Transport: https://optimaltransport.github.io/pdf/ComputationalOT.pdf
- Computing the Earth Mover's Distance under Transformations: http://robotics.stanford.edu/~scohen/research/emdg/emdg.html
- [LECTURES] Slides WMD: http://robotics.stanford.edu/~rubner/slides/sld014.htm
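A minimal WMD sketch with gensim, tying the resources above together (the pretrained vector name is just one of the gensim-data downloads; `wmdistance` needs the optional POT/pyemd dependency depending on the gensim version):

```python
# Sketch: Word Mover's Distance between two short complaint-like texts.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")      # any KeyedVectors with good coverage works

doc1 = "the collection agency keeps calling about my credit report".split()
doc2 = "debt collectors contact me daily regarding my credit score".split()

print(wv.wmdistance(doc1, doc2))   # lower = more similar under WMD
```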
- BOW + XGBoost, Word-level TF-IDF + XGBoost, N-gram-level TF-IDF + XGBoost, Character-level TF-IDF + XGBoost: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Xgboost_bow_tfidf.ipynb
8 - Other Topics - Topic Modeling LDA
https://github.com/FelixChop/MediumArticles/blob/master/LDA-BBC.ipynb
https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb
- TF-IDF + K-means & Latent Dirichlet Allocation (with Bokeh): https://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html
- [!! ESSENTIAL !!] Building an LDA-based Book Recommender System: https://humboldt-wi.github.io/blog/research/information_systems_1819/is_lda_final/
- Text generation with a Variational Autoencoder: https://github.com/NicGian/text_VAE
- Variational text inference: https://github.com/s4sarath/Deep-Learning-Projects/tree/master/variational_text_inference and https://s4sarath.github.io/2016/11/23/variational_autoenocder_for_Natural_Language_Processing