
Crash Course in Machine Learning

Anna Price edited this page Oct 17, 2019 · 7 revisions

What is machine learning?

Machine learning algorithms automatically learn from data without being explicitly programmed. They can be broadly split into three types: supervised, unsupervised, and reinforcement learning. In classification tasks, a machine learning algorithm works by inferring a function that maps input data to discrete classes. This mapping can be saved as a machine learning model and used to classify new, unseen data. Building a machine learning model requires a large training dataset.

Training, testing and validation datasets

A training dataset is used to train the machine learning algorithm; it should reflect the "real world" data that the resulting saved model is likely to encounter. The testing dataset acts as an early evaluation of the machine learning model during training. After evaluating the results on the testing dataset, the model can be further fine-tuned. Although the algorithm is not explicitly fitted to the testing dataset, the testing dataset is used during the training process, so it alone does not provide a fair evaluation of the model. A validation dataset of completely unseen data is used to evaluate the finalised model, giving an unbiased estimate of its performance.
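
The three-way split described above can be sketched in plain Python. The 70/15/15 fractions and the `split_dataset` helper are illustrative choices, not part of NLP-Bio-Tools; real splits depend on dataset size.

```python
import random

def split_dataset(items, train_frac=0.7, test_frac=0.15, seed=42):
    """Shuffle a dataset and split it into train / test / validation parts.

    The fractions and seed here are illustrative defaults; the remainder
    after the train and test portions becomes the validation set.
    """
    rng = random.Random(seed)        # fixed seed makes the split reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

docs = list(range(100))              # stand-in for 100 labelled documents
train, test, validation = split_dataset(docs)
```

Shuffling before splitting matters: if the documents are ordered by class, an unshuffled split would put all of one class in the training set.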

Building a training dataset

NLP-Bio-Tools uses supervised machine learning for binary classification. To build a model, a large labelled training dataset is required that includes both positive (belonging to a group) and negative (not belonging to a group) classes. To avoid bias, the training dataset should be balanced, with an equal number of positive and negative documents. If you have a severely unbalanced dataset, you should consider undersampling or oversampling one of the classes.
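
As a minimal sketch of random undersampling (the `undersample` helper is hypothetical, not part of NLP-Bio-Tools; libraries such as imbalanced-learn offer more sophisticated strategies):

```python
import random

def undersample(pairs, seed=0):
    """Randomly undersample the majority class so both classes are equal size.

    pairs: list of (document, label) tuples with labels 0 or 1.
    Returns a shuffled, balanced list of (document, label) tuples.
    """
    rng = random.Random(seed)
    pos = [p for p in pairs if p[1] == 1]
    neg = [p for p in pairs if p[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep all minority examples; sample an equal number from the majority.
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy example: 10 positive vs 90 negative documents.
pairs = [(f"doc{i}", 1) for i in range(10)] + [(f"doc{i}", 0) for i in range(10, 100)]
balanced = undersample(pairs)
```

Undersampling discards data, so when the minority class is very small, oversampling (duplicating or synthesising minority examples) may be preferable.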

Natural Language Processing (NLP)

The accuracy of a machine learning model depends greatly on the quality of the input data. In text classification problems, it is common to use an NLP pipeline to produce a clean and concise representation of the input documents. NLP pipelines usually include the following steps:

  • tokenization: each word and punctuation mark is separated out into individual tokens
  • removing punctuation and numbers: these are usually irrelevant data so they are removed
  • removing stopwords: common stopwords such as a, and, or, the, ... are removed
  • stemming: reducing words to their base form
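
The steps above can be sketched in plain Python. The stopword list and `crude_stem` suffix-stripper here are deliberately tiny illustrations; real pipelines use fuller stopword lists and proper stemmers such as NLTK's Porter stemmer.

```python
import string

# Small illustrative stopword list -- real pipelines use much fuller lists.
STOPWORDS = {"a", "an", "and", "or", "the", "of", "in", "is", "are", "to"}

def crude_stem(token):
    """Very rough suffix stripping; real pipelines use e.g. the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    # 1. Tokenize on whitespace (real tokenizers also split punctuation apart).
    tokens = text.lower().split()
    # 2. Strip punctuation and drop tokens that are numbers or empty.
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t and not t.isdigit()]
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token.
    return [crude_stem(t) for t in tokens]

print(preprocess("The 2 proteins are binding to receptors."))
# → ['protein', 'bind', 'receptor']
```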

Feature Selection

We select the top features from the NLP-processed text that we want our machine learning algorithm to fit to. This further improves the quality of the data that we pass to the algorithm. Many methods for feature selection exist, including: bag-of-words (i.e. basic token counts), term frequency-inverse document frequency (tf-idf), information gain, and chi-square.
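
As an illustration of one of these methods, here is a basic tf-idf computation over tokenised documents. This is a sketch of the textbook formula, not NLP-Bio-Tools' implementation; libraries such as scikit-learn use smoothed variants of the idf term.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for each document.

    docs: list of token lists (e.g. the output of an NLP pipeline).
    tf   = term count / document length
    idf  = log(number of documents / number of documents containing the term)
    """
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

weights = tfidf([["protein", "bind"], ["protein", "cell"]])
```

Note that a term appearing in every document (like "protein" above) gets weight zero: tf-idf down-weights terms that carry no discriminative information.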

Machine learning algorithms for text classification

It is important to choose an ML algorithm that is appropriate for your data. Algorithms that generally perform well for text classification tasks include: Naive Bayes, k-Nearest Neighbours, Regression Models, Decision Trees, Support Vector Machines, and Neural Networks.
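
NLP-Bio-Tools' own training code is not shown here; as an illustration of one algorithm from the list, the sketch below trains a multinomial Naive Bayes text classifier with scikit-learn. The toy documents and labels are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: label 1 = biology-related, label 0 = not.
train_texts = [
    "protein binding assay",
    "protein expression levels",
    "stock market crash",
    "market trading report",
]
train_labels = [1, 1, 0, 0]

# Bag-of-words features feeding a Naive Bayes classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X, train_labels)

pred = clf.predict(vectorizer.transform(["protein assay results"]))
```

In practice the classifier would be trained on a large balanced dataset, evaluated against the testing set, and only finalised after checking against the validation set.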