This repository contains the data used in the experiments for the paper "Text categorization with WEKA: a survey" by Donatella Merlini (email: [email protected]) and Martina Rossini (email: [email protected]).
In particular, all the multilingual recipes used for our Language Identification experiments can be found in the Recipes folder. A separate test set in ARFF format can be found here; it was used to estimate how well our models recognize the language of a generic piece of text that has nothing to do with cooking. Note that, as stated in the paper, these short sentences are extracted from the Leipzig Text Corpora.
Moreover, stopword_list.txt contains the list of stopwords used for all six languages we examined. The file contains one word per line, as required by the WordsFromFile stopwordsHandler in WEKA.
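If you edit or extend the stopword list, a quick check of the one-word-per-line layout can catch mistakes before the file is handed to WEKA. The following is a minimal sketch in Python, not part of the original experiments. The function name find_malformed_lines, the UTF-8 encoding, and the sample words are all assumptions made for illustration. The check is deliberately strict: it also flags blank lines, although whether WEKA tolerates them is not verified here.

```python
import tempfile
from pathlib import Path


def find_malformed_lines(path):
    """Return (line_number, text) pairs for lines that are not exactly one word.

    Hypothetical helper: assumes the file is UTF-8 encoded.
    """
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, raw in enumerate(handle, start=1):
            text = raw.rstrip("\n")
            # A well-formed line is a single token with no inner,
            # leading, or trailing whitespace.
            if len(text.split()) != 1 or text != text.strip():
                problems.append((number, text))
    return problems


# Hypothetical sample content, NOT taken from stopword_list.txt.
sample = "the\nand\nil\nla casa\n\nder\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_stopwords.txt"
    sample_path.write_text(sample, encoding="utf-8")
    malformed = find_malformed_lines(sample_path)

# Line 4 holds two words and line 5 is empty, so both are reported.
print(malformed)
```

To check the real file, call find_malformed_lines("stopword_list.txt") from the repository root; an empty list means every line holds exactly one word.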
Lastly, the second text categorization example shown in the paper focuses on detecting the type of dish a certain recipe is about. The dataset used for this part can be found in the Dishes folder.
All our experiments were conducted using WEKA version 3.8.4.
As of 16-04-2021, our paper is published in Elsevier's journal Machine Learning with Applications and can be accessed here.