Text-categorization-with-WEKA

This repository contains the data used in the experiments conducted for the paper Text categorization with WEKA: a survey by Donatella Merlini (email: donatella.merlini@unifi.it) and Martina Rossini (email: martina.rossini@stud.unifi.it).

In particular, all the multilingual recipes used for our Language Identification experiments can be found in the Recipes folder. A separate test set in ARFF format can be found here; it was used to estimate how well our models recognize the language of a generic piece of text unrelated to cooking. Note that, as stated in the paper, these short sentences are extracted from the Leipzig Text Corpora.
Moreover, the stopword_list.txt file contains the list of stopwords used for all six languages we examined. The file contains one word per line, as required by the WordsFromFile stopwordsHandler in WEKA.
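
If you edit the stopword list, a quick check of the one-word-per-line format may help before loading it into WEKA. The following is a minimal sketch, not part of the repository: the function names are hypothetical, the file is assumed to be UTF-8 encoded, and lines starting with `#` are assumed to be comments that can be skipped.

```python
from pathlib import Path


def check_stopword_lines(lines):
    """Return (line_number, text) pairs for lines that are not exactly one word."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        # Assumption: lines starting with '#' are comments and can be skipped.
        if text.startswith("#"):
            continue
        # A valid entry is a single token with no leading, trailing or inner whitespace.
        if len(text.split()) != 1 or text != text.strip():
            problems.append((number, text))
    return problems


def check_stopword_file(path):
    """Read a stopword file (assumed UTF-8) and report malformed lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return check_stopword_lines(handle)
```

Calling `check_stopword_file("stopword_list.txt")` from the repository root returns an empty list when every non-comment line holds exactly one word.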

Lastly, the second text categorization example shown in the paper focuses on detecting the type of dish a certain recipe is about. The dataset used for this part can be found in the Dishes folder.

All our experiments were conducted using WEKA version 3.8.4.

Announcement:

As of 16-04-2021, our paper is published in Elsevier's journal Machine Learning with Applications and can be accessed here.
