This repository has been archived by the owner on May 7, 2018. It is now read-only.

A Python function to break down hashtags or compound words created by putting together multiple words


cortexml/HashTagSplitter

HashTagSplitter

A recursive Python function that breaks hashtags or compound words down into their component words.

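My implementation of the maximum matching algorithm used in Natural Language Processing (NLP) to split compound words or hashtags into multiple words. Rather than returning a single split, the function returns every segmentation the dictionary allows.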
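For comparison, textbook maximum matching is greedy: at each position it takes the longest dictionary word that fits and moves on. The sketch below illustrates that baseline only; it is not the code in this repository, and `max_match` and `TOY_VOCABULARY` are hypothetical names invented for the example.

```python
def max_match(text, vocabulary):
    """Greedy forward maximum matching: repeatedly take the longest known prefix."""
    words = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, shrinking until a vocabulary word matches.
        for end in range(len(text), start, -1):
            if text[start:end] in vocabulary:
                words.append(text[start:end])
                start = end
                break
        else:
            # No known word starts here: emit the single character and move on.
            words.append(text[start])
            start += 1
    return words


# Hypothetical toy vocabulary, chosen only to make the example self-contained.
TOY_VOCABULARY = {"play", "to", "tow", "in", "win"}

print(max_match("playtowin", TOY_VOCABULARY))  # ['play', 'tow', 'in']
```

The greedy pass commits to "tow" because it is longer than "to", which is why listing all possible splits can be more useful than returning only one.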

Example Usage:

>>> split_hashtag_to_words_all_possibilities("edgeofentertainment")
[['edge', 'of', 'entertainment']]

>>> split_hashtag_to_words_all_possibilities("playtowin")
[['play', 'tow', 'in'], ['play', 'to', 'win']]

>>> split_hashtag_to_words_all_possibilities("datascience")
[['data', 'science'], ['da', 'ta', 'science']]

>>> split_hashtag_to_words_all_possibilities("superbowl")
[['superb', 'owl'], ['super', 'bowl'], ['sup', 'er', 'bowl']]

As the examples show, the output depends entirely on the quality and vocabulary of the dictionary used.
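To make the dictionary dependence concrete, here is a minimal sketch of a recursive splitter run against two vocabularies. It is an illustration under assumptions, not the repository's implementation: `all_splits` and both word sets are hypothetical, and the order of the results may differ from the output shown above.

```python
def all_splits(text, vocabulary):
    """Return every way to segment `text` into words drawn from `vocabulary`."""
    if not text:
        return [[]]  # exactly one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep this prefix and recursively split whatever remains.
            for rest in all_splits(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical vocabularies: the second adds shorter and overlapping entries.
small = {"super", "bowl"}
large = small | {"superb", "owl", "sup", "er"}

print(all_splits("superbowl", small))  # [['super', 'bowl']]
print(all_splits("superbowl", large))  # three candidate splits
```

With the small vocabulary only one split survives; adding entries such as "sup" and "er" introduces the less plausible candidates.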

TODO:

Build an n-gram model from an nltk corpus to order the possibilities by probability of occurrence/usage, and display only the 3 to 5 most probable possibilities.
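One possible shape for this TODO, sketched with a unigram model (the simplest n-gram case) and add-one smoothing. Everything here is an assumption for illustration: `score`, `rank_splits`, and the counts in `TOY_COUNTS` are made up, and a real version would take its counts from a corpus instead.

```python
import math


def score(words, counts, total):
    """Log-probability of a split under a unigram model with add-one smoothing."""
    vocab_size = len(counts)
    return sum(
        math.log((counts.get(word, 0) + 1) / (total + vocab_size))
        for word in words
    )


def rank_splits(candidates, counts, top_k=3):
    """Sort candidate splits from most to least probable and keep the top_k."""
    total = sum(counts.values())
    ranked = sorted(
        candidates,
        key=lambda words: score(words, counts, total),
        reverse=True,
    )
    return ranked[:top_k]


# Invented word frequencies; not taken from any real corpus.
TOY_COUNTS = {"play": 50, "to": 900, "win": 40, "tow": 2, "in": 700}
candidates = [["play", "tow", "in"], ["play", "to", "win"]]

print(rank_splits(candidates, TOY_COUNTS))
# [['play', 'to', 'win'], ['play', 'tow', 'in']]
```

A unigram score tends to favour splits with fewer words, since every extra word multiplies in another probability below one; a bigram model would also account for which words tend to follow each other.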
