Commit
1 parent 664b47a, commit 3d5a8fd
Showing 10 changed files with 211 additions and 11 deletions.
4 binary files not shown.
@@ -1 +1,3 @@
-nltk==3.5
+numpy
+nltk
+scikit_learn
@@ -1,7 +1,7 @@
Metadata-Version: 2.1
Name: irtm
-Version: 0.0.2
-Summary: A toolbox for Information Retreival & Text Mining.
+Version: 0.0.3
+Summary: A toolbox for Information Retrieval & Text Mining.
Home-page: https://github.com/KanishkNavale/IRTM-Toolbox.git
Author: Kanishk Navale
Author-email: [email protected]
@@ -44,7 +44,7 @@ from irtm.toolbox import *
>>> 'M466'
```

-2. Tokenizer: Convert a sequence of characters into a sequence of tokens.
+2. Tokenizer: Converts a sequence of characters into a sequence of tokens.

```python
print(tokenize('LINUX'))
@@ -56,4 +56,47 @@ from irtm.toolbox import *
>>> ['text', 'mining']
```
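
The tokenizer in item 2 turns a character sequence into tokens. As a rough illustration of that idea only, the sketch below lowercases the input and keeps runs of letters and digits. `simple_tokenize` is a hypothetical name, not part of irtm, and irtm's own `tokenize` may do more than this (the package lists nltk as a dependency).

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in: lowercase, then keep runs of letters and digits."""
    return re.findall(r'[a-z0-9]+', text.lower())


tokens = simple_tokenize('Text Mining')
print(tokens)  # ['text', 'mining']
```
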

3. Vectorize: Converts a string to a token-based weight tensor.

```python
vector = vectorize([
    'texts ([string]): a multiline or a single line string.',
    'dict ([list], optional): list of tokens. Defaults to None.',
    'enable_Idf (bool, optional): use IDF or not. Defaults to True.',
    'normalize (str, optional): normalization of vector. Defaults to l2.',
    'max_dim ([int], optional): dimension of vector. Defaults to None.',
    'smooth (bool, optional): restricts value >0. Defaults to True.',
    'weightedTf (bool, optional): Tf = 1+log(Tf). Defaults to True.',
    'return_features (bool, optional): feature vector. Defaults to False.'
])

print(f'Vector Shape={vector.shape}')
```

```bash
>>> Vector Shape=(8, 37)
```
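
The options named in the example strings (IDF on or off, `1+log(Tf)` weighting, smoothing, l2 normalization) describe a fairly standard TF-IDF scheme. The following is a minimal pure-Python sketch of such a scheme, not irtm's implementation: `tfidf_vectorize` and its parameter names are hypothetical, whitespace splitting stands in for the real tokenizer, and the smoothed IDF formula `ln((1+N)/(1+df)) + 1` is borrowed from the scikit-learn convention as an assumption.

```python
import math
from collections import Counter


def tfidf_vectorize(texts, weighted_tf=True, enable_idf=True, smooth=True, normalize=True):
    """Hypothetical TF-IDF sketch; returns (vocabulary, list of row vectors)."""
    docs = [text.lower().split() for text in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))

    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for tok in vocab:
            tf = counts[tok]
            if tf and weighted_tf:
                tf = 1.0 + math.log(tf)  # sublinear term frequency
            if not enable_idf:
                idf = 1.0
            elif smooth:
                idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0
            else:
                idf = math.log(n_docs / df[tok]) + 1.0
            row.append(tf * idf)
        if normalize:
            norm = math.sqrt(sum(w * w for w in row))
            if norm > 0:
                row = [w / norm for w in row]  # l2 normalization per document
        rows.append(row)
    return vocab, rows


vocab, matrix = tfidf_vectorize([
    'text mining toolbox',
    'information retrieval toolbox',
    'text retrieval',
])
print(len(matrix), len(vocab))  # 3 5
```

As in the README output, the result has one row per input string and one column per vocabulary token.
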

4. Predict Token Weights: Computes the importance of a token based on classification optimization.

```python
import numpy as np

dictionary = ['vector', 'string', 'bool']
vector = vectorize([
    'X ([np.array]): vectorized matrix columns arranged as per the dictionary.',
    'y ([labels]): True classification labels.',
    'epochs ([int]): Optimization epochs.',
    'verbose (bool, optional): Enable verbose outputs. Defaults to False.',
    'dict ([type], optional): list of tokens. Defaults to None.'
], dict=dictionary)

# Note: randint(1, ...) draws from [0, 1), so every label here is 0.
labels = np.random.randint(1, size=(vector.shape[0], 1))
weights = predict_weights(vector, labels, 100, dict=dictionary)
```

```bash
>>> Token-Weights Mappings: {'vector': 0.22097790924850977,
                             'string': 0.39296369957440075,
                             'bool': 0.689853175081446}
```
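
What "classification optimization" means inside `predict_weights` is not spelled out in this diff. One common reading is to fit a linear classifier and treat its per-token coefficients as importances. The sketch below follows that assumption with a small logistic regression trained by batch gradient descent in plain Python. `fit_token_weights`, the learning rate, and the toy data are all hypothetical and are not taken from irtm.

```python
import math


def fit_token_weights(X, y, tokens, epochs=100, lr=0.5):
    """Hypothetical helper: logistic regression via batch gradient descent."""
    weights = [0.0] * len(tokens)
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(tokens)
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            pred = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = pred - label
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        # Average the gradients over the batch and take one step.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return dict(zip(tokens, weights))


tokens = ['vector', 'string', 'bool']
# Toy document-term matrix: columns follow the order of `tokens`.
X = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
]
y = [0, 0, 1, 1]  # two classes, so the fit has something to separate

token_weights = fit_token_weights(X, y, tokens)
print(token_weights)
```

In this toy data 'bool' only appears in class 1 and 'vector' only in class 0, so the fitted weight for 'bool' comes out positive and the one for 'vector' negative, while 'string' appears equally in both classes and stays near zero.
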
@@ -1 +1,3 @@
-nltk==3.5
+numpy
+nltk
+scikit_learn