- Environment Information and How to Run
- Brief Introduction
- File Structure & Script Information & Keyword Information
- Regex Descriptions
- Sources
- Email me if you would like access to the data and the full lexicon set used in this project.
OS: macOS Catalina 10.15.7
Python Version: 3.8.3
To run: python ner.py input_file_path > output_file_path
I collected Turkish and English text data labeled for NER. Using 17 scripts, I cleaned the data and created the lexicons.
I also created keyword files in order to catch NER tags using only regex.
- "ner.py" is the main NER script.
- Lexicons are listed as follows:
"lexicon_organization.txt"
"lexicon_location.txt"
"lexicon_person.txt"
- The scripts I used to create the lexicons are under the "Scripts" folder. I used 17 scripts to gather and process the lexicon files.
- Each subfolder name is structured as "DATASET-NAME_Related".
Script files under the subfolders are named "DATASET-NAME.py".
Extracted data files are named as follows:
"DATASET-NAME_organizations.txt"
"DATASET-NAME_locations.txt"
"DATASET-NAME_persons.txt"
The same folder also contains the raw data that I used to create those content-specific txt files.
- After extracting the data from the raw datasets, I followed 6 steps to get the final lexicons. I did some copy-paste operations and used 3 scripts in the "data_prep" folder, which is a subfolder of "Scripts".
- Step 1: I created 3 files and put all the data gathered from the different sources into them according to content (organizations, locations, persons).
These files are named "lexicon_CONTENT_with_duplicate.txt".
- Step 2: I removed all duplicates with the script named "check_duplicate.py".
The resulting files are named "lexicon_CONTENT_no_duplicate.txt".
- Step 3: I moved all entries that appear in more than one file (the intersections) into "all_intersections.txt" with the script named "remove_spesific.py".
The resulting files are named "lexicon_CONTENT_no_intersection_duplicate.txt".
- Step 4: I used the "handle_insersections.py" script to find the suitable CONTENT for each word in "all_intersections.txt", which was created in Step 3.
I saved those words as "CONTENT_eklenecekler.txt" ("eklenecekler" is Turkish for "to be added").
- Step 5: I created 3 files and combined the data from Step 3 and Step 4 in them.
These files are named "lexicon_CONTENT_with_keywords.txt".
- Step 6: I used the script named "create_lexicons.py" to strip the keywords (Bankası, Üniversitesi, Bey, etc.) from each file.
These files are named "lexicon_CONTENT.txt".
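The duplicate removal (Step 2) and intersection handling (Step 3) above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the real "check_duplicate.py" and "remove_spesific.py" scripts may work differently.

```python
from collections import Counter


def remove_duplicates(entries):
    """Keep the first occurrence of each entry, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        entry = entry.strip()
        if entry and entry not in seen:
            seen.add(entry)
            unique.append(entry)
    return unique


def split_intersections(lexicons):
    """Separate entries that occur in more than one lexicon.

    lexicons maps a content name to its list of entries.
    Returns (cleaned lexicons, sorted list of shared entries).
    """
    counts = Counter()
    for entries in lexicons.values():
        counts.update(set(entries))  # count each entry once per lexicon
    shared = {entry for entry, n in counts.items() if n > 1}
    cleaned = {
        name: [entry for entry in entries if entry not in shared]
        for name, entries in lexicons.items()
    }
    return cleaned, sorted(shared)
```

The shared entries returned here play the role of "all_intersections.txt"; deciding which CONTENT each of them belongs to (Step 4) is left out of the sketch.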
- I created a script named "enamex_cleaner.py", which cleans the ENAMEX-tagged documents so that I could test my NER.
- I created a script named "find_line.py", which finds a keyword in a lexicon.
Example: I spotted an unrelated word in "lexicon_person.txt", but it was hard to locate with cmd+f or by eye, so this script gives me its line number.
- The keywords I used to create the lexicons are under the "Keywords" folder. I used 13 keyword files in my regexes.
- Whenever I saw a general prefix or suffix, I added it to the related keyword file. All keyword files were filled in by hand.
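As a rough sketch of how a keyword file might feed the regexes, the snippet below parses a keyword list and builds one prefix pattern per keyword. The one-keyword-per-line layout, the function names, and the keyword "ortağı" are assumptions made for illustration. Building one pattern per keyword is a deliberate choice here: Python's re module only accepts fixed-width lookbehinds, so keywords of different lengths cannot share a single "(?<=...)" alternation.

```python
import re

CAPITALIZED = r"[A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*"


def parse_keywords(text):
    """Return the non-empty lines of a keyword file (assumed layout)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def prefix_patterns(keywords):
    """Compile one lookbehind pattern per prefix keyword."""
    return [
        re.compile("(?<=" + re.escape(keyword) + r" )(\s*" + CAPITALIZED + r"){1,6}")
        for keyword in keywords
    ]
```

For example, the pattern built from "rakibi" matches "Arçelik" in "En büyük rakibi Arçelik oldu." and leaves the keyword out of the match.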
I limited each repetition to 6 consecutive words in order to prevent catastrophic backtracking.
Each line is first searched with the keyword regexes and then searched again with the lexicons.
The search order for both keywords and lexicons is: Organization -> Location -> Time -> Person
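The two-pass search order described above could be sketched like this. The matcher interface (a function from a line to a list of (start, end) spans), the helper find_all, and the rule that an already tagged span is not tagged again are all assumptions for illustration, not a description of how ner.py is actually written.

```python
ORDER = ["ORGANIZATION", "LOCATION", "TIME", "PERSON"]


def find_all(word):
    """Toy matcher for the demo: the span of the first occurrence of word."""
    def matcher(line):
        i = line.find(word)
        return [(i, i + len(word))] if i >= 0 else []
    return matcher


def tag_line(line, keyword_matchers, lexicon_matchers):
    """Run the keyword pass, then the lexicon pass, each in ORDER."""
    tagged = []  # (start, end, label)

    def overlaps(start, end):
        return any(start < t_end and t_start < end for t_start, t_end, _ in tagged)

    for matchers in (keyword_matchers, lexicon_matchers):
        for label in ORDER:
            matcher = matchers.get(label)
            if matcher is None:
                continue
            for start, end in matcher(line):
                # Assumed rule: earlier passes and categories win.
                if not overlaps(start, end):
                    tagged.append((start, end, label))
    return sorted(tagged)
```

With this ordering, a token found in both the location and the person lexicon is tagged as a location, because Location is searched before Person.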
Organization:
I used these regexes with the keywords:
1: "(?<=" + pre_organization + r" )(\s*([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*|ve)){1,6}"
This regex matches organizations that follow a prefix keyword, such as "rakibi Arçelik", where pre_organization is "rakibi". The keyword itself is not included in the match.
Organization names may contain "ve" between capitalized words.
2: "(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*|ve)\s*){1,6}(?=([']\w*\s*)? " + after_organization + r"\w*)"
This regex matches organizations that precede a suffix keyword, such as "Galatasarayın Başkanı", where after_organization is "Başkanı". "Başkanı" is not included in the match.
Organization names may contain "ve" between capitalized words.
3: "(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]+|ve)\s*){1,6}( "+after_organization+r"\w*)"
This regex matches organizations whose name ends with a suffix keyword, such as "Sabanci Üniversitesi", where after_organization is "Üniversite". Here the keyword is included in the match.
Organization names may contain "ve" between capitalized words.
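To make the difference between the two suffix-keyword regexes concrete, here is a small sketch that wraps patterns 2 and 3 in helper functions. The helper names are hypothetical, and "Bakanlığı" is used as an example keyword without knowing whether it is in the actual keyword files.

```python
import re


def before_suffix(after_organization):
    """Pattern 2: the keyword stays outside the match (lookahead)."""
    return re.compile(
        r"(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*|ve)\s*){1,6}(?=([']\w*\s*)? "
        + after_organization + r"\w*)"
    )


def with_suffix(after_organization):
    """Pattern 3: the keyword is part of the match."""
    return re.compile(
        r"(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]+|ve)\s*){1,6}( "
        + after_organization + r"\w*)"
    )


print(before_suffix("Başkanı").search("Galatasarayın Başkanı konuştu").group(0))
print(with_suffix("Üniversite").search("Sabanci Üniversitesi kampüsü").group(0))
print(with_suffix("Bakanlığı").search("Kültür ve Turizm Bakanlığı").group(0))
```

The optional "([']\w*\s*)?" group in pattern 2 also lets an apostrophe-suffixed form such as "Galatasaray'ın Başkanı" match, with only "Galatasaray" captured.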
I used these regexes with the lexicon:
1: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)"
This regex matches every capitalized word so that it can be looked up in the lexicon.
2: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*)+"
This regex matches every run of consecutive capitalized words so that it can be looked up in the lexicon.
3: "(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*)|(ve\s*))+"
This regex matches every run of consecutive capitalized words, which may include "ve", so that it can be looked up in the lexicon.
4: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\.( )?)"
This regex matches every capitalized token that is followed by "." so that it can be looked up in the lexicon.
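A lexicon lookup built on patterns 1 and 2 might look like the sketch below. The fallback from the whole run to single words, the function name, and the sample lexicon entry "Türk Hava Yolları" are assumptions for illustration only.

```python
import re

WORD = re.compile(r"([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)")
SEQUENCE = re.compile(r"([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*)+")


def lexicon_hits(line, lexicon):
    """Look up capitalized runs first, then their individual words."""
    hits = []
    for match in SEQUENCE.finditer(line):
        candidate = match.group(0).strip()  # drop the trailing whitespace
        if candidate in lexicon:
            hits.append(candidate)
        else:
            hits.extend(w for w in WORD.findall(candidate) if w in lexicon)
    return hits
```

Keeping the lexicon in a set makes each membership test cheap, which matters when every capitalized run on every line is checked.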
After all regex operations, if there are still all-uppercase tokens of length 3:
1: " ([A-ZÇĞİÖŞÜ]){3}(?= |'|/')"
I do not check whether the matched token really is an organization, because most 3-letter uppercase tokens are organizations. This regex matches tokens like "YÖK".
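A short sketch of the three-letter rule, using the pattern exactly as listed. The function name is hypothetical. Because the capturing group is repeated, it only holds the last letter, so the sketch reads the whole match with group(0) and strips the leading space.

```python
import re

ACRONYM = re.compile(r" ([A-ZÇĞİÖŞÜ]){3}(?= |'|/')")


def acronyms(line):
    """Return three-letter uppercase tokens preceded by a space."""
    return [match.group(0).strip() for match in ACRONYM.finditer(line)]


print(acronyms("Karar YÖK'ün ve TRT yönetiminin gündeminde"))
```

Since the pattern begins with a space, a token at the very start of a line is not matched, and four-letter tokens such as "TBMM" are skipped because the lookahead fails.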
Location:
I used these regexes with the keywords:
1: "(?<=" + preLocation + r" )(\s*([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)){1,6}"
This regex matches locations that follow a prefix keyword, such as "Başkent Ankara", where preLocation is "Başkent". "Başkent" is not included in the match.
2: "(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)\s*){1,6}(?=([']\w*\s*)? " + after_location + r"\w*)"
This regex matches locations that precede a suffix keyword, such as "İstanbul ilçesi", where after_location is "ilçe". "ilçe" is not included in the match.
3: "(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)\s*){1,6}( " + after_location + r"\w*)"
This regex matches locations whose name ends with a suffix keyword, such as "İstanbul Havaalanı", where after_location is "Havaalanı". Here "Havaalanı" is included in the match.
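The effect of the {1,6} limit is easy to see with the prefix pattern for locations. In this sketch the helper name is hypothetical and the second input is an artificial string of single capital letters, chosen only to show where the match stops.

```python
import re


def after_prefix(preLocation):
    """Pattern 1: capitalized words that follow the prefix keyword."""
    return re.compile(
        "(?<=" + preLocation + r" )(\s*([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)){1,6}"
    )


pattern = after_prefix("Başkent")
print(pattern.search("Başkent Ankara kalabalık").group(0))
print(pattern.search("Başkent A B C D E F G H").group(0))
```

The second search stops after six capitalized tokens, so "G" and "H" are left out of the match.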
I used these regexes with the lexicon:
1: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)"
This regex matches every capitalized word so that it can be looked up in the lexicon.
2: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*)+"
This regex matches every run of consecutive capitalized words so that it can be looked up in the lexicon.
3: "(([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*)|(ve\s*))+"
This regex matches every run of consecutive capitalized words, which may include "ve", so that it can be looked up in the lexicon.
4: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\.( )?)"
This regex matches every capitalized token that is followed by "." so that it can be looked up in the lexicon.
Time:
1: "(\d{1,4} )?("+month+r")( \d{1,4})?"
This regex matches dates like "23 Ekim 1998", where month is "Ekim", so the whole date is captured.
2: "(( )?((-|(\d{1,4}))( )?)*)?("+month+r")"
This regex matches dates like "7 - 8 Ekim", where month is "Ekim", so more than one day is captured.
3: "(" + preDate + r"([']\w*)? )(\d{1,4})([']\w*)?"
This regex matches dates like "M.O. 500", where preDate is "M.O.", so the whole date is captured.
4: "(\d{1,4} )(?=([']\w*)? "+afterDate+r"\w*)"
This regex matches dates like "1998 yılı", where afterDate is "yılı", so only the number is captured.
5: "(" + preTime + r" )(([0-2][0-3])|[0-9])([:.]([0-5][0-9]))?"
This regex matches times like "saat 5", where preTime is "saat".
6: "\d{2}[./-]\d{2}[./-]\d{2,4}"
This regex matches dates like "01.01.2000".
7: "(([A-Z]+)|\d+)(. yüzyıl)"
This regex matches centuries like "5. yüzyıl" or "V. yüzyıl".
8: "(([0-2][0-3])|[0-9])[:.]([0-5][0-9])"
This regex matches times like "23:59".
9: "([12][0-9]{3})(?=[']\w*)?"
This regex matches years like "1997".
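The sketch below runs three of the time patterns (1, 6 and 8) exactly as listed. Joining the twelve Turkish month names into one alternation for month is an assumption, and first_match is a hypothetical helper.

```python
import re

# Assumption: month is an alternation of the Turkish month names.
month = "Ocak|Şubat|Mart|Nisan|Mayıs|Haziran|Temmuz|Ağustos|Eylül|Ekim|Kasım|Aralık"

DATE_WITH_MONTH = re.compile(r"(\d{1,4} )?(" + month + r")( \d{1,4})?")  # pattern 1
NUMERIC_DATE = re.compile(r"\d{2}[./-]\d{2}[./-]\d{2,4}")                # pattern 6
CLOCK = re.compile(r"(([0-2][0-3])|[0-9])[:.]([0-5][0-9])")              # pattern 8


def first_match(pattern, text):
    """Return the text of the first match, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


print(first_match(DATE_WITH_MONTH, "23 Ekim 1998 tarihinde"))
print(first_match(NUMERIC_DATE, "Toplantı 01.01.2000 günü yapıldı"))
print(first_match(CLOCK, "23:59"))
print(first_match(CLOCK, "18:30"))
```

One thing this surfaces: with pattern 8 as written, the two-digit hour alternative "[0-2][0-3]" only covers 00-03, 10-13 and 20-23, so "18:30" is matched as "8:30". The same hour sub-pattern appears in pattern 5.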
Person:
I used these regexes with the keywords:
1: "(?<=" + prePerson + r" )(\s*[A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*){1,6}"
This regex matches person names that follow a prefix keyword, such as "sayın Cavit", where prePerson is "sayın". "sayın" is not included in the match.
2: "(" + r"(III\.|I\.|II\.|IV\.|V\.|VI\.)" + r" )(\s*[A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*){1,6}"
This regex matches person names that start with a Roman ordinal, such as "V. Cavit".
3: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*){1,6}(?=([']\w*)? " + afterPerson + r"\w*)"
This regex matches person names that precede a suffix keyword, such as "Cavit Bey", where afterPerson is "Bey". "Bey" is not included in the match.
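As a quick illustration of patterns 2 and 3 for persons, the sketch below applies them to the examples from the descriptions. The names ORDINAL_NAME and before_title are hypothetical.

```python
import re

# Pattern 2: Roman ordinal, a space, then up to six capitalized words.
ORDINAL_NAME = re.compile(
    r"(" + r"(III\.|I\.|II\.|IV\.|V\.|VI\.)" + r" )(\s*[A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*){1,6}"
)


def before_title(afterPerson):
    """Pattern 3: the title keyword stays outside the match."""
    return re.compile(
        r"([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*){1,6}(?=([']\w*)? " + afterPerson + r"\w*)"
    )


print(ORDINAL_NAME.search("V. Cavit tahta çıktı").group(0))
print(before_title("Bey").search("Cavit Bey geldi").group(0))
```

Unlike pattern 3, pattern 2 keeps the ordinal in the matched text, so "V. Cavit" is returned as a whole.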
I used these regexes with the lexicon:
1: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*)"
This regex matches every capitalized word so that it can be looked up in the lexicon.
2: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\s*)+"
This regex matches every run of consecutive capitalized words so that it can be looked up in the lexicon.
3: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*((')|\s*))+"
This regex matches every run of consecutive capitalized words, which may be joined by "'" or whitespace, so that it can be looked up in the lexicon.
4: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*[']?)+"
This regex matches every run of capitalized words separated by "'" so that it can be looked up in the lexicon.
5: "([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*\.( )?)"
This regex matches every capitalized token that is followed by "." so that it can be looked up in the lexicon.
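Pattern 4 keeps a trailing apostrophe when a Turkish case suffix follows the name (for example "Cavit'in"), so a lookup probably needs some normalization first. The sketch below strips that apostrophe before checking the lexicon; this step, the function names, and the sample name "O'Neill" are assumptions for illustration and may not reflect what ner.py does.

```python
import re

APOSTROPHE_NAME = re.compile(r"([A-ZÇĞİÖŞÜ]+[a-zçğıöşü]*[']?)+")


def person_candidates(line):
    """Capitalized runs joined by apostrophes, without a trailing one."""
    return [m.group(0).rstrip("'") for m in APOSTROPHE_NAME.finditer(line)]


def known_persons(line, lexicon):
    """Keep only the candidates that appear in the person lexicon."""
    return [c for c in person_candidates(line) if c in lexicon]


print(person_candidates("dün Cavit'in evinde O'Neill konuştu"))
```

The lowercase suffix "in" ends the match, which is why only "Cavit" plus the apostrophe is captured before normalization.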