src.dackar.text_processing.Preprocessing
Created on October, 2022
@author: dgarrett622, wangc, mandd
Attributes
Classes
Preprocessing | NLP Preprocessing class
SpellChecker | Object to find misspelled words and automatically correct spelling
Class to expand abbreviations
Module Contents
- src.dackar.text_processing.Preprocessing.textacyNormalize = ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'unicode', 'whitespace']
- src.dackar.text_processing.Preprocessing.textacyRemove = ['accents', 'brackets', 'html_tags', 'punctuation']
- src.dackar.text_processing.Preprocessing.textacyReplace = ['currency_symbols', 'emails', 'emojis', 'hashtags', 'numbers', 'phone_numbers', 'urls', 'user_handles']
- src.dackar.text_processing.Preprocessing.preprocessorDefaultList = ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'whitespace',...
- class src.dackar.text_processing.Preprocessing.Preprocessing(preprocessorList=preprocessorDefaultList, preprocessorOptions=preprocessorDefaultOptions)
Bases: object
NLP Preprocessing class
- createTextacyNormalizeFunction(name, options)
Creates a function from textacy.preprocessing.normalize such that the only argument is a string and adds it to the functionList
- Parameters:
name – str, name of the preprocessor
options – dict, dictionary of preprocessor options
- Returns:
None
- createTextacyRemoveFunction(name, options)
Creates a function from textacy.preprocessing.remove such that the only argument is a string and adds it to the functionList
- Parameters:
name – str, name of the preprocessor
options – dict, dictionary of preprocessor options
- Returns:
None
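The two factory methods above describe one pattern: bind a preprocessor's options so the resulting callable takes only a string, then add it to a function list. A minimal sketch of that pattern follows, using only the standard library. `repeating_chars` and `collapse_whitespace` are hypothetical stand-ins, not the textacy functions, and `make_single_arg` and `run_pipeline` are illustrative names that do not appear in this module.

```python
import re
from functools import partial


def repeating_chars(text, *, chars, maxn=1):
    """Hypothetical stand-in: cap runs of `chars` at `maxn` repetitions."""
    pattern = r"(?:%s){%d,}" % (re.escape(chars), maxn + 1)
    return re.sub(pattern, lambda _match: chars * maxn, text)


def collapse_whitespace(text):
    """Hypothetical stand-in: squeeze whitespace runs into single spaces."""
    return " ".join(text.split())


def make_single_arg(func, options):
    """Bind keyword options so the result takes only the text string."""
    return partial(func, **options)


# Assumed shape of the function list: callables applied in order.
function_list = [
    make_single_arg(repeating_chars, {"chars": ".", "maxn": 1}),
    make_single_arg(collapse_whitespace, {}),
]


def run_pipeline(text, functions):
    """Apply each single-argument preprocessor in turn."""
    for func in functions:
        text = func(text)
    return text


cleaned = run_pipeline("Pump  failed....   restart", function_list)
print(cleaned)  # Pump failed. restart
```

Binding the options up front keeps the pipeline loop trivial: every entry has the same one-argument shape regardless of how many options the underlying preprocessor accepts.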
- class src.dackar.text_processing.Preprocessing.SpellChecker(checker='autocorrect')
Bases: object
Object to find misspelled words and automatically correct spelling
Note: when using autocorrect, one needs to conduct a spell test to identify the threshold (the word frequencies)
- addWordsToDictionary(words)
Adds a list of words to the spell check dictionary
- Parameters:
words – list, list of words to add to the dictionary
- Returns:
None
- getMisspelledWords(text)
Returns a list of words that are misspelled according to the dictionary used
- Parameters:
text – str, string of text that will be analyzed
- Returns:
misspelled – list, list of misspelled words
- correct(text)
Performs automatic spelling correction and returns corrected text
- Parameters:
text – str, string of text that will be analyzed
- Returns:
corrected – str, spelling corrected text
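To illustrate how the three operations above fit together (extend the dictionary, list unknown words, correct them), here is a toy checker built on the standard library. It is a sketch under stated assumptions, not how SpellChecker or the autocorrect backend works: `ToySpellChecker`, its method names, and the 0.8 similarity cutoff are all hypothetical.

```python
import difflib
import re

WORD = re.compile(r"[A-Za-z]+")


class ToySpellChecker:
    """Hypothetical dictionary-based checker for illustration only."""

    def __init__(self, words):
        self.dictionary = {w.lower() for w in words}

    def add_words(self, words):
        """Add domain terms so they are no longer flagged."""
        self.dictionary.update(w.lower() for w in words)

    def misspelled(self, text):
        """Return the tokens that are not in the dictionary."""
        return [t for t in WORD.findall(text) if t.lower() not in self.dictionary]

    def correct(self, text):
        """Replace each unknown token with its closest dictionary word, if any."""
        candidates = sorted(self.dictionary)

        def fix(match):
            word = match.group(0)
            if word.lower() in self.dictionary:
                return word
            close = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=0.8)
            return close[0] if close else word

        return WORD.sub(fix, text)


checker = ToySpellChecker(["the", "pump", "was", "leaking"])
flagged_before = checker.misspelled("the RCP pump was leakng")
checker.add_words(["RCP"])  # domain acronym added to the dictionary
flagged_after = checker.misspelled("the RCP pump was leakng")
fixed = checker.correct("the RCP pump was leakng")
print(flagged_before, flagged_after, fixed)
```

Adding domain vocabulary before correcting matters in technical text: without it, acronyms such as "RCP" would be flagged and possibly rewritten.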
- handleAbbreviations(abbrDatabase, text, type)
Performs automatic correction of abbreviations and returns corrected text.
This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx
This database contains the most common abbreviations collected from the literature, and for each abbreviation it provides the corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).
- Parameters:
abbrDatabase – pandas dataframe, dataframe containing the library of abbreviations and their corresponding full expressions
text – str, string of text that will be analyzed
type – str, type of abbreviation method ('spellcheck', 'hard', 'mixed') employed to determine which words are abbreviations that need to be expanded:
* spellcheck – the spellchecker is used to identify words that are not recognized
* hard – the abbreviations are searched for directly in the provided sentence
* mixed – a "hard" search is performed first, followed by a "spellcheck" search
- Returns:
options – list, list of corrected text options
- generateAbbrDict(abbrDatabase)
Generates an AbbrDict that can be used by handleAbbreviationsDict
- Parameters:
abbrDatabase – pandas dataframe, dataframe containing the library of abbreviations and their corresponding full expressions
- Returns:
abbrDict – dictionary, an abbreviations dictionary
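The conversion from a two-column abbreviation table to a lookup dictionary can be sketched as follows. This is an assumption-laden illustration, not the module's code: the rows are plain `(abbreviation, full expression)` tuples standing in for dataframe rows, the lowercasing and de-duplication are choices made for this example, and `build_abbr_dict` is a hypothetical name.

```python
from collections import defaultdict


def build_abbr_dict(rows):
    """Map each abbreviation to the list of its full expressions.

    `rows` is an iterable of (abbreviation, full_expression) pairs; this
    layout is assumed for the example and may differ from the real database.
    """
    abbr_dict = defaultdict(list)
    for abbr, full in rows:
        key = abbr.strip().lower()
        value = full.strip().lower()
        if value not in abbr_dict[key]:  # keep each expansion once, in order
            abbr_dict[key].append(value)
    return dict(abbr_dict)


rows = [
    ("temp", "temperature"),
    ("TEMP", "temporary"),
    ("pmp", "pump"),
    ("temp", "temperature"),  # duplicate row is ignored
]
abbr_dict = build_abbr_dict(rows)
print(abbr_dict)
```

Building the dictionary once and reusing it avoids scanning the table for every sentence, which is presumably why a dictionary-based variant of the expansion method exists.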
- handleAbbreviationsDict(abbrDict, text, type)
Performs automatic correction of abbreviations and returns corrected text.
This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx
This database contains the most common abbreviations collected from the literature, and for each abbreviation it provides the corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).
- Parameters:
abbrDict – dictionary, dictionary containing the library of abbreviations and their corresponding full expressions
text – str, string of text that will be analyzed
type – str, type of abbreviation method ('spellcheck', 'hard', 'mixed') employed to determine which words are abbreviations that need to be expanded:
* spellcheck – the spellchecker is used to identify words that are not recognized
* hard – the abbreviations are searched for directly in the provided sentence
* mixed – a "hard" search is performed first, followed by a "spellcheck" search
- Returns:
options – list, list of corrected text options
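The difference between the "hard" and "spellcheck" modes, and why the result is a list of options, can be shown with a toy expander. It is a sketch only: `expand_abbreviations`, the `known_words` set standing in for a spellchecker, and the whitespace tokenization are all assumptions, and the "mixed" mode and context-based selection are not covered.

```python
from itertools import product


def expand_abbreviations(abbr_dict, text, mode="hard", known_words=frozenset()):
    """Return every candidate sentence obtained by expanding abbreviations.

    hard:       expand any token found in abbr_dict.
    spellcheck: expand only tokens that are also absent from known_words.
    """
    if mode not in ("hard", "spellcheck"):
        raise ValueError("this sketch only covers 'hard' and 'spellcheck'")
    per_token = []
    for token in text.split():
        key = token.lower()
        expandable = key in abbr_dict
        if mode == "spellcheck":
            expandable = expandable and key not in known_words
        per_token.append(abbr_dict[key] if expandable else [token])
    # One option per combination of expansions.
    return [" ".join(combo) for combo in product(*per_token)]


abbr_dict = {
    "pmp": ["pump"],
    "temp": ["temperature", "temporary"],
    "in": ["inch"],
}
sentence = "pmp temp high in loop"
hard_options = expand_abbreviations(abbr_dict, sentence, mode="hard")
spell_options = expand_abbreviations(
    abbr_dict, sentence, mode="spellcheck", known_words={"high", "in", "loop"}
)
print(hard_options)
print(spell_options)
```

In this example the "hard" search rewrites the ordinary word "in" as "inch", while the "spellcheck" filter leaves it alone because it is a recognized word. Both modes return two options because "temp" has two possible expansions.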