src.dackar.text_processing.SpellChecker
Created on October, 2022
@author: mandd, wangc
Classes
SpellChecker: Object to find misspelled words and automatically correct spelling
Module Contents
- class src.dackar.text_processing.SpellChecker.SpellChecker(checker='autocorrect')
 Bases: object
 Object to find misspelled words and automatically correct spelling.
 Note: when using autocorrect, one needs to conduct a spell test to identify the threshold (on word frequencies).
- addWordsToDictionary(words)
 Adds a list of words to the spell check dictionary
- Parameters:
 words – list, list of words to add to the dictionary
- Returns:
 None
- getMisspelledWords(text)
 Returns a list of words that are misspelled according to the dictionary used
- Parameters:
 text – str, string of text that will be analyzed
- Returns:
 misspelled – list, list of misspelled words
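The interplay between addWordsToDictionary and getMisspelledWords can be illustrated with a minimal, self-contained sketch. The class below is a hypothetical stand-in written for this page, not the DACKAR implementation: it assumes the dictionary is a plain set of lowercase words and that the text is tokenized with a simple regular expression.

```python
import re


class ToySpellChecker:
    """Hypothetical stand-in for illustration only (not the DACKAR class)."""

    def __init__(self, known_words):
        # Assumption: the dictionary is a set of lowercase words.
        self._known = {w.lower() for w in known_words}

    def add_words_to_dictionary(self, words):
        # Extend the dictionary so these words are no longer flagged.
        self._known.update(w.lower() for w in words)

    def get_misspelled_words(self, text):
        # Flag every token that is absent from the dictionary,
        # keeping first-seen order and dropping duplicates.
        misspelled = []
        for token in re.findall(r"[A-Za-z']+", text):
            low = token.lower()
            if low not in self._known and low not in misspelled:
                misspelled.append(low)
        return misspelled


checker = ToySpellChecker(["the", "pump", "was", "leaking"])
before = checker.get_misspelled_words("The pmup was leaking RCIC")
checker.add_words_to_dictionary(["RCIC"])
after = checker.get_misspelled_words("The pmup was leaking RCIC")
print(before)
print(after)
```

Adding domain-specific terms (here the acronym RCIC) to the dictionary keeps them from being reported as misspellings on later calls.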
- correct(text)
 Performs automatic spelling correction and returns corrected text
- Parameters:
 text – str, string of text that will be analyzed
- Returns:
 corrected – str, spelling corrected text
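To show why a word-frequency threshold matters for automatic correction, here is a rough sketch of a frequency-based corrector. It is not the algorithm used by this class or by any particular library: the frequency table, the function names, and the single-edit candidate search are all assumptions made for illustration.

```python
import re

# Hypothetical word frequencies; "pmup" stands for a rare typo that made
# it into the frequency table.
WORD_FREQ = {"pump": 120, "valve": 95, "leak": 60, "pmup": 1}
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word, threshold):
    # Only words at or above the threshold count as correctly spelled.
    known = {w for w, n in WORD_FREQ.items() if n >= threshold}
    if word in known:
        return word
    candidates = edits1(word) & known
    if not candidates:
        return word  # nothing plausible found, leave the word unchanged
    return max(candidates, key=lambda w: WORD_FREQ[w])


def correct_text(text, threshold=5):
    # For simplicity this sketch lowercases the text before correcting it.
    return re.sub(r"[a-z]+", lambda m: correct_word(m.group(0), threshold),
                  text.lower())


print(correct_text("The pmup has a leek", threshold=5))
print(correct_text("The pmup has a leek", threshold=1))
```

With the higher threshold the rare entry "pmup" is no longer trusted and gets corrected; with the lower one it is accepted as a valid word. A spell test on representative text helps pick a value between these extremes.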
- handleAbbreviations(abbrDatabase, text, type)
 Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from the literature and provides, for each abbreviation, its corresponding full word(s); an abbreviation may have multiple words associated with it. In that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).
- Parameters:
 abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression
 text – str, string of text that will be analyzed
 type – string, type of abbreviation method (‘spellcheck’, ‘hard’, ‘mixed’) employed to determine which words are abbreviations that need to be expanded
   * spellcheck: the spellchecker is used to identify words that are not recognized
   * hard: the abbreviations are searched for directly in the provided sentence
   * mixed: a “hard” search is performed first, followed by a “spellcheck” search
- Returns:
 options – list, list of corrected text options
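The three values of the type argument differ in how candidate abbreviations are detected. The sketch below is one possible reading of the descriptions above, not the DACKAR code: the function name, the whitespace tokenization, and the use of plain sets for the abbreviation keys and the spelling dictionary are assumptions.

```python
def find_candidates(text, abbr_keys, known_words, mode):
    """Return the words that would be treated as abbreviations to expand."""
    words = text.lower().split()
    # "hard": the word is listed as an abbreviation.
    hard_hits = [w for w in words if w in abbr_keys]
    # "spellcheck": the word is not recognized by the spelling dictionary.
    spell_hits = [w for w in words if w not in known_words]
    if mode == "hard":
        return hard_hits
    if mode == "spellcheck":
        return spell_hits
    if mode == "mixed":
        # Hard search first, then add unrecognized words not already found.
        return hard_hits + [w for w in spell_hits if w not in hard_hits]
    raise ValueError(f"unknown mode: {mode!r}")


abbr_keys = {"pmp", "ct"}                        # hypothetical abbreviations
known_words = {"the", "ct", "tripped", "near"}   # hypothetical dictionary
text = "The pmp tripped near ct xfmr"

for mode in ("spellcheck", "hard", "mixed"):
    print(mode, find_candidates(text, abbr_keys, known_words, mode))
```

In this example "ct" is accepted by the spelling dictionary, so only the hard search finds it, while "xfmr" is not listed as an abbreviation, so only the spellcheck search flags it. The mixed mode reports both.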
- generateAbbrDict(abbrDatabase)
 Generates an AbbrDict that can be used by handleAbbreviationsDict
- Parameters:
 abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression
- Returns:
 abbrDict – dictionary, an abbreviations dictionary
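As an illustration of what such a dictionary might look like, the sketch below groups full expressions by abbreviation. To stay dependency-free it uses a list of row dictionaries in place of a pandas dataframe, and the column names "Abbreviation" and "Full" are hypothetical; the actual layout of the database may differ.

```python
def generate_abbr_dict(rows):
    """Map each abbreviation to the list of its full expressions."""
    abbr_dict = {}
    for row in rows:
        key = row["Abbreviation"].lower()
        # One abbreviation may map to several full expressions.
        abbr_dict.setdefault(key, []).append(row["Full"])
    return abbr_dict


# Hypothetical rows standing in for the abbreviation database.
rows = [
    {"Abbreviation": "pmp", "Full": "pump"},
    {"Abbreviation": "CT", "Full": "current transformer"},
    {"Abbreviation": "ct", "Full": "cooling tower"},
]

abbr_dict = generate_abbr_dict(rows)
print(abbr_dict)
```

Building the dictionary once and reusing it avoids scanning the full table for every sentence that is processed.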
- handleAbbreviationsDict(abbrDict, text, type)
 Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from the literature and provides, for each abbreviation, its corresponding full word(s); an abbreviation may have multiple words associated with it. In that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).
- Parameters:
 abbrDict – dictionary, dictionary containing library of abbreviations and their corresponding full expression
 text – str, string of text that will be analyzed
 type – string, type of abbreviation method (‘spellcheck’, ‘hard’, ‘mixed’) employed to determine which words are abbreviations that need to be expanded
   * spellcheck: the spellchecker is used to identify words that are not recognized
   * hard: the abbreviations are searched for directly in the provided sentence
   * mixed: a “hard” search is performed first, followed by a “spellcheck” search
- Returns:
 options – list, list of corrected text options
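Because an abbreviation can map to several full expressions, the result is a list of candidate sentences rather than a single string. The sketch below shows one way such a list could be enumerated with a direct dictionary lookup. It is an assumption-laden illustration, not the library's method: the function name is hypothetical, and no ranking of the options by context is attempted.

```python
from itertools import product


def expand_abbreviations(text, abbr_dict):
    """Return every sentence obtained by expanding the known abbreviations."""
    # Each word contributes its list of expansions, or itself if it is
    # not a key of the abbreviation dictionary.
    choices = [abbr_dict.get(word.lower(), [word]) for word in text.split()]
    return [" ".join(combo) for combo in product(*choices)]


# Hypothetical abbreviation dictionary.
abbr_dict = {
    "pmp": ["pump"],
    "ct": ["current transformer", "cooling tower"],
}

options = expand_abbreviations("pmp near ct", abbr_dict)
for option in options:
    print(option)
```

The number of options is the product of the number of expansions per word, so sentences with several ambiguous abbreviations can grow quickly; a context-based selection step such as the one referred to above narrows them down.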