src.dackar.text_processing.SpellChecker
Created on October, 2022
@author: mandd, wangc
Classes
SpellChecker | Object to find misspelled words and automatically correct spelling
Module Contents
- class src.dackar.text_processing.SpellChecker.SpellChecker(checker='autocorrect')
Bases: object
Object to find misspelled words and automatically correct spelling.
Note: when using autocorrect, one needs to conduct a spell test to identify the threshold (the word frequencies).
- addWordsToDictionary(words)
Adds a list of words to the spell check dictionary
- Parameters:
words – list, list of words to add to the dictionary
- Returns:
None
- getMisspelledWords(text)
Returns a list of words that are misspelled according to the dictionary used
- Parameters:
text – str, string of text that will be analyzed
- Returns:
misspelled – list, list of misspelled words
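The dictionary-based check behind addWordsToDictionary and getMisspelledWords can be illustrated with a minimal, self-contained sketch. The class `ToyChecker` and its vocabulary are hypothetical stand-ins introduced only for illustration; this is not the SpellChecker implementation, which delegates to the checker selected in the constructor.

```python
import re


class ToyChecker:
    """Hypothetical stand-in for a dictionary-based spell checker."""

    def __init__(self, vocabulary):
        # Store known words in lowercase for case-insensitive lookup
        self.vocabulary = {w.lower() for w in vocabulary}

    def addWordsToDictionary(self, words):
        # Extend the known vocabulary, e.g. with domain-specific terms
        self.vocabulary.update(w.lower() for w in words)

    def getMisspelledWords(self, text):
        # Tokenize on alphabetic runs; keep tokens missing from the vocabulary
        tokens = re.findall(r"[A-Za-z]+", text)
        return [t for t in tokens if t.lower() not in self.vocabulary]


checker = ToyChecker(["the", "pump", "is", "leaking"])
sentence = "The pump impeller is leaking"
before = checker.getMisspelledWords(sentence)
checker.addWordsToDictionary(["impeller"])
after = checker.getMisspelledWords(sentence)
print(before, after)
```

In this sketch a domain term is flagged until it is added to the vocabulary, which is the typical reason for extending the dictionary before checking technical text.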
- correct(text)
Performs automatic spelling correction and returns corrected text
- Parameters:
text – str, string of text that will be analyzed
- Returns:
corrected – str, spelling corrected text
- handleAbbreviations(abbrDatabase, text, type)
Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from literature, and it provides for each abbreviation its corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see findOptimalOption method).
- Parameters:
abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression
text – str, string of text that will be analyzed
type – str, type of abbreviation method ('spellcheck', 'hard', 'mixed') that is employed to determine which words are abbreviations that need to be expanded
* spellcheck – in this case the spellchecker is used to identify words that are not recognized
* hard – here we directly search for the abbreviations in the provided sentence
* mixed – here we first perform a "hard" search followed by a "spellcheck" search
- Returns:
options – list, list of corrected text options
- generateAbbrDict(abbrDatabase)
Generates an AbbrDict that can be used by handleAbbreviationsDict
- Parameters:
abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression
- Returns:
abbrDict – dictionary, an abbreviations dictionary
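As a rough illustration of what such a dictionary might look like, the sketch below groups full expressions by abbreviation. It is only an assumption about the shape of the result: the helper `build_abbr_dict` is hypothetical, plain (abbreviation, full expression) pairs stand in for the dataframe rows, and the lowercasing and de-duplication are illustrative choices rather than documented behavior.

```python
from collections import defaultdict


def build_abbr_dict(rows):
    """Group full expressions by abbreviation (hypothetical helper).

    rows: iterable of (abbreviation, full_expression) pairs, standing in
    for the rows of the abbreviation dataframe.
    """
    abbr_dict = defaultdict(list)
    for abbr, full in rows:
        key = abbr.strip().lower()
        value = full.strip().lower()
        # An abbreviation may map to several full expressions; skip duplicates
        if value not in abbr_dict[key]:
            abbr_dict[key].append(value)
    return dict(abbr_dict)


rows = [
    ("PMP", "pump"),
    ("VLV", "valve"),
    ("PRES", "pressure"),
    ("PRES", "present"),
    ("pres", "pressure"),
]
abbr_dict = build_abbr_dict(rows)
print(abbr_dict)
```

Keeping a list per abbreviation preserves the case where one abbreviation has several candidate expansions.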
- handleAbbreviationsDict(abbrDict, text, type)
Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from literature, and it provides for each abbreviation its corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see findOptimalOption method).
- Parameters:
abbrDict – dictionary, dictionary containing library of abbreviations and their corresponding full expression
text – str, string of text that will be analyzed
type – str, type of abbreviation method ('spellcheck', 'hard', 'mixed') that is employed to determine which words are abbreviations that need to be expanded
* spellcheck – in this case the spellchecker is used to identify words that are not recognized
* hard – here we directly search for the abbreviations in the provided sentence
* mixed – here we first perform a "hard" search followed by a "spellcheck" search
- Returns:
options – list, list of corrected text options
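The idea of a "hard" search that returns several corrected text options can be sketched as follows. This is a simplified illustration under stated assumptions, not the method's actual logic: the function `expand_hard` is hypothetical, tokens are split on whitespace, and the context-based selection among options (findOptimalOption) is not shown.

```python
from itertools import product


def expand_hard(abbr_dict, text):
    """Return every variant of text with known abbreviations expanded."""
    choices = []
    for token in text.split():
        # A known abbreviation contributes all of its full expressions;
        # any other token contributes only itself
        choices.append(abbr_dict.get(token.lower(), [token]))
    # The Cartesian product enumerates one sentence per combination of expansions
    return [" ".join(combo) for combo in product(*choices)]


abbr_dict = {"pmp": ["pump"], "pres": ["pressure", "present"]}
options = expand_hard(abbr_dict, "pmp pres is low")
print(options)
```

When an abbreviation has more than one expansion, the number of options grows multiplicatively, which is why a later step is needed to pick the option that best fits the context.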