src.dackar.text_processing.SpellChecker

Created on October, 2022

@author: mandd, wangc

Attributes

logger

Classes

SpellChecker

Object to find misspelled words and automatically correct spelling

Module Contents

src.dackar.text_processing.SpellChecker.logger

class src.dackar.text_processing.SpellChecker.SpellChecker(checker='autocorrect')

Bases: object

Object to find misspelled words and automatically correct spelling

Note: when using autocorrect, one needs to run a spell test to identify the threshold (the word frequencies).

checker = ''

addedWords = []

includedWords = []

addWordsToDictionary(words)

Adds a list of words to the spell check dictionary

Parameters: words – list, list of words to add to the dictionary
Returns: None

getMisspelledWords(text)

Returns a list of words that are misspelled according to the dictionary used

Parameters: text – str, string of text to be checked
Returns: misspelled, list of misspelled words
Return type: list
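
As a rough illustration of the dictionary-based check described above, the sketch below uses a hypothetical ToyChecker class with a small hard-coded vocabulary. It is not the SpellChecker implementation: only the method names addWordsToDictionary and getMisspelledWords come from this page, and the tokenization and case handling are assumptions.

```python
import re


class ToyChecker:
    """Hypothetical stand-in that mimics two SpellChecker method names."""

    def __init__(self, vocabulary):
        # Known words are stored lower-cased for case-insensitive lookup.
        self.vocabulary = {w.lower() for w in vocabulary}

    def addWordsToDictionary(self, words):
        # Extend the vocabulary with a list of domain-specific words.
        self.vocabulary.update(w.lower() for w in words)

    def getMisspelledWords(self, text):
        # Tokenize on alphabetic runs and keep tokens not in the vocabulary.
        tokens = re.findall(r"[A-Za-z]+", text)
        return [t for t in tokens if t.lower() not in self.vocabulary]


checker = ToyChecker(["the", "pump", "was", "leaking"])
sentence = "The pump was leakng near the RCP"
before = checker.getMisspelledWords(sentence)
# Registering domain terms removes them from the misspelled list.
checker.addWordsToDictionary(["RCP", "near"])
after = checker.getMisspelledWords(sentence)
print(before)
print(after)
```

Adding domain-specific terms before checking keeps technical vocabulary from being flagged, which is presumably the purpose of addWordsToDictionary.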

correct(text)

Performs automatic spelling correction and returns the corrected text

Parameters: text – str, string of text to be corrected
Returns: corrected, spelling-corrected text
Return type: str

handleAbbreviations(abbrDatabase, text, type)

Performs automatic correction of abbreviations and returns the corrected text. This method relies on a database of abbreviations located at src/nlp/data/abbreviations.xlsx. The database contains the most common abbreviations collected from the literature and provides, for each abbreviation, its corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).

Parameters:

abbrDatabase – pandas dataframe, dataframe containing the library of abbreviations and their corresponding full expressions
text – str, string of text that will be analyzed
type – string, type of abbreviation method ('spellcheck', 'hard', 'mixed') employed to determine which words are abbreviations that need to be expanded
* spellcheck – the spell checker is used to identify words that are not recognized
* hard – the abbreviations are searched for directly in the provided sentence
* mixed – a "hard" search is performed first, followed by a "spellcheck" search

Returns:

options, list of corrected text options

Return type:

list
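
To make the difference between the 'hard' and 'spellcheck' modes concrete, here is a minimal sketch under stated assumptions. The function expand_abbreviations, the abbr_dict mapping, and the known_words set are hypothetical stand-ins, not part of the documented API. The 'mixed' mode is not sketched because this page does not specify how its two passes are combined.

```python
from itertools import product


def expand_abbreviations(abbr_dict, text, known_words, mode="hard"):
    """Return every candidate sentence obtained by expanding abbreviations.

    abbr_dict, known_words, and this function are hypothetical stand-ins.
    """
    choices = []
    for word in text.split():
        key = word.lower()
        is_candidate = key in abbr_dict
        if mode == "spellcheck":
            # Only words the spell checker does not recognize are expanded.
            is_candidate = is_candidate and key not in known_words
        # A candidate contributes all of its full forms; other words stay as-is.
        choices.append(abbr_dict[key] if is_candidate else [word])
    # One option per combination of full forms.
    return [" ".join(combo) for combo in product(*choices)]


abbr_dict = {"in": ["inch"], "temp": ["temperature", "temporary"]}
known_words = {"in", "pump", "high"}
text = "high temp in pump"
hard = expand_abbreviations(abbr_dict, text, known_words, "hard")
spell = expand_abbreviations(abbr_dict, text, known_words, "spellcheck")
print(hard)
print(spell)
```

In this toy setup the 'hard' search expands "in" to "inch" even though "in" is a valid word, while the 'spellcheck' search leaves it alone. Both return more than one option because "temp" has two full forms, which is the situation findOptimalOption is meant to resolve.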

generateAbbrDict(abbrDatabase)

Generates an AbbrDict that can be used by handleAbbreviationsDict

Parameters:

abbrDatabase – pandas dataframe, dataframe containing the library of abbreviations and their corresponding full expressions

Returns:

abbrDict, a dictionary of abbreviations

Return type:

dictionary
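
The layout of the abbreviation dataframe and of the resulting AbbrDict is not documented on this page. The sketch below assumes a mapping from each abbreviation to a list of full expressions, and uses plain (abbreviation, full expression) tuples in place of dataframe rows. The helper build_abbr_dict is hypothetical.

```python
from collections import defaultdict


def build_abbr_dict(rows):
    """Group full expressions by abbreviation.

    rows is an iterable of (abbreviation, full expression) pairs, used here
    as an assumed stand-in for the rows of the abbreviation dataframe.
    """
    abbr_dict = defaultdict(list)
    for abbr, full in rows:
        # Normalize case and whitespace so lookups are consistent.
        key = abbr.strip().lower()
        value = full.strip().lower()
        # Skip duplicate entries while preserving insertion order.
        if value not in abbr_dict[key]:
            abbr_dict[key].append(value)
    return dict(abbr_dict)


rows = [
    ("temp", "temperature"),
    ("Temp", "temporary"),
    ("vlv", "valve"),
    ("temp", "temperature"),
]
abbr_dict = build_abbr_dict(rows)
print(abbr_dict)
```

Building the dictionary once and reusing it avoids scanning the table for every sentence, which is one plausible reason to offer handleAbbreviationsDict alongside handleAbbreviations.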

handleAbbreviationsDict(abbrDict, text, type)

Performs automatic correction of abbreviations and returns the corrected text. This method relies on a database of abbreviations located at src/nlp/data/abbreviations.xlsx. The database contains the most common abbreviations collected from the literature and provides, for each abbreviation, its corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).

Parameters:

abbrDict – dictionary, dictionary containing the library of abbreviations and their corresponding full expressions
text – str, string of text that will be analyzed
type – string, type of abbreviation method ('spellcheck', 'hard', 'mixed') employed to determine which words are abbreviations that need to be expanded
* spellcheck – the spell checker is used to identify words that are not recognized
* hard – the abbreviations are searched for directly in the provided sentence
* mixed – a "hard" search is performed first, followed by a "spellcheck" search

Returns:

options, list of corrected text options

Return type:

list

findOptimalOption(options)

Method to handle abbreviations with multiple meanings

Parameters: options – list, list of sentence options
Returns: optimalOpt, the option from the provided options list that best fits the context
Return type: string
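
This page does not describe how the best option is selected. Purely as an illustration of picking one sentence out of several candidates, the sketch below scores each option by how many of its adjacent word pairs appear in a small reference set. The bigram heuristic, the function pick_best_option, and the known_bigrams set are all assumptions and should not be read as the method used by findOptimalOption.

```python
def pick_best_option(options, known_bigrams):
    """Return the option containing the most known adjacent word pairs.

    A hypothetical scoring heuristic; ties go to the earliest option.
    """
    def score(sentence):
        words = sentence.lower().split()
        # Count adjacent word pairs that appear in the reference set.
        return sum((a, b) in known_bigrams for a, b in zip(words, words[1:]))

    return max(options, key=score)


known_bigrams = {("high", "temperature"), ("temperature", "alarm")}
options = ["high temporary alarm", "high temperature alarm"]
best = pick_best_option(options, known_bigrams)
print(best)
```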