src.dackar.text_processing.SpellChecker

Created on October, 2022

@author: mandd, wangc

Attributes

logger

Classes

SpellChecker

Object to find misspelled words and automatically correct spelling

Module Contents

src.dackar.text_processing.SpellChecker.logger
class src.dackar.text_processing.SpellChecker.SpellChecker(checker='autocorrect')

Bases: object

Object to find misspelled words and automatically correct spelling

Note: when using autocorrect, one needs to conduct a spell test to identify the threshold (the word frequencies).

checker = ''
addedWords = []
includedWords = []
addWordsToDictionary(words)

Adds a list of words to the spell check dictionary

Parameters:

words – list, list of words to add to the dictionary

Returns:

None

getMisspelledWords(text)

Returns a list of words that are misspelled according to the dictionary used

Parameters:

text – str, string of text that will be analyzed

Returns:

list, list of misspelled words

Return type:

misspelled
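
The idea behind addWordsToDictionary and getMisspelledWords can be illustrated with a minimal, self-contained sketch. It is not the SpellChecker implementation: the function get_misspelled_words, the tokenizing regular expression, and the tiny dictionary are all assumptions made for illustration. It only shows how adding domain words to the dictionary changes which words are reported as misspelled.

```python
import re

def get_misspelled_words(text, dictionary, added_words=()):
    """Return the words in `text` found in neither `dictionary` nor `added_words`.

    Hypothetical stand-in for a dictionary-based spell check; the comparison
    is case-insensitive and each misspelled word is reported once.
    """
    known = {w.lower() for w in dictionary} | {w.lower() for w in added_words}
    misspelled = []
    for token in re.findall(r"[A-Za-z']+", text):
        if token.lower() not in known and token not in misspelled:
            misspelled.append(token)
    return misspelled

# Toy dictionary (assumption); a real checker ships a full word list.
base_dictionary = {"the", "pump", "was", "found", "leaking", "near"}
text = "The pmup was found leaking near the RCP"

print(get_misspelled_words(text, base_dictionary))
print(get_misspelled_words(text, base_dictionary, added_words=["RCP"]))
```

In this sketch the domain acronym "RCP" is flagged until it is added to the dictionary, which is the situation addWordsToDictionary is meant to handle.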

correct(text)

Performs automatic spelling correction and returns corrected text

Parameters:

text – str, string of text that will be analyzed

Returns:

str, spelling corrected text

Return type:

corrected

handleAbbreviations(abbrDatabase, text, type)

Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from literature, and it provides for each abbreviation its corresponding full word(s). An abbreviation might have multiple words associated with it; in such a case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).

Parameters:
  • abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression

  • text – str, string of text that will be analyzed

  • type – string, type of abbreviation method (‘spellcheck’, ‘hard’, ‘mixed’) that is employed to determine which words are abbreviations that need to be expanded

      • spellcheck – in this case the spellchecker is used to identify words that are not recognized

      • hard – here we directly search for the abbreviations in the provided sentence

      • mixed – here we first perform a “hard” search followed by a “spellcheck” search

Returns:

list, list of corrected text options

Return type:

options
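
The "hard" search described above returns a list of options because one abbreviation may map to several full expressions. The following is a minimal sketch of that idea, not the method's actual code: the function expand_abbreviations_hard, the whitespace tokenization, and the sample dictionary are assumptions for illustration.

```python
from itertools import product

def expand_abbreviations_hard(abbr_dict, text):
    """Return every sentence obtained by replacing known abbreviations.

    Hypothetical helper: `abbr_dict` maps a lowercase abbreviation to a list
    of full expressions. Words without an entry are kept unchanged.
    """
    choices = []
    for word in text.split():
        # A word that is not an abbreviation has itself as its only choice.
        choices.append(abbr_dict.get(word.lower(), [word]))
    # One output sentence per combination of choices.
    return [" ".join(combo) for combo in product(*choices)]

# Sample entries (assumption); "cal" is deliberately ambiguous.
abbr_dict = {"pmp": ["pump"], "cal": ["calibration", "calculation"]}
options = expand_abbreviations_hard(abbr_dict, "pmp cal completed")
print(options)
```

Each ambiguous abbreviation multiplies the number of candidate sentences, which is why a separate step is needed to pick the one that fits the context.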

generateAbbrDict(abbrDatabase)

Generates an AbbrDict that can be used by handleAbbreviationsDict

Parameters:
  • abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression

Returns:

dictionary, an abbreviations dictionary

Return type:

abbrDict
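
As a rough illustration of what such a dictionary could look like, the sketch below groups full expressions by abbreviation. It is an assumption-laden stand-in: plain (abbreviation, full expression) tuples replace the pandas dataframe rows, and the function generate_abbr_dict and the lowercase normalization are hypothetical choices, not the documented behavior.

```python
from collections import defaultdict

def generate_abbr_dict(rows):
    """Group full expressions by abbreviation.

    Hypothetical helper: `rows` is an iterable of (abbreviation, full
    expression) pairs standing in for the rows of the abbreviation database.
    """
    abbr_dict = defaultdict(list)
    for abbreviation, full in rows:
        key = abbreviation.strip().lower()
        value = full.strip().lower()
        # Keep each full expression once per abbreviation.
        if value not in abbr_dict[key]:
            abbr_dict[key].append(value)
    return dict(abbr_dict)

# Sample rows (assumption), including a duplicate and an ambiguous entry.
rows = [
    ("CAL", "calibration"),
    ("cal", "calculation"),
    ("PMP", "pump"),
    ("pmp", "pump"),
]
abbr_dict = generate_abbr_dict(rows)
print(abbr_dict)
```

Building the mapping once and reusing it avoids scanning the whole table for every sentence that is processed.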

handleAbbreviationsDict(abbrDict, text, type)

Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from literature, and it provides for each abbreviation its corresponding full word(s). An abbreviation might have multiple words associated with it; in such a case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).

Parameters:
  • abbrDict – dictionary, dictionary containing library of abbreviations and their corresponding full expression

  • text – str, string of text that will be analyzed

  • type – string, type of abbreviation method (‘spellcheck’, ‘hard’, ‘mixed’) that is employed to determine which words are abbreviations that need to be expanded

      • spellcheck – in this case the spellchecker is used to identify words that are not recognized

      • hard – here we directly search for the abbreviations in the provided sentence

      • mixed – here we first perform a “hard” search followed by a “spellcheck” search

Returns:

list, list of corrected text options

Return type:

options

findOptimalOption(options)

Method to handle abbreviation with multiple meanings

Parameters:

options – list, list of sentence options

Returns:

string, the option from the provided options list that makes the most sense given the context

Return type:

optimalOpt
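
This page does not state how the best-fitting option is measured. Purely as an illustration of selecting one sentence from a list of candidates, the sketch below scores each option by how many of its adjacent word pairs appear in a small reference set and returns the highest-scoring one. The bigram scoring, the function find_optimal_option, and the known_bigrams set are all assumptions swapped in for the unspecified criterion.

```python
def find_optimal_option(options, known_bigrams):
    """Return the option whose adjacent word pairs best match `known_bigrams`.

    Hypothetical scoring: one point per adjacent word pair found in the
    reference set. Ties are resolved in favor of the earliest option.
    """
    def score(sentence):
        words = sentence.lower().split()
        return sum((a, b) in known_bigrams for a, b in zip(words, words[1:]))

    return max(options, key=score)

# Reference word pairs (assumption), e.g. collected from domain text.
known_bigrams = {("pump", "calibration"), ("calibration", "completed")}
options = ["pump calculation completed", "pump calibration completed"]
best = find_optimal_option(options, known_bigrams)
print(best)
```

Any scoring function with the same shape (sentence in, number out) could be substituted without changing the selection step.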