src.dackar.text_processing.SpellChecker
=======================================

.. py:module:: src.dackar.text_processing.SpellChecker

.. autoapi-nested-parse::

   Created on October, 2022

   @author: mandd, wangc


Attributes
----------

.. autoapisummary::

   src.dackar.text_processing.SpellChecker.logger


Classes
-------

.. autoapisummary::

   src.dackar.text_processing.SpellChecker.SpellChecker


Module Contents
---------------

.. py:data:: logger

.. py:class:: SpellChecker(checker='autocorrect')

   Bases: :py:obj:`object`

   Object to find misspelled words and automatically correct spelling.

   Note: when using autocorrect, one needs to conduct a spell test to identify
   the threshold (the word frequencies).


   .. py:attribute:: checker
      :value: ''


   .. py:attribute:: addedWords
      :value: []


   .. py:attribute:: includedWords
      :value: []


   .. py:method:: addWordsToDictionary(words)

      Adds a list of words to the spell check dictionary.

      :param words: list, list of words to add to the dictionary

      :returns: None


   .. py:method:: getMisspelledWords(text)

      Returns a list of words that are misspelled according to the dictionary used.

      :param text: str, text to check for misspelled words

      :returns: misspelled, list of misspelled words
      :rtype: list


   .. py:method:: correct(text)

      Performs automatic spelling correction and returns the corrected text.

      :param text: str, text to correct

      :returns: corrected, spelling-corrected text
      :rtype: str


   .. py:method:: handleAbbreviations(abbrDatabase, text, type)

      Performs automatic correction of abbreviations and returns corrected text.

      This method relies on a database of abbreviations located at
      ``src/nlp/data/abbreviations.xlsx``. This database contains the most common
      abbreviations collected from literature and provides, for each abbreviation,
      its corresponding full word(s); an abbreviation might have multiple words associated.
      In such a case, the full word that makes the most sense given the context is
      chosen (see the ``findOptimalOption`` method).

      :param abbrDatabase: pandas dataframe, dataframe containing the library of
         abbreviations and their corresponding full expressions
      :param text: str, string of text that will be analyzed
      :param type: str, abbreviation method (``'spellcheck'``, ``'hard'``, or ``'mixed'``)
         employed to determine which words are abbreviations that need to be expanded:

         * ``spellcheck``: the spellchecker is used to identify words that are not recognized
         * ``hard``: the abbreviations are searched for directly in the provided sentence
         * ``mixed``: a "hard" search is performed first, followed by a "spellcheck" search

      :returns: options, list of corrected text options
      :rtype: list


   .. py:method:: generateAbbrDict(abbrDatabase)

      Generates an ``abbrDict`` that can be used by ``handleAbbreviationsDict``.

      :param abbrDatabase: pandas dataframe, dataframe containing the library of
         abbreviations and their corresponding full expressions

      :returns: abbrDict, an abbreviations dictionary
      :rtype: dict


   .. py:method:: handleAbbreviationsDict(abbrDict, text, type)

      Performs automatic correction of abbreviations and returns corrected text.

      This method relies on a database of abbreviations located at
      ``src/nlp/data/abbreviations.xlsx``. This database contains the most common
      abbreviations collected from literature and provides, for each abbreviation,
      its corresponding full word(s); an abbreviation might have multiple words associated.
      In such a case, the full word that makes the most sense given the context is
      chosen (see the ``findOptimalOption`` method).

      :param abbrDict: dict, dictionary containing the library of abbreviations
         and their corresponding full expressions
      :param text: str, string of text that will be analyzed
      :param type: str, abbreviation method (``'spellcheck'``, ``'hard'``, or ``'mixed'``)
         employed to determine which words are abbreviations that need to be expanded:

         * ``spellcheck``: the spellchecker is used to identify words that are not recognized
         * ``hard``: the abbreviations are searched for directly in the provided sentence
         * ``mixed``: a "hard" search is performed first, followed by a "spellcheck" search

      :returns: options, list of corrected text options
      :rtype: list


   .. py:method:: findOptimalOption(options)

      Method to handle abbreviations with multiple meanings.

      :param options: list, list of sentence options

      :returns: optimalOpt, the option from the provided options list that fits best
      :rtype: str
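
The "hard" strategy described for ``handleAbbreviations`` and ``handleAbbreviationsDict``
can be illustrated with a minimal, self-contained sketch. This is not the DACKAR
implementation: the function name ``expandAbbreviationsHard`` and the assumed shape of
``abbrDict`` (a lower-case abbreviation mapped to a list of full expressions) are
hypothetical and chosen only to show how one sentence can yield several candidate options.

.. code-block:: python

   from itertools import product


   def expandAbbreviationsHard(abbrDict, text):
     """Hypothetical helper: build every sentence obtained by expanding known abbreviations."""
     # Assumed shape of abbrDict: {'abbreviation': ['full expression', ...]}, lower-case keys.
     choices = []
     for word in text.split():
       # A known abbreviation contributes all of its expansions; other words are kept as is.
       choices.append(abbrDict.get(word.lower(), [word]))
     # One candidate sentence per combination of expansions.
     return [' '.join(combo) for combo in product(*choices)]


   abbrDict = {'pmp': ['pump'], 'temp': ['temperature', 'temporary']}
   options = expandAbbreviationsHard(abbrDict, 'pmp temp is high')
   print(options)
   # ['pump temperature is high', 'pump temporary is high']

An abbreviation with several full expressions produces several candidate sentences;
selecting one of them is the step this documentation assigns to ``findOptimalOption``.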