src.dackar.text_processing.Preprocessing
========================================

.. py:module:: src.dackar.text_processing.Preprocessing

.. autoapi-nested-parse::

   Created on October, 2022

   @author: dgarrett622, wangc, mandd


Attributes
----------

.. autoapisummary::

   src.dackar.text_processing.Preprocessing.textacyNormalize
   src.dackar.text_processing.Preprocessing.textacyRemove
   src.dackar.text_processing.Preprocessing.textacyReplace
   src.dackar.text_processing.Preprocessing.numerizer
   src.dackar.text_processing.Preprocessing.preprocessorDefaultList
   src.dackar.text_processing.Preprocessing.preprocessorDefaultOptions


Classes
-------

.. autoapisummary::

   src.dackar.text_processing.Preprocessing.Preprocessing
   src.dackar.text_processing.Preprocessing.SpellChecker
   src.dackar.text_processing.Preprocessing.AbbrExpander


Module Contents
---------------

.. py:data:: textacyNormalize
   :value: ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'unicode', 'whitespace']


.. py:data:: textacyRemove
   :value: ['accents', 'brackets', 'html_tags', 'punctuation']


.. py:data:: textacyReplace
   :value: ['currency_symbols', 'emails', 'emojis', 'hashtags', 'numbers', 'phone_numbers', 'urls', 'user_handles']


.. py:data:: numerizer
   :value: ['numerize']


.. py:data:: preprocessorDefaultList
   :value: ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'whitespace',...


.. py:data:: preprocessorDefaultOptions

.. py:class:: Preprocessing(preprocessorList=preprocessorDefaultList, preprocessorOptions=preprocessorDefaultOptions)

   Bases: :py:obj:`object`


   NLP Preprocessing class


   .. py:attribute:: functionList
      :value: []


   .. py:attribute:: preprocessorNames
      :value: ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'unicode',...


   .. py:attribute:: pipeline


   .. py:method:: createTextacyNormalizeFunction(name, options)

      Creates a function from textacy.preprocessing.normalize such that the only argument is a string
      and adds it to the functionList

      :param name: str, name of the preprocessor
      :param options: dict, dictionary of preprocessor options

      :returns: None



   .. py:method:: createTextacyRemoveFunction(name, options)

      Creates a function from textacy.preprocessing.remove such that the only argument is a string
      and adds it to the functionList

      :param name: str, name of the preprocessor
      :param options: dict, dictionary of preprocessor options

      :returns: None



   .. py:method:: createTextacyReplaceFunction(name, options)

      Creates a function from textacy.preprocessing.replace such that the only argument is a string
      and adds it to the functionList

      :param name: str, name of the preprocessor
      :param options: dict, dictionary of preprocessor options

      :returns: None



   .. py:method:: __call__(text)

      Performs the preprocessing

      :param text: str, string of text to preprocess

      :returns: processed, string of processed text
      :rtype: str



.. py:class:: SpellChecker(checker='autocorrect')

   Bases: :py:obj:`object`


   Object to find misspelled words and automatically correct spelling

   Note: when using autocorrect, one needs to conduct a spell test to identify the threshold (the word frequencies)


   .. py:attribute:: checker
      :value: ''


   .. py:method:: addWordsToDictionary(words)

      Adds a list of words to the spell check dictionary

      :param words: list, list of words to add to the dictionary

      :returns: None



   .. py:method:: getMisspelledWords(text)

      Returns a list of words that are misspelled according to the dictionary used

      :param text: str, string of text to check

      :returns: misspelled, list of misspelled words
      :rtype: list



   .. py:method:: correct(text)

      Performs automatic spelling correction and returns corrected text

      :param text: str, string of text to correct

      :returns: corrected, spelling corrected text
      :rtype: str



   .. py:method:: handleAbbreviations(abbrDatabase, text, type)

      Performs automatic correction of abbreviations and returns corrected text

      This method relies on a database of abbreviations located at:
      ``src/nlp/data/abbreviations.xlsx``

      This database contains the most common abbreviations collected from the literature, and it
      provides for each abbreviation its corresponding full word(s). An abbreviation might have
      multiple words associated with it; in that case the full word that makes the most sense given
      the context is chosen (see the findOptimalOption method).

      :param abbrDatabase: pandas dataframe, dataframe containing the library of abbreviations
                           and their corresponding full expressions
      :param text: str, string of text that will be analyzed
      :param type: str, abbreviation method ('spellcheck', 'hard', 'mixed') that is employed
                   to determine which words are abbreviations that need to be expanded:

                   * spellcheck: a spell checker is used to identify words that are not recognized
                   * hard: the abbreviations are searched for directly in the provided sentence
                   * mixed: a "hard" search is performed first, followed by a "spellcheck" search

      :returns: options, list of corrected text options
      :rtype: list



   .. py:method:: generateAbbrDict(abbrDatabase)

      Generates an AbbrDict that can be used by handleAbbreviationsDict

      :param abbrDatabase: pandas dataframe, dataframe containing the library of abbreviations
                           and their corresponding full expressions

      :returns: abbrDict, an abbreviations dictionary
      :rtype: dict



   .. py:method:: handleAbbreviationsDict(abbrDict, text, type)

      Performs automatic correction of abbreviations and returns corrected text

      This method relies on a database of abbreviations located at:
      ``src/nlp/data/abbreviations.xlsx``

      This database contains the most common abbreviations collected from the literature, and it
      provides for each abbreviation its corresponding full word(s). An abbreviation might have
      multiple words associated with it; in that case the full word that makes the most sense given
      the context is chosen (see the findOptimalOption method).

      :param abbrDict: dict, dictionary containing the library of abbreviations
                       and their corresponding full expressions
      :param text: str, string of text that will be analyzed
      :param type: str, abbreviation method ('spellcheck', 'hard', 'mixed') that is employed
                   to determine which words are abbreviations that need to be expanded:

                   * spellcheck: a spell checker is used to identify words that are not recognized
                   * hard: the abbreviations are searched for directly in the provided sentence
                   * mixed: a "hard" search is performed first, followed by a "spellcheck" search

      :returns: options, list of corrected text options
      :rtype: list



   .. py:method:: findOptimalOption(options)

      Method to handle abbreviations with multiple meanings

      :param options: list, list of sentence options

      :returns: optimalOpt, the option from the provided options list that best fits the context
      :rtype: str



.. py:class:: AbbrExpander(abbreviationsFilename, checkerType='autocorrect', abbrType='mixed')

   Bases: :py:obj:`object`


   Class to expand abbreviations


   .. py:attribute:: abbrType
      :value: 'mixed'


   .. py:attribute:: checkerType
      :value: 'autocorrect'


   .. py:attribute:: abbrList


   .. py:attribute:: preprocessorList
      :value: ['hyphenated_words', 'whitespace', 'numerize']


   .. py:attribute:: preprocess


   .. py:attribute:: checker


   .. py:attribute:: abbrDict


   .. py:method:: abbrProcess(text, splitToList=False)

      Expands the abbreviations in text

      :param text: str, the text to expand

      :returns: expandedText, the text with abbreviations expanded
      :rtype: str
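
The ``Preprocessing`` class is documented as wrapping each preprocessor so that its only argument is a string, collecting the wrapped callables in ``functionList``, and applying them when the object is called. The block below is a minimal, standard-library sketch of that pattern and not the DACKAR implementation: ``MiniPipeline``, ``collapse_whitespace``, ``replace_numbers`` and the ``repl`` option are hypothetical stand-ins for the textacy-backed functions, and applying the functions in list order is an assumption.

.. code-block:: python

   from functools import partial
   import re


   def collapse_whitespace(text):
       """Hypothetical stand-in normalizer: collapse runs of whitespace."""
       return re.sub(r"\s+", " ", text).strip()


   def replace_numbers(text, repl="_NUMBER_"):
       """Hypothetical stand-in replacer: substitute numbers with a token."""
       return re.sub(r"\d+(?:\.\d+)?", repl, text)


   class MiniPipeline:
       """Illustrative only; not the DACKAR Preprocessing class."""

       # Map preprocessor names to the callables that implement them.
       available = {
           "whitespace": collapse_whitespace,
           "numbers": replace_numbers,
       }

       def __init__(self, names, options=None):
           options = options or {}
           self.functionList = []
           for name in names:
               if name not in self.available:
                   raise ValueError(f"unknown preprocessor: {name}")
               # Bind the keyword options now, so the stored callable
               # takes the text as its only argument.
               func = partial(self.available[name], **options.get(name, {}))
               self.functionList.append(func)

       def __call__(self, text):
           # Apply every stored function in order (assumed ordering).
           for func in self.functionList:
               text = func(text)
           return text


   pipe = MiniPipeline(["whitespace", "numbers"], {"numbers": {"repl": "NUM"}})
   result = pipe("pump   3 failed  after 12.5 hours")
   print(result)

Binding the options with ``functools.partial`` at construction time keeps the call path uniform: every entry in ``functionList`` has the same one-argument shape, whatever options the underlying function accepts.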