src.dackar.text_processing.Preprocessing
========================================

.. py:module:: src.dackar.text_processing.Preprocessing

.. autoapi-nested-parse::

   Created on October, 2022

   @author: dgarrett622, wangc, mandd


Attributes
----------

.. autoapisummary::

   src.dackar.text_processing.Preprocessing.textacyNormalize
   src.dackar.text_processing.Preprocessing.textacyRemove
   src.dackar.text_processing.Preprocessing.textacyReplace
   src.dackar.text_processing.Preprocessing.numerizer
   src.dackar.text_processing.Preprocessing.preprocessorDefaultList
   src.dackar.text_processing.Preprocessing.preprocessorDefaultOptions


Classes
-------

.. autoapisummary::

   src.dackar.text_processing.Preprocessing.Preprocessing
   src.dackar.text_processing.Preprocessing.SpellChecker
   src.dackar.text_processing.Preprocessing.AbbrExpander


Module Contents
---------------

.. py:data:: textacyNormalize
   :value: ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'unicode', 'whitespace']


.. py:data:: textacyRemove
   :value: ['accents', 'brackets', 'html_tags', 'punctuation']


.. py:data:: textacyReplace
   :value: ['currency_symbols', 'emails', 'emojis', 'hashtags', 'numbers', 'phone_numbers', 'urls', 'user_handles']


.. py:data:: numerizer
   :value: ['numerize']


.. py:data:: preprocessorDefaultList
   :value: ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'whitespace',...


.. py:data:: preprocessorDefaultOptions

.. py:class:: Preprocessing(preprocessorList=preprocessorDefaultList, preprocessorOptions=preprocessorDefaultOptions)

   Bases: :py:obj:`object`


   NLP Preprocessing class


   .. py:attribute:: functionList
      :value: []


   .. py:attribute:: preprocessorNames
      :value: ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'unicode',...


   .. py:attribute:: pipeline


   .. py:method:: createTextacyNormalizeFunction(name, options)

      Creates a function from textacy.preprocessing.normalize such that the only argument is a string
      and adds it to the functionList

      :param name: str, name of the preprocessor
      :param options: dict, dictionary of preprocessor options

      :returns: None


   .. py:method:: createTextacyRemoveFunction(name, options)

      Creates a function from textacy.preprocessing.remove such that the only argument is a string
      and adds it to the functionList

      :param name: str, name of the preprocessor
      :param options: dict, dictionary of preprocessor options

      :returns: None


   .. py:method:: createTextacyReplaceFunction(name, options)

      Creates a function from textacy.preprocessing.replace such that the only argument is a string
      and adds it to the functionList

      :param name: str, name of the preprocessor
      :param options: dict, dictionary of preprocessor options

      :returns: None


   .. py:method:: __call__(text)

      Performs the preprocessing

      :param text: str, string of text to preprocess

      :returns: processed, string of processed text
      :rtype: str
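
The ``create...Function`` methods above describe binding each preprocessor's options so that the resulting callable takes only a string, and ``__call__`` then runs the text through the collected functions. A minimal sketch of that pattern follows. It uses only the standard library, and ``normalize_whitespace``, ``remove_punctuation`` and ``build_pipeline`` are hypothetical stand-ins, not the textacy functions or the actual DACKAR implementation.

```python
import re
from functools import partial, reduce


# Hypothetical stand-ins for textacy.preprocessing functions (assumption).
def normalize_whitespace(text):
    """Collapse runs of whitespace into one space and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def remove_punctuation(text, only=None):
    """Replace the given punctuation marks (default: a small fixed set) with a space."""
    marks = only if only is not None else list(".,;:!?")
    return re.sub("[" + re.escape("".join(marks)) + "]", " ", text)


def build_pipeline(names, options):
    """Bind each preprocessor's options so that every step takes only a string."""
    registry = {"whitespace": normalize_whitespace, "punctuation": remove_punctuation}
    function_list = [partial(registry[name], **options.get(name, {})) for name in names]

    def pipeline(text):
        # Apply the single-argument functions in the order they were listed.
        return reduce(lambda acc, func: func(acc), function_list, text)

    return pipeline


pipeline = build_pipeline(["punctuation", "whitespace"], {"punctuation": {"only": [",", "!"]}})
result = pipeline("Pump  failed,   again!")
print(result)
```

The order of the names matters in this sketch: punctuation is removed first so that the whitespace step can clean up the spaces it leaves behind.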


.. py:class:: SpellChecker(checker='autocorrect')

   Bases: :py:obj:`object`


   Object to find misspelled words and automatically correct spelling

   Note: when using autocorrect, a spell test is needed to identify the threshold (the word frequencies)


   .. py:attribute:: checker
      :value: ''


   .. py:method:: addWordsToDictionary(words)

      Adds a list of words to the spell check dictionary

      :param words: list, list of words to add to the dictionary

      :returns: None


   .. py:method:: getMisspelledWords(text)

      Returns a list of words that are misspelled according to the dictionary used

      :param text: str, string of text to check for misspelled words

      :returns: misspelled, list of misspelled words
      :rtype: list


   .. py:method:: correct(text)

      Performs automatic spelling correction and returns corrected text

      :param text: str, string of text to correct

      :returns: corrected, spelling corrected text
      :rtype: str


   .. py:method:: handleAbbreviations(abbrDatabase, text, type)

      Performs automatic correction of abbreviations and returns the corrected text.
      This method relies on a database of abbreviations located at
      ``src/nlp/data/abbreviations.xlsx``.
      This database contains the most common abbreviations collected from the literature and
      provides, for each abbreviation, its corresponding full word(s). An abbreviation might
      have multiple words associated with it. In that case, the full word that makes the most
      sense given the context is chosen (see the findOptimalOption method).

      :param abbrDatabase: pandas dataframe, dataframe containing the library of abbreviations
                           and their corresponding full expressions
      :param text: str, string of text that will be analyzed
      :param type: string, type of abbreviation method ('spellcheck', 'hard', 'mixed') employed
                   to determine which words are abbreviations that need to be expanded:

                   * spellcheck: the spellchecker is used to identify words that
                     are not recognized
                   * hard: the abbreviations are searched for directly in the provided
                     sentence
                   * mixed: a "hard" search is performed first, followed by a "spellcheck"
                     search

      :returns: options, list of corrected text options
      :rtype: list


   .. py:method:: generateAbbrDict(abbrDatabase)

      Generates an AbbrDict that can be used by handleAbbreviationsDict

      :param abbrDatabase: pandas dataframe, dataframe containing the library of abbreviations
                           and their corresponding full expressions

      :returns: abbrDict, an abbreviation dictionary
      :rtype: dictionary


   .. py:method:: handleAbbreviationsDict(abbrDict, text, type)

      Performs automatic correction of abbreviations and returns the corrected text.
      This method relies on a database of abbreviations located at
      ``src/nlp/data/abbreviations.xlsx``.
      This database contains the most common abbreviations collected from the literature and
      provides, for each abbreviation, its corresponding full word(s). An abbreviation might
      have multiple words associated with it. In that case, the full word that makes the most
      sense given the context is chosen (see the findOptimalOption method).

      :param abbrDict: dictionary, dictionary containing the library of abbreviations
                       and their corresponding full expressions
      :param text: str, string of text that will be analyzed
      :param type: string, type of abbreviation method ('spellcheck', 'hard', 'mixed') employed
                   to determine which words are abbreviations that need to be expanded:

                   * spellcheck: the spellchecker is used to identify words that
                     are not recognized
                   * hard: the abbreviations are searched for directly in the provided
                     sentence
                   * mixed: a "hard" search is performed first, followed by a "spellcheck"
                     search

      :returns: options, list of corrected text options
      :rtype: list


   .. py:method:: findOptimalOption(options)

      Handles abbreviations with multiple meanings

      :param options: list, list of sentence options

      :returns: optimalOpt, the option from the provided options list that best fits
                the context
      :rtype: string
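
The abbreviation methods above describe building a dictionary from an abbreviation table and returning a list of corrected text options when an abbreviation has more than one full expression. The sketch below illustrates only the 'hard' search, under stated assumptions: the table is a plain list of hypothetical ``(abbreviation, full expression)`` rows instead of a pandas dataframe, and ``generate_abbr_dict`` and ``expand_hard`` are illustrative helpers, not the DACKAR methods. How ``findOptimalOption`` ranks the options is not shown here.

```python
from itertools import product

# Hypothetical abbreviation table standing in for the abbreviation database (assumption).
rows = [
    ("pmp", "pump"),
    ("vlv", "valve"),
    ("temp", "temperature"),
    ("temp", "temporary"),
]


def generate_abbr_dict(rows):
    """Group full expressions by abbreviation; one abbreviation may map to several."""
    abbr_dict = {}
    for abbr, full in rows:
        abbr_dict.setdefault(abbr.lower(), []).append(full)
    return abbr_dict


def expand_hard(abbr_dict, text):
    """'hard' search: look each token up directly and return every expansion option."""
    # Tokens that are not in the dictionary are kept unchanged.
    choices = [abbr_dict.get(token.lower(), [token]) for token in text.split()]
    return [" ".join(combo) for combo in product(*choices)]


abbr_dict = generate_abbr_dict(rows)
options = expand_hard(abbr_dict, "pmp temp high")
print(options)
```

Each ambiguous abbreviation multiplies the number of options, which is why a separate step is needed to pick the sentence that fits the context.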


.. py:class:: AbbrExpander(abbreviationsFilename, checkerType='autocorrect', abbrType='mixed')

   Bases: :py:obj:`object`


   Class to expand abbreviations


   .. py:attribute:: abbrType
      :value: 'mixed'


   .. py:attribute:: checkerType
      :value: 'autocorrect'


   .. py:attribute:: abbrList


   .. py:attribute:: preprocessorList
      :value: ['hyphenated_words', 'whitespace', 'numerize']


   .. py:attribute:: preprocess


   .. py:attribute:: checker


   .. py:attribute:: abbrDict


   .. py:method:: abbrProcess(text, splitToList=False)

      Expands the abbreviations in text

      :param text: string, the text to expand

      :returns: expandedText, the text with abbreviations expanded
      :rtype: string