src.dackar.text_processing.Preprocessing

Created on October, 2022

@author: dgarrett622, wangc, mandd

Attributes

textacyNormalize

textacyRemove

textacyReplace

numerizer

preprocessorDefaultList

preprocessorDefaultOptions

Classes

Preprocessing

NLP Preprocessing class

SpellChecker

Object to find misspelled words and automatically correct spelling

AbbrExpander

Class to expand abbreviations

Module Contents

src.dackar.text_processing.Preprocessing.textacyNormalize = ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'unicode', 'whitespace']
src.dackar.text_processing.Preprocessing.textacyRemove = ['accents', 'brackets', 'html_tags', 'punctuation']
src.dackar.text_processing.Preprocessing.textacyReplace = ['currency_symbols', 'emails', 'emojis', 'hashtags', 'numbers', 'phone_numbers', 'urls', 'user_handles']
src.dackar.text_processing.Preprocessing.numerizer = ['numerize']
src.dackar.text_processing.Preprocessing.preprocessorDefaultList = ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'whitespace',...
src.dackar.text_processing.Preprocessing.preprocessorDefaultOptions
class src.dackar.text_processing.Preprocessing.Preprocessing(preprocessorList=preprocessorDefaultList, preprocessorOptions=preprocessorDefaultOptions)

Bases: object

NLP Preprocessing class

functionList = []
preprocessorNames = ['bullet_points', 'hyphenated_words', 'quotation_marks', 'repeating_chars', 'unicode',...
pipeline
createTextacyNormalizeFunction(name, options)

Creates a function from textacy.preprocessing.normalize such that the only argument is a string and adds it to the functionList

Parameters:
  • name – str, name of the preprocessor

  • options – dict, dictionary of preprocessor options

Returns:

None

createTextacyRemoveFunction(name, options)

Creates a function from textacy.preprocessing.remove such that the only argument is a string and adds it to the functionList

Parameters:
  • name – str, name of the preprocessor

  • options – dict, dictionary of preprocessor options

Returns:

None

createTextacyReplaceFunction(name, options)

Creates a function from textacy.preprocessing.replace such that the only argument is a string and adds it to the functionList

Parameters:
  • name – str, name of the preprocessor

  • options – dict, dictionary of preprocessor options

Returns:

None

__call__(text)

Performs the preprocessing

Parameters:

text – str, string of text to preprocess

Returns:

processed – str, string of processed text
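
The create...Function methods above each wrap a configurable preprocessor so that it takes only a string, and __call__ then runs the stored functions on the text. The following is a minimal sketch of that pattern using only the standard library. The stand-in preprocessors (strip_chars, collapse_whitespace), the REGISTRY mapping, and build_pipeline are hypothetical names chosen for illustration; they are not part of dackar or textacy, and the real class may bind options or order the functions differently.

```python
from functools import partial, reduce


# Hypothetical stand-in preprocessors; the real class wraps textacy functions.
def strip_chars(text, chars=""):
    """Remove every character listed in `chars` from the text."""
    return "".join(ch for ch in text if ch not in chars)


def collapse_whitespace(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return " ".join(text.split())


# Assumed lookup table from preprocessor name to implementation.
REGISTRY = {"strip_chars": strip_chars, "whitespace": collapse_whitespace}


def build_pipeline(names, options):
    """Bind options to each named preprocessor and chain the results."""
    function_list = []
    for name in names:
        # partial() fixes the keyword options so only `text` remains.
        function_list.append(partial(REGISTRY[name], **options.get(name, {})))

    def pipeline(text):
        # Apply the functions in list order, feeding each output to the next.
        return reduce(lambda acc, func: func(acc), function_list, text)

    return pipeline


preprocess = build_pipeline(
    ["strip_chars", "whitespace"],
    {"strip_chars": {"chars": "[]"}},
)
print(preprocess("pump  [P-101]   failed"))  # pump P-101 failed
```

Binding the options once, at construction time, keeps the call site simple: every element of the function list has the same one-argument shape, so the list can be reordered or extended without touching the code that applies it.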

class src.dackar.text_processing.Preprocessing.SpellChecker(checker='autocorrect')

Bases: object

Object to find misspelled words and automatically correct spelling

Note: when using autocorrect, one needs to conduct a spell test to identify the threshold (the word frequencies)

checker = ''
addWordsToDictionary(words)

Adds a list of words to the spell check dictionary

Parameters:

words – list, list of words to add to the dictionary

Returns:

None

getMisspelledWords(text)

Returns a list of words that are misspelled according to the dictionary used

Parameters:

text – str, string of text to check for misspelled words

Returns:

misspelled – list, list of misspelled words

correct(text)

Performs automatic spelling correction and returns corrected text

Parameters:

text – str, string of text to spell check and correct

Returns:

corrected – str, spelling corrected text

handleAbbreviations(abbrDatabase, text, type)

Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from the literature and provides, for each abbreviation, its corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).

Parameters:
  • abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression

  • text – str, string of text that will be analyzed

  • type – string, type of abbreviation method ('spellcheck', 'hard', 'mixed') that is employed to determine which words are abbreviations that need to be expanded

      • spellcheck – in this case spellchecker is used to identify words that are not recognized

      • hard – here we directly search for the abbreviations in the provided sentence

      • mixed – here we perform first a "hard" search followed by a "spellcheck" search

Returns:

options – list, list of corrected text options
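
To illustrate how a "hard" search can yield several corrected text options when an abbreviation has more than one full expression, here is a minimal sketch. It assumes whitespace tokenization, case-insensitive matching, and a plain dict that maps each abbreviation to a list of expansions. The function name expand_hard and the sample data are hypothetical, and the library's own implementation may tokenize and match differently.

```python
import itertools


def expand_hard(abbr_dict, text):
    """Return every sentence obtained by expanding known abbreviations.

    abbr_dict maps a lower-case abbreviation to a list of full expressions
    (an assumed layout, chosen for this sketch).
    """
    choices = []
    for token in text.split():
        # Known abbreviation: all of its expansions are candidates.
        # Unknown token: the token itself is the only candidate.
        choices.append(abbr_dict.get(token.lower(), [token]))
    # The Cartesian product enumerates one candidate per token position.
    return [" ".join(combo) for combo in itertools.product(*choices)]


# Hypothetical sample data for illustration only.
sample_dict = {"pmp": ["pump"], "temp": ["temperature", "temporary"]}
options = expand_hard(sample_dict, "pmp temp high")
print(options)  # ['pump temperature high', 'pump temporary high']
```

The number of options grows multiplicatively with the number of ambiguous abbreviations in a sentence, which is why a separate selection step such as findOptimalOption is needed to pick one.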

generateAbbrDict(abbrDatabase)

Generates an AbbrDict that can be used by handleAbbreviationsDict

Parameters:

abbrDatabase – pandas dataframe, dataframe containing library of abbreviations and their corresponding full expression

Returns:

abbrDict – dictionary, an abbreviations dictionary
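
As a rough illustration of turning a two-column abbreviation table into a lookup dictionary, the sketch below groups full expressions by abbreviation. It uses a list of (abbreviation, full expression) tuples in place of the pandas dataframe so that it runs with the standard library alone. The function name, the lower-casing, and the de-duplication are assumptions made for this sketch, not a description of what generateAbbrDict does internally.

```python
from collections import defaultdict


def generate_abbr_dict(rows):
    """Group full expressions by abbreviation.

    rows is an iterable of (abbreviation, full expression) pairs, standing
    in for the rows of the abbreviation dataframe.
    """
    abbr_dict = defaultdict(list)
    for abbr, full in rows:
        key = abbr.strip().lower()
        value = full.strip().lower()
        # Keep each expansion once, preserving first-seen order.
        if value not in abbr_dict[key]:
            abbr_dict[key].append(value)
    return dict(abbr_dict)


# Hypothetical rows for illustration only.
rows = [
    ("TEMP", "temperature"),
    ("temp", "temporary"),
    ("PMP", "pump"),
    ("pmp", "pump"),
]
abbr_dict = generate_abbr_dict(rows)
print(abbr_dict)  # {'temp': ['temperature', 'temporary'], 'pmp': ['pump']}
```

Building the dictionary once and reusing it avoids scanning the full table for every sentence that is processed.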

handleAbbreviationsDict(abbrDict, text, type)

Performs automatic correction of abbreviations and returns corrected text. This method relies on a database of abbreviations located at: src/nlp/data/abbreviations.xlsx. This database contains the most common abbreviations collected from the literature and provides, for each abbreviation, its corresponding full word(s). An abbreviation might have multiple words associated with it; in that case, the full word that makes the most sense given the context is chosen (see the findOptimalOption method).

Parameters:
  • abbrDict – dictionary, dictionary containing library of abbreviations and their corresponding full expression

  • text – str, string of text that will be analyzed

  • type – string, type of abbreviation method ('spellcheck', 'hard', 'mixed') that is employed to determine which words are abbreviations that need to be expanded

      • spellcheck – in this case spellchecker is used to identify words that are not recognized

      • hard – here we directly search for the abbreviations in the provided sentence

      • mixed – here we perform first a "hard" search followed by a "spellcheck" search

Returns:

options – list, list of corrected text options

findOptimalOption(options)

Method to handle abbreviations with multiple meanings

Parameters:

options – list, list of sentence options

Returns:

optimalOpt – string, option from the provided options list that best fits the context

class src.dackar.text_processing.Preprocessing.AbbrExpander(abbreviationsFilename, checkerType='autocorrect', abbrType='mixed')

Bases: object

Class to expand abbreviations

abbrType = 'mixed'
checkerType = 'autocorrect'
abbrList
preprocessorList = ['hyphenated_words', 'whitespace', 'numerize']
preprocess
checker
abbrDict
abbrProcess(text, splitToList=False)

Expands the abbreviations in text

Parameters:

text – string, the text to expand

Returns:

expandedText – string, the text with abbreviations expanded