Preprocessing demo

This notebook shows how to use the Preprocessing and SpellChecker classes for cleaning, numerizing, and spell checking raw data.

[1]:
import os, sys, time

cwd = os.getcwd()
frameworkDir = os.path.abspath(os.path.join(cwd, os.pardir, 'src'))
sys.path.append(frameworkDir)

from dackar.text_processing.Preprocessing import Preprocessing
from dackar.text_processing.Preprocessing import SpellChecker
import warnings
warnings.filterwarnings("ignore")
Warming up PyWSD (takes ~10 secs)... took 2.4006917476654053 secs.

Text to clean and numerize

[2]:
text = ("bullet_points:\n"
        "\n‣ item1\n⁃ item2\n⁌ item3\n⁍ item4\n∙ item5\n▪ item6\n● item7\n◦ item8\n"
        "=======================\n"
        "hyphenated_words:\n"
        "I see you shiver with antici- pation.\n"
        "I see you shiver with antici-   \npation.\n"
        "I see you shiver with antici- PATION.\n"
        "I see you shiver with antici- 1pation.\n"
        "I see you shiver with antici pation.\n"
        "I see you shiver with antici-pation.\n"
        "My phone number is 555- 1234.\n"
        "I got an A- on the test.\n"
        "=======================\n"
        "quotation_marks:\n"
        "These are ´funny single quotes´.\n"
        "These are ‘fancy single quotes’.\n"
        "These are “fancy double quotes”.\n"
        "=======================\n"
        "repeating_chars:\n"
        "**Hello**, world!!! I wonder....... How are *you* doing?!?! lololol\n"
        "=======================\n"
        "unicode:\n"
        "Well… That's a long story.\n"
        "=======================\n"
        "whitespace:\n"
        "Hello,  world!\n"
        "Hello,     world!\n"
        "Hello,\tworld!\n"
        "Hello,\t\t  world!\n"
        "Hello,\n\nworld!\n"
        "Hello,\r\nworld!\n"
        "Hello\uFEFF, world!\n"
        "Hello\u200B\u200B, world!\n"
        "Hello\uFEFF,\n\n\nworld   !  \n"
        "=======================\n"
        "accents:\n"
        "El niño se asustó del pingüino -- qué miedo!\n"
        "Le garçon est très excité pour la forêt.\n"
        "=======================\n"
        "brackets:\n"
        "Hello, {name}!\n"
        "Hello, world (DeWilde et al., 2021, p. 42)!\n"
        "Hello, world (1)!\n"
        "Hello, world [1]!\n"
        "Hello, world (and whomever it may concern [not that it's any of my business])!\n"
        "Hello, world (and whomever it may concern (not that it's any of my business))!\n"
        "Hello, world (and whomever it may concern [not that it's any of my business])!\n"
        "Hello, world [1]!\n"
        "Hello, world [1]!\n"
        "=======================\n"
        "html_tags:\n"
        "Hello, <i>world!</i>\n"
        "<title>Hello, world!</title>\n"
        '<title class="foo">Hello, world!</title>\n'
        "<html><head><title>Hello, <i>world!</i></title></head></html>\n"
            "<html>\n"
            "  <head>\n"
            '    <title class="foo">Hello, <i>world!</i></title>\n'
            "  </head>\n"
            "  <!--this is a comment-->\n"
            "  <body>\n"
            "    <p>How's it going?</p>\n"
            "  </body>\n"
            "</html>\n"
        "=======================\n"
        "punctuation:\n"
        "I can't. No, I won't! It's a matter of \"principle\"; of -- what's the word? -- conscience.\n"
        "=======================\n"
        "currency_symbols:\n"
        "$1.00 equals 100¢.\n"
        "How much is ¥100 in £?\n"
        "My password is 123$abc฿.\n"
        "=======================\n"
        "emails:\n"
        "Reach out at username@example.com.\n"
        "Click here: mailto:username@example.com.\n"
        "=======================\n"
        "emoji:\n"
        "ugh, it's raining *again* ☔\n"
        "✌ tests are passing ✌\n"
        "=======================\n"
        "hashtags:\n"
        "like omg it's #ThrowbackThursday\n"
        "#TextacyIn4Words: \"but it's honest work\"\n"
        "wth twitter #ican'teven #why-even-try\n"
        "www.foo.com#fragment is not a hashtag\n"
        "=======================\n"
        "numbers:\n"
        "I owe $1,000.99 to 123 people for 2 +1 reasons.\n"
        "=======================\n"
        "phone_numbers:\n"
        "I can be reached at 555-123-4567 through next Friday.\n"
        "=======================\n"
        "urls:\n"
        "I learned everything I know from www.stackoverflow.com and http://wikipedia.org/ and Mom.\n"
        "=======================\n"
        "user_handles:\n"
        "@Real_Burton_DeWilde: definitely not a bot\n"
        "wth twitter @b.j.dewilde\n"
        "foo@bar.com is not a user handle\n"
        "=======================\n"
        "numerize:\n"
        "forty-two\n"
        "four hundred and sixty two\n"
        "one fifty\n"
        "twelve hundred\n"
        "twenty one thousand four hundred and seventy three\n"
        "one billion and one\n"
        "nine and three quarters\n"
)

Pipeline creation

Preprocessing requires a list of all desired preprocessors and a dictionary containing any additional options from textacy. Each top-level key of the options dictionary is the name of a preprocessor, and its value is a dictionary of options for that preprocessor. See the Text Preprocessing section from https://textacy.readthedocs.io/en/latest/ for available options.
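
To make the name-to-options mapping concrete, here is a minimal sketch of how such a dictionary could be bound to preprocessing functions. It is not how Preprocessing is implemented: `REGISTRY`, `build_pipeline`, and the two stand-in functions are hypothetical and use only the standard library.

```python
import re
from functools import partial

# Hypothetical stand-ins for two preprocessors; not the textacy implementations.
def repeating_chars(text, *, chars, maxn=1):
    # Collapse runs of `chars` repeated more than `maxn` times down to `maxn` repeats.
    pattern = r"(?:{}){{{},}}".format(re.escape(chars), maxn + 1)
    return re.sub(pattern, chars * maxn, text)

def whitespace(text):
    # Collapse runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text)

# Hypothetical registry mapping preprocessor names to functions.
REGISTRY = {"repeating_chars": repeating_chars, "whitespace": whitespace}

def build_pipeline(names, options):
    # Bind each function to the options stored under its own name, if any.
    return [partial(REGISTRY[name], **options.get(name, {})) for name in names]

steps = build_pipeline(
    ["repeating_chars", "whitespace"],
    {"repeating_chars": {"chars": "ol", "maxn": 2}},
)

result = "lololol   ok"
for step in steps:
    result = step(result)
print(result)
```

Preprocessors without an entry in the options dictionary simply run with their defaults, which is why only some names need to appear as keys.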

This example pipeline includes all of the textacy preprocessors plus numerize. Using all of them together may produce unexpected results, depending on the order in which they are applied.
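
As an illustration of why order matters, the sketch below applies two simplified replacement steps in both orders. The functions are regex stand-ins written for this example, not the textacy implementations, and the `_PHONE_` placeholder is an assumption. An interaction of this kind may be why the phone number in the demo output appears as a sequence of `_NUMBER_` tokens.

```python
import re

# Simplified stand-ins written for illustration; not the textacy implementations.
def replace_numbers(text, repl="_NUMBER_"):
    # Replace every run of digits.
    return re.sub(r"\d+", repl, text)

def replace_phone_numbers(text, repl="_PHONE_"):
    # Replace only the NNN-NNN-NNNN pattern.
    return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", repl, text)

def run_pipeline(text, steps):
    # Apply each step to the output of the previous one.
    for step in steps:
        text = step(text)
    return text

sample = "I can be reached at 555-123-4567 through next Friday."

numbers_first = run_pipeline(sample, [replace_numbers, replace_phone_numbers])
phones_first = run_pipeline(sample, [replace_phone_numbers, replace_numbers])

print(numbers_first)
print(phones_first)
```

When the generic step runs first, it consumes the digits and the more specific pattern no longer matches, so the more specific step generally needs to come earlier in the list.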

[3]:
preprocessorList = ['bullet_points',
                    'hyphenated_words',
                    'quotation_marks',
                    'repeating_chars',
                    'unicode',
                    'whitespace',
                    'accents',
                    'brackets',
                    'html_tags',
                    'punctuation',
                    'currency_symbols',
                    'emails',
                    'emojis',
                    'hashtags',
                    'numbers',
                    'phone_numbers',
                    'urls',
                    'user_handles',
                    'numerize']
preprocessorOptions = {'repeating_chars': {'chars': 'ol', 'maxn': 2},
                       'unicode': {'form': 'NFKC'},
                       'accents': {'fast': False},
                       'brackets': {'only': 'square'},
                       'punctuation': {'only': '\''}}
[4]:
preprocess = Preprocessing(preprocessorList, preprocessorOptions)
post = preprocess(text)
print(post)
bullet_points:
  item1
  item2
  item3
  item4
  item5
  item6
  item7
  item8
=======================
hyphenated_words:
I see you shiver with anticipation.
I see you shiver with anticipation.
I see you shiver with anticiPATION.
I see you shiver with antici  1pation.
I see you shiver with antici pation.
I see you shiver with antici pation.
My phone number is _NUMBER_  _NUMBER_.
I got an 1  on the test.
=======================
quotation_marks:
These are funny single quotes .
These are fancy single quotes .
These are "fancy double quotes".
=======================
repeating_chars:
**Hello**, world!!! I wonder....... How are *you* doing?!?! lolol
=======================
unicode:
Well... That s a long story.
=======================
whitespace:
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello,
world!
Hello,
world!
Hello, world!
Hello, world!
Hello,
world !
=======================
accents:
El nino se asusto del pinguino -  que miedo!
Le garcon est tres excite pour la foret.
=======================
brackets:
Hello, {name}!
Hello, world (DeWilde et al., _NUMBER_, p. _NUMBER_)!
Hello, world (_NUMBER_)!
Hello, world !
Hello, world (and whomever it may concern )!
Hello, world (and whomever it may concern (not that it s any of my business))!
Hello, world (and whomever it may concern )!
Hello, world !
Hello, world !
=======================
html_tags:
Hello, world!
Hello, world!
Hello, world!
Hello, world!


 Hello, world!



 How s it going?


=======================
punctuation:
I can t. No, I won t! It s a matter of "principle"; of -  what s the word? -  conscience.
=======================
currency_symbols:
_CUR_1.00 equals 100_CUR_.
How much is _CUR_100 in _CUR_?
My password is 123_CUR_abc_CUR_.
=======================
emails:
Reach out at username at example.com.
Click here: mailto:username at example.com.
=======================
emoji:
ugh, it s raining *again* _EMOJI_
_EMOJI_ tests are passing _EMOJI_
=======================
hashtags:
like omg it s _TAG_
_TAG_: "but it s honest work"
wth twitter _TAG_ teven _TAG_ even try
_URL_#fragment is not a hashtag
=======================
numbers:
I owe _CUR_1,000.99 to _NUMBER_ people for _NUMBER_ _NUMBER_ reasons.
=======================
phone_numbers:
I can be reached at _NUMBER_ _NUMBER_ _NUMBER_ through next Friday.
=======================
urls:
I learned everything I know from _URL_ and _URL_ and Mom.
=======================
user_handles:
 at Real_Burton_DeWilde: definitely not a bot
wth twitter at b.j.dewilde
foo at bar.com is not a user handle
=======================
numerize:
42
462
150
1200
21473
1000000001
9.75

Coherent text example with Autocorrect, PySpellChecker, and ContextualSpellCheck spelling correction

The text was taken from data/raw_text.txt and modified to have spelling mistakes and other items to clean up.

Note that numerizer automatically converts the first “A” to “1”; this behavior cannot be avoided.

[5]:
text = ("A laek was noticed from the RCP pump 1A.\n"
        "A light was unplugged.\n"
        "RCP pump 1A presure gauge was found not operating.\n"
        "RCP pump 1A pressure gauge was found inoperative.\n"
        "RCP pump 1A had signs of past leakage.\n"
        "The Pump is not experiencing enough flow druing test.\n"
        "Slight Vibrations is noticed - likely from pump shaft deflection.\n"
        "Pump flow meter was not responding.\n"
        "Rupture of pump bearings caused pump shaft degradation.\n"
        "Rupture of pump bearings caused pump shaft degradation and consequent flow reduction.\n"
        "Power supply has been found burnout.\n"
        "Pump test failed due to power supply failure.\n"
        "Pump inspection revieled excessive impeller degradation.\n"
        "Pump inspection revealed exessive impeller degradation likely due to cavitation.\n"
        "Oil puddle was found in proximity of RCP pump 1A.\n"
        "Anomalous vibrations were observed for RCP pump 1A.\n"
        "Three cracks on pump shaft were observed; they could have caused pump failure within four days.\n"
        "RCP pump 1A was cavitating and vibrating to some degree during test.\n"
        "This is most likely due to low flow conditions rather than mechanical issues.\n"
        "Cavitation was noticed but did not seem severe.\n"
        "The pump shaft vibration appears to be causing the motor to vibrate as well.\n"
        "Pump had noise of cavitation which became faint after OPS bled off the air. Low flow conditions most likely causing cavit-\n"
        "ation.\n"
        "The pump shaft deflection is causing the safety cage to rattle.\n"
        "The Pump is not experiencing enough flow for the pumps to keep the check valves open during test.\n"
        "Pump shaft made noise.\n"
        "Vibration seems like it is coming from the pump shaft.\n"
        "Visible pump shaft deflection in operation.\n"
        "Pump bearings appear in acceptable condition.\n"
        "Pump made noises - not enough to affect performance.\n"
        "Pump shaft has a slight deflection.\n"
        "Prfr chann calib.\n"
)

First do text preprocessing

[6]:
preprocessorList = ['hyphenated_words',
                    'whitespace',
                    'numerize']
preprocessorOptions = {}
preprocess = Preprocessing(preprocessorList, preprocessorOptions)
post = preprocess(text)

Autocorrect: Setup

[7]:
checker = SpellChecker(checker='autocorrect')

Autocorrect: Find acronyms or unexpected misspelled words

[8]:
checker.getMisspelledWords(post)
[8]:
['1',
 'laek',
 '1A',
 '1A',
 'presure',
 '1A',
 '1A',
 'druing',
 'Rupture',
 'Rupture',
 'revieled',
 'exessive',
 '1A',
 '1A',
 '3',
 '4',
 '1A',
 'Cavitation',
 'Prfr',
 'chann',
 'calib']
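
One way to triage a flagged list like this before deciding which words to add to the dictionary is to deduplicate it and set aside tokens that contain digits, since those are more likely equipment tags or numbers than misspellings. The sketch below uses only the standard library and copies the list from the output above; the digit heuristic is an assumption of this example, not part of SpellChecker.

```python
from collections import Counter

# Flagged words copied from the getMisspelledWords output above.
flagged = ['1', 'laek', '1A', '1A', 'presure', '1A', '1A', 'druing',
           'Rupture', 'Rupture', 'revieled', 'exessive', '1A', '1A',
           '3', '4', '1A', 'Cavitation', 'Prfr', 'chann', 'calib']

counts = Counter(flagged)

# Heuristic (assumption): tokens containing a digit are identifiers or numbers.
identifiers = sorted(w for w in counts if any(c.isdigit() for c in w))

# Everything else needs a human decision: typo, abbreviation, or valid word.
candidates = sorted(w for w in counts if w not in identifiers)

print("identifiers:", identifiers)
print("candidates:", candidates)
print("most frequent:", counts.most_common(1))
```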

Autocorrect: Add any additional words to dictionary

[9]:
words = ['OPS', 'RCP']
checker.addWordsToDictionary(words)

Autocorrect: Get automatically corrected text

[10]:
corrected = checker.correct(post)
print(corrected)
1 lack was noticed from the RCP pump 1A.
A light was unplugged.
RCP pump 1A pressure gauge was found not operating.
RCP pump 1A pressure gauge was found inoperative.
RCP pump 1A had signs of past leakage.
The Pump is not experiencing enough flow during test.
Slight Vibrations is noticed - likely from pump shaft deflection.
Pump flow meter was not responding.
Rupture of pump bearings caused pump shaft degradation.
Rupture of pump bearings caused pump shaft degradation and consequent flow reduction.
Power supply has been found burnout.
Pump test failed due to power supply failure.
Pump inspection reviewed excessive impeller degradation.
Pump inspection revealed excessive impeller degradation likely due to cavitation.
Oil puddle was found in proximity of RCP pump 1A.
Anomalous vibrations were observed for RCP pump 1A.
3 cracks on pump shaft were observed; they could have caused pump failure within 4 days.
RCP pump 1A was cavitating and vibrating to some degree during test.
This is most likely due to low flow conditions rather than mechanical issues.
Cavitation was noticed but did not seem severe.
The pump shaft vibration appears to be causing the motor to vibrate as well.
Pump had noise of cavitation which became faint after OPS bled off the air. Low flow conditions most likely causing cavitation.
The pump shaft deflection is causing the safety cage to rattle.
The Pump is not experiencing enough flow for the pumps to keep the check valves open during test.
Pump shaft made noise.
Vibration seems like it is coming from the pump shaft.
Visible pump shaft deflection in operation.
Pump bearings appear in acceptable condition.
Pump made noises - not enough to affect performance.
Pump shaft has a slight deflection.
Pfr chain club.

PySpellChecker

[11]:
from dackar.text_processing.Preprocessing import SpellChecker
checker = SpellChecker(checker='pyspellchecker')
[12]:
checker.getMisspelledWords(post)
[12]:
{'1a',
 'calib',
 'chann',
 'druing',
 'exessive',
 'laek',
 'presure',
 'prfr',
 'rcp',
 'revieled'}
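
The two checkers return different containers (a list with duplicates and original casing versus a lowercased set), so a case-insensitive set comparison is a simple way to see where they disagree. This sketch copies both outputs shown in this notebook and uses only the standard library.

```python
# Output of the autocorrect checker (list, original casing, duplicates).
autocorrect_flagged = ['1', 'laek', '1A', '1A', 'presure', '1A', '1A', 'druing',
                       'Rupture', 'Rupture', 'revieled', 'exessive', '1A', '1A',
                       '3', '4', '1A', 'Cavitation', 'Prfr', 'chann', 'calib']

# Output of the pyspellchecker checker (set, lowercased).
pyspell_flagged = {'1a', 'calib', 'chann', 'druing', 'exessive',
                   'laek', 'presure', 'prfr', 'rcp', 'revieled'}

# Normalize casing so the two results are comparable.
auto_set = {w.lower() for w in autocorrect_flagged}

both = sorted(auto_set & pyspell_flagged)
only_autocorrect = sorted(auto_set - pyspell_flagged)
only_pyspell = sorted(pyspell_flagged - auto_set)

print("both:", both)
print("only autocorrect:", only_autocorrect)
print("only pyspellchecker:", only_pyspell)
```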
[13]:
words = ['OPS', 'RCP']
checker.addWordsToDictionary(words)
[14]:
corrected = checker.correct(post)
print(corrected)
1 lack was noticed from the RCP pump 1A.
A light was unplugged.
RCP pump 1A pressure gauge was found not operating.
RCP pump 1A pressure gauge was found inoperative.
RCP pump 1A had signs of past leakage.
The Pump is not experiencing enough flow during test.
Slight Vibrations is noticed - likely from pump shaft deflection.
Pump flow meter was not responding.
Rupture of pump bearings caused pump shaft degradation.
Rupture of pump bearings caused pump shaft degradation and consequent flow reduction.
Power supply has been found burnout.
Pump test failed due to power supply failure.
Pump inspection reviewed excessive impeller degradation.
Pump inspection revealed excessive impeller degradation likely due to cavitation.
Oil puddle was found in proximity of RCP pump 1A.
Anomalous vibrations were observed for RCP pump 1A.
3 cracks on pump shaft were observed; they could have caused pump failure within 4 days.
RCP pump 1A was cavitating and vibrating to some degree during test.
This is most likely due to low flow conditions rather than mechanical issues.
Cavitation was noticed but did not seem severe.
The pump shaft vibration appears to be causing the motor to vibrate as well.
Pump had noise of cavitation which became faint after OPS bled off the air. Low flow conditions most likely causing cavitation.
The pump shaft deflection is causing the safety cage to rattle.
The Pump is not experiencing enough flow for the pumps to keep the check valves open during test.
Pump shaft made noise.
Vibration seems like it is coming from the pump shaft.
Visible pump shaft deflection in operation.
Pump bearings appear in acceptable condition.
Pump made noises - not enough to affect performance.
Pump shaft has a slight deflection.
poor chain calif.

ContextualSpellCheck: Setup

[15]:
# checker = SpellChecker(checker='ContextualSpellCheck')

ContextualSpellCheck: Find acronyms or unexpected misspelled words

[16]:
# checker.getMisspelledWords(post)

ContextualSpellCheck: Add any additional words to dictionary

[17]:
# words = ['RCP', 'OPS', 'consequent', '1A', 'unplugged']
# checker.addWordsToDictionary(words)

ContextualSpellCheck: Get automatically corrected text

[18]:
# corrected = checker.correct(post)
# print(corrected)

Time Autocorrect workflow

[19]:
tic = time.time()
preprocessorList = ['hyphenated_words',
                    'whitespace',
                    'numerize']
preprocessorOptions = {}
preprocess = Preprocessing(preprocessorList, preprocessorOptions)
post = preprocess(text)
checker = SpellChecker(checker='autocorrect')
words = ['OPS', 'RCP']
checker.addWordsToDictionary(words)
corrected = checker.correct(post)
print(f'autocorrect time: {time.time() - tic} s')
print('===============================================================')
print(corrected)
autocorrect time: 0.049269914627075195 s
===============================================================
1 lack was noticed from the RCP pump 1A.
A light was unplugged.
RCP pump 1A pressure gauge was found not operating.
RCP pump 1A pressure gauge was found inoperative.
RCP pump 1A had signs of past leakage.
The Pump is not experiencing enough flow during test.
Slight Vibrations is noticed - likely from pump shaft deflection.
Pump flow meter was not responding.
Rupture of pump bearings caused pump shaft degradation.
Rupture of pump bearings caused pump shaft degradation and consequent flow reduction.
Power supply has been found burnout.
Pump test failed due to power supply failure.
Pump inspection reviewed excessive impeller degradation.
Pump inspection revealed excessive impeller degradation likely due to cavitation.
Oil puddle was found in proximity of RCP pump 1A.
Anomalous vibrations were observed for RCP pump 1A.
3 cracks on pump shaft were observed; they could have caused pump failure within 4 days.
RCP pump 1A was cavitating and vibrating to some degree during test.
This is most likely due to low flow conditions rather than mechanical issues.
Cavitation was noticed but did not seem severe.
The pump shaft vibration appears to be causing the motor to vibrate as well.
Pump had noise of cavitation which became faint after OPS bled off the air. Low flow conditions most likely causing cavitation.
The pump shaft deflection is causing the safety cage to rattle.
The Pump is not experiencing enough flow for the pumps to keep the check valves open during test.
Pump shaft made noise.
Vibration seems like it is coming from the pump shaft.
Visible pump shaft deflection in operation.
Pump bearings appear in acceptable condition.
Pump made noises - not enough to affect performance.
Pump shaft has a slight deflection.
Pfr chain club.
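
For short intervals like the one measured above, `time.perf_counter` is generally a better clock than `time.time`, because it is monotonic and has higher resolution. The sketch below wraps it in a small context manager; `timed` is a hypothetical helper, and the workload is a stand-in for the preprocessing and spell-checking calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    # Record the elapsed wall-clock time of the enclosed block under `label`.
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("stand-in workload", timings):
    # Placeholder for the preprocess(...) and checker.correct(...) calls.
    cleaned = " ".join("A  leak   was noticed".split())

for label, seconds in timings.items():
    print(f"{label} time: {seconds:.6f} s")
```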

Time ContextualSpellCheck workflow

[20]:
# tic = time.time()
# preprocessorList = ['hyphenated_words',
#                     'whitespace',
#                     'numerize']
# preprocessorOptions = {}
# preprocess = Preprocessing(preprocessorList, preprocessorOptions)
# post = preprocess(text)
# checker = SpellChecker(checker='ContextualSpellCheck')
# words = ['RCP', 'OPS', 'consequent', '1A']
# checker.addWordsToDictionary(words)
# corrected = checker.correct(post)
# print(f'ContextualSpellCheck time: {time.time() - tic} s')
# print('==============================================================')
# print(corrected)
[21]:
post = "Prfr chann calib of chan"
checker = SpellChecker(checker='autocorrect')
checker.getMisspelledWords(post)
[21]:
['Prfr', 'chann', 'calib']