src.dackar.utils.nlp.nlp_utils

Created on March, 2022

@author: wangc, mandd

Attributes

logger

Functions

displayNER(doc[, includePunct])

Generate data frame for visualization of spaCy doc with custom attributes.

resetPipeline(nlp, pipes)

Remove all custom pipes and add new pipes.

printDepTree(doc[, skipPunct])

Utility function to pretty print the dependency tree.

plotDAG(edges[, colors])

Plot the directed acyclic graph defined by the given edges.

extractLemma(var, nlp)

Lemmatize the variable list

generatePattern(form, label, id[, attr])

Generate entity pattern

generatePatternList(entList, label, id, nlp[, attr])

Generate a list of entity patterns

extendEnt(matcher, doc, i, matches)

Extend the doc's entity

customTokenizer(nlp)

Custom tokenizer to keep hyphens between letters and digits

Module Contents

src.dackar.utils.nlp.nlp_utils.logger[source]
src.dackar.utils.nlp.nlp_utils.displayNER(doc, includePunct=False)[source]

Generate data frame for visualization of spaCy doc with custom attributes.

Parameters:
  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • includePunct – bool, True if punctuation is included

Returns:

pandas.DataFrame, data frame contains attributes of tokens

Return type:

df
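
The shape of the returned frame can be sketched without running an nlp pipeline; the column names and the stub token rows below are illustrative assumptions, since the real function reads these attributes off a processed spaCy Doc:

```python
import pandas as pd

# Stand-in token attributes; displayNER reads these off a spaCy Doc.
# The column names here are illustrative, not the function's actual schema.
tokens = [
    {"text": "Pump", "lemma": "pump", "pos": "PROPN", "ent_type": "comp"},
    {"text": "failed", "lemma": "fail", "pos": "VERB", "ent_type": ""},
    {"text": ".", "lemma": ".", "pos": "PUNCT", "ent_type": ""},
]
# With includePunct=False, punctuation rows are dropped before building the frame.
rows = [t for t in tokens if t["pos"] != "PUNCT"]
df = pd.DataFrame(rows)
print(df)
```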

src.dackar.utils.nlp.nlp_utils.resetPipeline(nlp, pipes)[source]

Remove all custom pipes and add new pipes.

Parameters:
  • nlp – spacy.Language object, contains all components and data needed to process text

  • pipes – list, list of pipes that will be added to nlp pipeline

Returns:

spacy.Language object, contains updated components and data needed to process text

Return type:

nlp
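
A minimal sketch of the reset logic, assuming spaCy is installed (a blank English pipeline is used here so no model download is needed; the `sentencizer` pipe stands in for a custom pipe):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # stand-in for a stale custom pipe

# resetPipeline-style logic: drop every current pipe, then add the new ones
newPipes = ["sentencizer"]
for name in list(nlp.pipe_names):
    nlp.remove_pipe(name)
for name in newPipes:
    nlp.add_pipe(name)

print(nlp.pipe_names)
```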

src.dackar.utils.nlp.nlp_utils.printDepTree(doc, skipPunct=True)[source]

Utility function to pretty print the dependency tree.

Parameters:
  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • skipPunct – bool, True to skip punctuation

Returns:

None
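
The pretty-printing can be sketched without spaCy; the recursion below mirrors walking `token.children` in a parsed Doc, with the dict-based node layout being a stand-in assumption:

```python
# Hypothetical tree nodes standing in for spaCy tokens and token.children.
def print_tree(node, depth=0, lines=None):
    lines = [] if lines is None else lines
    lines.append("    " * depth + node["text"])  # indent by tree depth
    for child in node.get("children", []):
        print_tree(child, depth + 1, lines)
    return lines

tree = {"text": "failed", "children": [
    {"text": "pump", "children": [{"text": "the"}]},
    {"text": "yesterday"},
]}
print("\n".join(print_tree(tree)))
```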

src.dackar.utils.nlp.nlp_utils.plotDAG(edges, colors='k')[source]

Plot the directed acyclic graph defined by the given edges.

Parameters:
  • edges – list of tuples, [(subj, conj), (..,..)] or [(subj, conj, {"color": "blue"}), (..,..)]

  • colors – str or list, list of colors
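
The edge-list format maps directly onto a networkx directed graph; this sketch assumes plotDAG builds the graph with networkx (only the construction is shown, drawing is omitted):

```python
import networkx as nx

# Both accepted edge formats: plain pairs, or pairs with an attribute dict
edges = [("pump", "failure"), ("failure", "leak", {"color": "blue"})]
G = nx.DiGraph()
G.add_edges_from(edges)
print(sorted(G.nodes))
# Drawing (omitted here) would use e.g. nx.draw_networkx(G)
```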

src.dackar.utils.nlp.nlp_utils.extractLemma(var, nlp)[source]

Lemmatize the variable list

Parameters:
  • var – str, string

  • nlp – object, preloaded nlp model

Returns:

list, list of lemmatized variables

Return type:

lemVar

src.dackar.utils.nlp.nlp_utils.generatePattern(form, label, id, attr='LOWER')[source]

Generate entity pattern

Parameters:
  • form – str or list, the given str or list of lemmas that will be used to generate pattern

  • label – str, the label name for the pattern

  • id – str, the id name for the pattern

  • attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”

Returns:

dict, pattern that will be used by the entity matcher

Return type:

pattern
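
The returned dict follows the pattern schema expected by spaCy's EntityRuler. A pure-Python sketch of what generatePattern plausibly produces (the actual implementation may differ in detail):

```python
def generate_pattern(form, label, ent_id, attr="LOWER"):
    # Split a phrase into per-token constraints keyed by the chosen attribute;
    # lowercasing matches the "LOWER" case, lemmas are typically lowercase too.
    words = form if isinstance(form, list) else form.split()
    token_patterns = [{attr: w.lower()} for w in words]
    return {"label": label, "pattern": token_patterns, "id": ent_id}

print(generate_pattern("Pump 1A", "comp", "pump_1a"))
```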

src.dackar.utils.nlp.nlp_utils.generatePatternList(entList, label, id, nlp, attr='LOWER')[source]

Generate a list of entity patterns

Parameters:
  • entList – list, list of entities

  • label – str, the label name for the pattern

  • id – str, the id name for the pattern

  • attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”

Returns:

list, list of patterns that will be used by the entity matcher

Return type:

ptnList
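
A pattern list in this shape can be fed to spaCy's EntityRuler. The hand-written pattern below is an assumption about what generatePatternList emits for a single entity:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Hypothetical output of generatePatternList(["pump 1A"], "comp", "pump_1a", nlp)
ptnList = [{"label": "comp",
            "pattern": [{"LOWER": "pump"}, {"LOWER": "1a"}],
            "id": "pump_1a"}]
ruler.add_patterns(ptnList)

doc = nlp("The pump 1A is leaking.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])
```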

src.dackar.utils.nlp.nlp_utils.extendEnt(matcher, doc, i, matches)[source]

Extend the doc’s entity

Parameters:
  • matcher – spacy.Matcher, the spacy matcher instance

  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • i – int, index of the current match (matches[i])

  • matches – List[Tuple[int, int, int]], a list of (match_id, start, end) tuples describing the matches; a match tuple describes the span doc[start:end]
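
This signature is spaCy's standard on_match callback. A sketch of such a callback that promotes the matched span to a document entity, assuming that is what extendEnt does:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

def extend_ent(matcher, doc, i, matches):
    # Promote the i-th match (match_id, start, end) to an entity on the doc.
    match_id, start, end = matches[i]
    span = Span(doc, start, end, label=match_id)
    doc.ents = list(doc.ents) + [span]

matcher.add("comp", [[{"LOWER": "pump"}]], on_match=extend_ent)
doc = nlp("The pump failed.")
matcher(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```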

src.dackar.utils.nlp.nlp_utils.customTokenizer(nlp)[source]

Custom tokenizer to keep hyphens between letters and digits. By default the tokenizer splits hyphenated words into multiple tokens; this function modifies the tokenizer so that such words are kept as single tokens.

Parameters:

nlp – spacy.Language object, the pipeline whose tokenizer will be customized

Returns:

nlp with custom tokenizer

Return type:

nlp
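
One standard way to achieve this, adapted from spaCy's tokenizer documentation (the actual customTokenizer implementation may differ), is to rebuild the infix rules without the default hyphen splitter:

```python
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES
from spacy.lang.char_classes import LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # The default rule that splits on hyphens between letters is omitted:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp("mother-in-law and pump-1A")])
```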