src.dackar.utils.nlp.nlp_utils

Created on March, 2022

@author: wangc, mandd

Attributes

logger

Functions

displayNER(doc[, includePunct])

Generate data frame for visualization of spaCy doc with custom attributes.

resetPipeline(nlp, pipes)

Remove all custom pipes and add new pipes.

printDepTree(doc[, skipPunct])

Utility function to pretty print the dependency tree.

plotDAG(edges[, colors])

Plot the directed acyclic graph defined by the given edges.

extractLemma(var, nlp)

Lemmatize the variable list.

generatePattern(form, label, id[, attr])

Generate entity pattern

generatePatternList(entList, label, id, nlp[, attr])

Generate a list of entity patterns

extendEnt(matcher, doc, i, matches)

Extend the doc's entity

Module Contents

src.dackar.utils.nlp.nlp_utils.logger[source]
src.dackar.utils.nlp.nlp_utils.displayNER(doc, includePunct=False)[source]

Generate data frame for visualization of spaCy doc with custom attributes.

Parameters:
  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • includePunct – bool, True if punctuation is included

Returns:

pandas.DataFrame, data frame contains attributes of tokens

Return type:

df
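
The returned data frame is essentially one row per token. A minimal sketch of that row layout, using plain dicts as hypothetical stand-ins for spaCy tokens (displayNER itself operates on a real spacy.tokens.doc.Doc and returns a pandas.DataFrame):

```python
# Hypothetical stand-ins for spaCy token attributes.
tokens = [
    {"text": "pump", "lemma": "pump", "pos": "NOUN",
     "ent_type": "comp", "is_punct": False},
    {"text": ".", "lemma": ".", "pos": "PUNCT",
     "ent_type": "", "is_punct": True},
]

def display_ner_rows(tokens, include_punct=False):
    # One row per token; punctuation rows are dropped unless requested.
    return [t for t in tokens if include_punct or not t["is_punct"]]

rows = display_ner_rows(tokens)  # one row: the non-punctuation token
```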

src.dackar.utils.nlp.nlp_utils.resetPipeline(nlp, pipes)[source]

Remove all custom pipes and add new pipes.

Parameters:
  • nlp – spacy.Language object, contains all components and data needed to process text

  • pipes – list, list of pipes that will be added to nlp pipeline

Returns:

spacy.Language object, contains updated components and data needed to process text

Return type:

nlp
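
The pipe bookkeeping can be illustrated without spaCy. A sketch, assuming the default pipe names mirror a standard spaCy model such as en_core_web_sm (the names and helper below are illustrative, not the module's implementation):

```python
# Assumed default pipe names, mirroring a standard spaCy model.
DEFAULT_PIPES = {"tok2vec", "tagger", "parser", "attribute_ruler",
                 "lemmatizer", "ner"}

def reset_pipeline(current_pipes, new_pipes):
    # Keep only the default (non-custom) pipes, then append the new ones.
    kept = [p for p in current_pipes if p in DEFAULT_PIPES]
    return kept + list(new_pipes)

pipes = reset_pipeline(["tok2vec", "ner", "my_custom_pipe"],
                       ["entity_ruler"])
# custom pipe removed, requested pipe appended
```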

src.dackar.utils.nlp.nlp_utils.printDepTree(doc, skipPunct=True)[source]

Utility function to pretty print the dependency tree.

Parameters:
  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • skipPunct – bool, True to skip punctuation

Returns:

None
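
The pretty printing amounts to a depth-first walk that indents each token by its depth in the dependency tree. An illustrative sketch on a toy tree, with plain strings standing in for spaCy tokens (function and data are hypothetical):

```python
def print_dep_tree(token, children, depth=0, out=None):
    # Depth-first walk: indent each token by its depth in the tree.
    out = [] if out is None else out
    out.append("  " * depth + token)
    for child in children.get(token, []):
        print_dep_tree(child, children, depth + 1, out)
    return out

# Toy dependency tree for "crew installed the pump" (head -> children).
tree = {"installed": ["crew", "pump"], "pump": ["the"]}
lines = print_dep_tree("installed", tree)
```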

src.dackar.utils.nlp.nlp_utils.plotDAG(edges, colors='k')[source]

Plot the directed acyclic graph defined by the given edges.

Parameters:
  • edges – list of tuples, [(subj, conj), (..,..)] or [(subj, conj, {"color": "blue"}), (..,..)]

  • colors – str or list, list of colors
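
Both accepted edge forms can be normalized to one shape before plotting. A sketch of that normalization, assuming the default color is 'k' as in the signature above (the helper name is hypothetical):

```python
def normalize_edges(edges, default_color="k"):
    # Accept both (u, v) and (u, v, attrs) edge forms; every edge
    # comes back as (u, v, attrs) with a color filled in.
    out = []
    for edge in edges:
        if len(edge) == 3:
            u, v, attrs = edge
            attrs = dict(attrs)
            attrs.setdefault("color", default_color)
        else:
            u, v = edge
            attrs = {"color": default_color}
        out.append((u, v, attrs))
    return out

edges = normalize_edges([("pump", "impeller", {"color": "blue"}),
                         ("pump", "motor")])
```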

src.dackar.utils.nlp.nlp_utils.extractLemma(var, nlp)[source]

Lemmatize the variable list.

Parameters:
  • var – str, the variable string to lemmatize

  • nlp – object, preloaded nlp model

Returns:

list, list of lemmatized variables

Return type:

lemVar
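
The input/output shape can be illustrated with a toy lemma lookup standing in for the preloaded nlp model (the lemma table below is fabricated for illustration only; the real function delegates to spaCy):

```python
# Fabricated lemma table, standing in for the preloaded nlp model.
LEMMAS = {"pumps": "pump", "bearings": "bearing", "failed": "fail"}

def extract_lemma(var):
    # Tokenize on whitespace and map each token to its lemma.
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in var.split()]

lem_var = extract_lemma("Pumps bearings failed")
```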

src.dackar.utils.nlp.nlp_utils.generatePattern(form, label, id, attr='LOWER')[source]

Generate entity pattern

Parameters:
  • form – str or list, the given str or list of lemmas that will be used to generate pattern

  • label – str, the label name for the pattern

  • id – str, the id name for the pattern

  • attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”

Returns:

dict, pattern will be used by entity matcher

Return type:

pattern
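
A pattern in the spaCy EntityRuler style can be sketched as follows; this is an illustration of the likely structure under the parameters above, not the module's exact implementation:

```python
def generate_pattern(form, label, ent_id, attr="LOWER"):
    # Accept either a string or a list of lemmas, one token-spec per
    # word; attr selects the matching attribute ("LOWER" or "LEMMA").
    tokens = form if isinstance(form, list) else form.split()
    return {"label": label,
            "pattern": [{attr: t.lower()} for t in tokens],
            "id": ent_id}

pattern = generate_pattern("centrifugal pump", "comp", "pump_1")
```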

src.dackar.utils.nlp.nlp_utils.generatePatternList(entList, label, id, nlp, attr='LOWER')[source]

Generate a list of entity patterns

Parameters:
  • entList – list, list of entities

  • label – str, the label name for the pattern

  • id – str, the id name for the pattern

  • attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”

Returns:

list, list of patterns that will be used by the entity matcher

Return type:

ptnList
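
Conceptually this applies the single-pattern generation to each entity in turn (the real function also takes the nlp object, presumably to lemmatize entities when attr='LEMMA'). A self-contained sketch with a hypothetical helper name:

```python
def generate_pattern_list(ent_list, label, ent_id, attr="LOWER"):
    # Build one EntityRuler-style pattern per entity string.
    ptn_list = []
    for ent in ent_list:
        tokens = ent.split()
        ptn_list.append({"label": label,
                         "pattern": [{attr: t.lower()} for t in tokens],
                         "id": ent_id})
    return ptn_list

patterns = generate_pattern_list(["centrifugal pump", "shaft"],
                                 "comp", "comp_id")
```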

src.dackar.utils.nlp.nlp_utils.extendEnt(matcher, doc, i, matches)[source]

Extend the doc’s entity

Parameters:
  • matcher – spacy.Matcher, the spacy matcher instance

  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • i – int, index of the current match (matches[i])

  • matches – List[Tuple[int, int, int]], a list of (match_id, start, end) tuples describing the matches. A match tuple describes the span doc[start:end]
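
The on_match logic can be mimicked with plain tuples in place of spaCy spans: look up matches[i] and add its span to the entity list when it does not collide with an existing entity (a sketch of the idea, not the module's exact overlap handling):

```python
def extend_ent(ents, matches, i, label):
    # Mimic a spaCy on_match callback: take matches[i] and append its
    # (start, end) span unless it overlaps an existing entity.
    match_id, start, end = matches[i]
    if all(end <= s or start >= e for s, e, _ in ents):
        ents.append((start, end, label))
    return ents

ents = extend_ent([(0, 2, "comp")], [(1, 3, 5)], 0, "comp")
```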