src.dackar.utils.nlp.nlp_utils

Created on March, 2022

@author: wangc, mandd

Attributes

logger

Functions

displayNER(doc[, includePunct])

Generate data frame for visualization of spaCy doc with custom attributes.

resetPipeline(nlp, pipes)

Remove all custom pipes and add new pipes.

printDepTree(doc[, skipPunct])

Utility function to pretty print the dependency tree.

plotDAG(edges[, colors])

Plot the directed acyclic graph defined by the given edges.

extractLemma(var, nlp)

Lemmatize the variable list

generatePattern(form, label, id[, attr])

Generate entity pattern

generatePatternList(entList, label, id, nlp[, attr])

Generate a list of entity patterns

extendEnt(matcher, doc, i, matches)

Extend the doc's entity

customTokenizer(nlp)

Custom tokenizer to keep hyphens between letters and digits

Module Contents

src.dackar.utils.nlp.nlp_utils.logger[source]
src.dackar.utils.nlp.nlp_utils.displayNER(doc, includePunct=False)[source]

Generate data frame for visualization of spaCy doc with custom attributes.

Parameters:
  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • includePunct – bool, True if punctuation is included

Returns:

pandas.DataFrame, data frame contains attributes of tokens

Return type:

df
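
The shape of the returned frame can be sketched without running an nlp pipeline; the column names and the stub token rows below are illustrative assumptions, since the real function reads these attributes off a processed spaCy Doc:

```python
import pandas as pd

# Stand-in token attributes; displayNER reads these off a spaCy Doc.
# The column names here are illustrative, not the function's actual schema.
tokens = [
    {"text": "Pump", "lemma": "pump", "pos": "PROPN", "ent_type": "comp"},
    {"text": "failed", "lemma": "fail", "pos": "VERB", "ent_type": ""},
    {"text": ".", "lemma": ".", "pos": "PUNCT", "ent_type": ""},
]
# With includePunct=False, punctuation rows are dropped before building the frame.
rows = [t for t in tokens if t["pos"] != "PUNCT"]
df = pd.DataFrame(rows)
print(df)
```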

src.dackar.utils.nlp.nlp_utils.resetPipeline(nlp, pipes)[source]

Remove all custom pipes and add new pipes.

Parameters:
  • nlp – spacy.Language object, contains all components and data needed to process text

  • pipes – list, list of pipes that will be added to nlp pipeline

Returns:

spacy.Language object, contains updated components and data needed to process text

Return type:

nlp
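
A minimal sketch of the reset logic, assuming spaCy is installed (a blank English pipeline is used here so no model download is needed; the `sentencizer` pipe stands in for a custom pipe):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # stand-in for a stale custom pipe

# resetPipeline-style logic: drop every current pipe, then add the new ones
newPipes = ["sentencizer"]
for name in list(nlp.pipe_names):
    nlp.remove_pipe(name)
for name in newPipes:
    nlp.add_pipe(name)

print(nlp.pipe_names)
```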

src.dackar.utils.nlp.nlp_utils.printDepTree(doc, skipPunct=True)[source]

Utility function to pretty print the dependency tree.

Parameters:
  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • skipPunct – bool, True to skip punctuation

Returns:

None
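
The pretty-printing can be sketched without spaCy; the recursion below mirrors walking `token.children` in a parsed Doc, with the dict-based node layout being a stand-in assumption:

```python
# Hypothetical tree nodes standing in for spaCy tokens and token.children.
def print_tree(node, depth=0, lines=None):
    lines = [] if lines is None else lines
    lines.append("    " * depth + node["text"])  # indent by tree depth
    for child in node.get("children", []):
        print_tree(child, depth + 1, lines)
    return lines

tree = {"text": "failed", "children": [
    {"text": "pump", "children": [{"text": "the"}]},
    {"text": "yesterday"},
]}
print("\n".join(print_tree(tree)))
```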

src.dackar.utils.nlp.nlp_utils.plotDAG(edges, colors='k')[source]

Plot the directed acyclic graph defined by the given edges.

Parameters:
  • edges – list of tuples, [(subj, conj), (..,..)] or [(subj, conj, {"color": "blue"}), (..,..)]

  • colors – str or list, list of colors
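
The edge-list format maps directly onto a networkx directed graph; this sketch assumes plotDAG builds the graph with networkx (only the construction is shown, drawing is omitted):

```python
import networkx as nx

# Both accepted edge formats: plain pairs, or pairs with an attribute dict
edges = [("pump", "failure"), ("failure", "leak", {"color": "blue"})]
G = nx.DiGraph()
G.add_edges_from(edges)
print(sorted(G.nodes))
# Drawing (omitted here) would use e.g. nx.draw_networkx(G)
```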

src.dackar.utils.nlp.nlp_utils.extractLemma(var, nlp)[source]

Lemmatize the variable list

Parameters:
  • var – str, string

  • nlp – object, preloaded nlp model

Returns:

list, list of lemmatized variables

Return type:

lemVar

src.dackar.utils.nlp.nlp_utils.generatePattern(form, label, id, attr='LOWER')[source]

Generate entity pattern

Parameters:
  • form – str or list, the given str or list of lemmas that will be used to generate pattern

  • label – str, the label name for the pattern

  • id – str, the id name for the pattern

  • attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”

Returns:

dict, pattern that will be used by the entity matcher

Return type:

pattern
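
The returned dict follows the pattern schema expected by spaCy's EntityRuler. A pure-Python sketch of what generatePattern plausibly produces (the actual implementation may differ in detail):

```python
def generate_pattern(form, label, ent_id, attr="LOWER"):
    # Split a phrase into per-token constraints keyed by the chosen attribute;
    # lowercasing matches the "LOWER" case, lemmas are typically lowercase too.
    words = form if isinstance(form, list) else form.split()
    token_patterns = [{attr: w.lower()} for w in words]
    return {"label": label, "pattern": token_patterns, "id": ent_id}

print(generate_pattern("Pump 1A", "comp", "pump_1a"))
```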

src.dackar.utils.nlp.nlp_utils.generatePatternList(entList, label, id, nlp, attr='LOWER')[source]

Generate a list of entity patterns

Parameters:
  • entList – list, list of entities

  • label – str, the label name for the pattern

  • id – str, the id name for the pattern

  • attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”

Returns:

list, list of patterns that will be used by the entity matcher

Return type:

ptnList
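
A pattern list in this shape can be fed to spaCy's EntityRuler. The hand-written pattern below is an assumption about what generatePatternList emits for a single entity:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Hypothetical output of generatePatternList(["pump 1A"], "comp", "pump_1a", nlp)
ptnList = [{"label": "comp",
            "pattern": [{"LOWER": "pump"}, {"LOWER": "1a"}],
            "id": "pump_1a"}]
ruler.add_patterns(ptnList)

doc = nlp("The pump 1A is leaking.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])
```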

src.dackar.utils.nlp.nlp_utils.extendEnt(matcher, doc, i, matches)[source]

Extend the doc’s entity

Parameters:
  • matcher – spacy.Matcher, the spacy matcher instance

  • doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines

  • i – int, index of the current match (matches[i])

  • matches – List[Tuple[int, int, int]], a list of (match_id, start, end) tuples describing the matches; a match tuple describes the span doc[start:end]
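
This signature is spaCy's standard on_match callback. A sketch of such a callback that promotes the matched span to a document entity, assuming that is what extendEnt does:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

def extend_ent(matcher, doc, i, matches):
    # Promote the i-th match (match_id, start, end) to an entity on the doc.
    match_id, start, end = matches[i]
    span = Span(doc, start, end, label=match_id)
    doc.ents = list(doc.ents) + [span]

matcher.add("comp", [[{"LOWER": "pump"}]], on_match=extend_ent)
doc = nlp("The pump failed.")
matcher(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```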

src.dackar.utils.nlp.nlp_utils.customTokenizer(nlp)[source]

Custom tokenizer to keep hyphens between letters and digits. By default the tokenizer splits hyphenated words into multiple tokens; this function modifies the tokenizer so that such words are kept as single tokens.

Parameters:

nlp – spacy.Language object, the pipeline whose tokenizer will be customized

Returns:

nlp with custom tokenizer

Return type:

nlp
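
One standard way to achieve this, adapted from spaCy's tokenizer documentation (the actual customTokenizer implementation may differ), is to rebuild the infix rules without the default hyphen splitter:

```python
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES
from spacy.lang.char_classes import LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # The default rule that splits on hyphens between letters is omitted:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp("mother-in-law and pump-1A")])
```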