src.dackar.utils.nlp.nlp_utils¶
Created on March, 2022
@author: wangc, mandd
Attributes¶
Functions¶
- displayNER – Generate data frame for visualization of spaCy doc with custom attributes.
- resetPipeline – Remove all custom pipes and add new pipes.
- printDepTree – Utility function to pretty print the dependency tree.
- plotDAG – Plot a graph from a list of edges.
- extractLemma – Lemmatize the variable list.
- generatePattern – Generate entity pattern.
- generatePatternList – Generate a list of entity patterns.
- extendEnt – Extend the doc's entity.
- customTokenizer – Custom tokenizer to keep hyphens between letters and digits.
Module Contents¶
- src.dackar.utils.nlp.nlp_utils.displayNER(doc, includePunct=False)[source]¶
Generate data frame for visualization of spaCy doc with custom attributes.
- Parameters:
doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines
includePunct – bool, True if punctuation is included
- Returns:
pandas.DataFrame, data frame containing attributes of tokens
- Return type:
df
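A minimal sketch of the kind of frame this returns, using pandas and lightweight stand-ins for spaCy tokens so it runs without a model. The name `displayNERSketch` and the chosen columns are illustrative assumptions, not the actual implementation:

```python
import pandas as pd
from types import SimpleNamespace

# Hypothetical stand-ins for spaCy tokens (a real call would pass nlp("...") output).
tokens = [
    SimpleNamespace(text="Pump", lemma_="pump", pos_="NOUN", is_punct=False),
    SimpleNamespace(text="failed", lemma_="fail", pos_="VERB", is_punct=False),
    SimpleNamespace(text=".", lemma_=".", pos_="PUNCT", is_punct=True),
]

def displayNERSketch(tokens, includePunct=False):
    """Collect per-token attributes into a DataFrame, mirroring displayNER's shape."""
    rows = [
        {"text": t.text, "lemma": t.lemma_, "pos": t.pos_}
        for t in tokens
        if includePunct or not t.is_punct
    ]
    return pd.DataFrame(rows)

df = displayNERSketch(tokens)
print(df)
```

With `includePunct=False` the final period is dropped, so only two rows remain.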
- src.dackar.utils.nlp.nlp_utils.resetPipeline(nlp, pipes)[source]¶
Remove all custom pipes and add new pipes
- Parameters:
nlp – spacy.Language object, contains all components and data needed to process text
pipes – list, list of pipes that will be added to nlp pipeline
- Returns:
spacy.Language object, contains updated components and data needed to process text
- Return type:
nlp
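The reset pattern can be sketched against a blank English pipeline, which needs no model download. `resetPipelineSketch` is a hypothetical re-implementation; the real function presumably handles DACKAR's custom components:

```python
import spacy

# Blank pipeline with one pipe already attached.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def resetPipelineSketch(nlp, pipes):
    """Remove every existing pipe, then add the requested ones in order."""
    for name in list(nlp.pipe_names):
        nlp.remove_pipe(name)
    for name in pipes:
        nlp.add_pipe(name)
    return nlp

nlp = resetPipelineSketch(nlp, ["sentencizer", "entity_ruler"])
print(nlp.pipe_names)
```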
- src.dackar.utils.nlp.nlp_utils.printDepTree(doc, skipPunct=True)[source]¶
Utility function to pretty print the dependency tree.
- Parameters:
doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines
skipPunct – bool, True to skip punctuation
- Returns:
None
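The underlying recursion can be sketched without spaCy, using a toy token class. `Tok` and `depTreeLines` are illustrative stand-ins for spaCy tokens and the printing logic, not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token: text, dependency label, children."""
    text: str
    dep_: str
    children: list = field(default_factory=list)

def depTreeLines(node, level=0, skipPunct=True):
    """Collect the pretty-printed dependency tree as lines indented by depth."""
    if skipPunct and node.dep_ == "punct":
        return []
    lines = ["    " * level + f"{node.text} ({node.dep_})"]
    for child in node.children:
        lines.extend(depTreeLines(child, level + 1, skipPunct))
    return lines

root = Tok("failed", "ROOT", [Tok("pump", "nsubj"), Tok(".", "punct")])
print("\n".join(depTreeLines(root)))
```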
- src.dackar.utils.nlp.nlp_utils.plotDAG(edges, colors='k')[source]¶
- Parameters:
edges – list of tuples, [(subj, conj), (..,..)] or [(subj, conj, {"color": "blue"}), (..,..)]
colors – str or list, list of colors
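The edge format in the signature matches networkx's `add_edges_from` (plain `(u, v)` tuples or `(u, v, attr_dict)` triples), so plotDAG plausibly builds a graph along these lines. This is an assumption; only the graph construction is shown, not the drawing call:

```python
import networkx as nx

# Edges in the documented format: bare pairs or pairs with an attribute dict.
edges = [("pump", "failed"), ("failed", "leak", {"color": "blue"})]

G = nx.DiGraph()
G.add_edges_from(edges)
print(sorted(G.edges()))
```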
- src.dackar.utils.nlp.nlp_utils.extractLemma(var, nlp)[source]¶
Lemmatize the variable list
- Parameters:
var – str, the variable string to lemmatize
nlp – object, preloaded nlp model
- Returns:
list, list of lemmatized variables
- Return type:
lemVar
- src.dackar.utils.nlp.nlp_utils.generatePattern(form, label, id, attr='LOWER')[source]¶
Generate entity pattern
- Parameters:
form – str or list, the given str or list of lemmas that will be used to generate pattern
label – str, the label name for the pattern
id – str, the id name for the pattern
attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”
- Returns:
dict, pattern that will be used by the entity matcher
- Return type:
pattern
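A plausible sketch of the returned structure, following spaCy's documented EntityRuler pattern format (`{"label", "pattern", "id"}`). `generatePatternSketch` is a hypothetical re-implementation and may differ from the real function in detail:

```python
def generatePatternSketch(form, label, id, attr="LOWER"):
    """Build a spaCy EntityRuler-style pattern dict from a string or token list."""
    if isinstance(form, str):
        form = form.split()
    # One token-matching dict per token, keyed by the chosen attribute.
    return {"label": label, "pattern": [{attr: tok.lower()} for tok in form], "id": id}

pattern = generatePatternSketch("centrifugal pump", "comp", "cp-01")
print(pattern)
```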
- src.dackar.utils.nlp.nlp_utils.generatePatternList(entList, label, id, nlp, attr='LOWER')[source]¶
Generate a list of entity patterns
- Parameters:
entList – list, list of entities
label – str, the label name for the pattern
id – str, the id name for the pattern
attr – str, attribute used for the pattern, either “LOWER” or “LEMMA”
- Returns:
ptnList – list, list of patterns that will be used by the entity matcher
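Patterns of this shape can be exercised end to end with spaCy's EntityRuler on a blank pipeline. The pattern dicts below are hand-written in the format generatePatternList presumably emits:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hand-written patterns in spaCy's documented {"label", "pattern", "id"} format.
patterns = [
    {"label": "comp", "pattern": [{"LOWER": "pump"}], "id": "pump"},
    {"label": "comp", "pattern": [{"LOWER": "valve"}], "id": "valve"},
]
ruler.add_patterns(patterns)

doc = nlp("The pump and the valve failed.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])
```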
- src.dackar.utils.nlp.nlp_utils.extendEnt(matcher, doc, i, matches)[source]¶
Extend the doc’s entity
- Parameters:
matcher – spacy.Matcher, the spacy matcher instance
doc – spacy.tokens.doc.Doc, the processed document using nlp pipelines
i – int, index of the current match (matches[i])
matches – List[Tuple[int, int, int]], a list of (match_id, start, end) tuples describing the matches; a match tuple describes the span doc[start:end]
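An on_match callback with this signature can be sketched with spaCy's Matcher. `extendEntSketch` is a hypothetical version; the real extendEnt may handle overlapping spans differently:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

def extendEntSketch(matcher, doc, i, matches):
    """on_match callback: add the i-th match to doc.ents as a new entity span."""
    match_id, start, end = matches[i]
    ent = Span(doc, start, end, label=match_id)
    doc.ents = list(doc.ents) + [ent]

matcher.add("comp", [[{"LOWER": "pump"}]], on_match=extendEntSketch)
doc = nlp("The pump failed.")
matcher(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```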
- src.dackar.utils.nlp.nlp_utils.customTokenizer(nlp)[source]¶
Custom tokenizer that keeps hyphens between letters and digits. By default, the tokenizer splits words with hyphens into multiple tokens; this function can be used to avoid that split when hyphens are present.
- Parameters:
nlp (spacy nlp model) – spacy nlp model
- Returns:
nlp with custom tokenizer
- Return type:
nlp
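A common spaCy recipe for this behavior, assumed close to what customTokenizer does: filter the hyphen rules out of the default infix patterns, then rebuild the infix matcher so hyphenated terms stay whole:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Drop the default infix rules that split tokens on hyphens (spaCy's hyphen
# character class compiles to a pattern containing "-|–|—"), keep the rest.
infixes = [rule for rule in nlp.Defaults.infixes if "-|–|—" not in rule]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("Pump-1 and high-pressure")])
```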