src.dackar.utils.nlp.nlp_utils
==============================

.. py:module:: src.dackar.utils.nlp.nlp_utils

.. autoapi-nested-parse::

   Created on March, 2022

   @author: wangc, mandd

Attributes
----------

.. autoapisummary::

   src.dackar.utils.nlp.nlp_utils.logger

Functions
---------

.. autoapisummary::

   src.dackar.utils.nlp.nlp_utils.displayNER
   src.dackar.utils.nlp.nlp_utils.resetPipeline
   src.dackar.utils.nlp.nlp_utils.printDepTree
   src.dackar.utils.nlp.nlp_utils.plotDAG
   src.dackar.utils.nlp.nlp_utils.extractLemma
   src.dackar.utils.nlp.nlp_utils.generatePattern
   src.dackar.utils.nlp.nlp_utils.generatePatternList
   src.dackar.utils.nlp.nlp_utils.extendEnt
   src.dackar.utils.nlp.nlp_utils.customTokenizer

Module Contents
---------------

.. py:data:: logger

.. py:function:: displayNER(doc, includePunct=False)

   Generate a data frame for visualization of a spaCy doc with custom attributes.

   :param doc: spacy.tokens.doc.Doc, the processed document using nlp pipelines
   :param includePunct: bool, True if punctuation is included
   :returns: pandas.DataFrame, data frame containing attributes of tokens
   :rtype: df

.. py:function:: resetPipeline(nlp, pipes)

   Remove all custom pipes and add new pipes.

   :param nlp: spacy.Language object, contains all components and data needed to process text
   :param pipes: list, list of pipes that will be added to the nlp pipeline
   :returns: spacy.Language object, contains updated components and data needed to process text
   :rtype: nlp

.. py:function:: printDepTree(doc, skipPunct=True)

   Utility function to pretty-print the dependency tree.

   :param doc: spacy.tokens.doc.Doc, the processed document using nlp pipelines
   :param skipPunct: bool, True if punctuation is skipped
   :returns: None

.. py:function:: plotDAG(edges, colors='k')

   :param edges: list of tuples, [(subj, conj), (..,..)] or [(subj, conj, {"color":"blue"}), (..,..)]
   :param colors: str or list, list of colors

.. py:function:: extractLemma(var, nlp)

   Lemmatize the variable list.

   :param var: str, string
   :param nlp: object, preloaded nlp model
   :returns: list, list of lemmatized variables
   :rtype: lemVar

.. py:function:: generatePattern(form, label, id, attr='LOWER')

   Generate an entity pattern.

   :param form: str or list, the given str or list of lemmas that will be used to generate the pattern
   :param label: str, the label name for the pattern
   :param id: str, the id name for the pattern
   :param attr: str, attribute used for the pattern, either "LOWER" or "LEMMA"
   :returns: dict, pattern that will be used by the entity matcher
   :rtype: pattern

.. py:function:: generatePatternList(entList, label, id, nlp, attr='LOWER')

   Generate a list of entity patterns.

   :param entList: list, list of entities
   :param label: str, the label name for the pattern
   :param id: str, the id name for the pattern
   :param nlp: spacy.Language object, preloaded nlp model
   :param attr: str, attribute used for the pattern, either "LOWER" or "LEMMA"
   :returns: ptnList, list, list of patterns that will be used by the entity matcher

.. py:function:: extendEnt(matcher, doc, i, matches)

   Extend the doc's entities.

   :param matcher: spacy.Matcher, the spacy matcher instance
   :param doc: spacy.tokens.doc.Doc, the processed document using nlp pipelines
   :param i: int, index of the current match (matches[i])
   :param matches: List[Tuple[int, int, int]], a list of (match_id, start, end) tuples describing the matches; a match tuple describes a span doc[start:end]

.. py:function:: customTokenizer(nlp)

   Custom tokenizer that keeps hyphens between letters and digits.

   When the default tokenizer is applied, words containing hyphens are split into multiple tokens; this function can be used to avoid splitting such words when hyphens are present.

   :param nlp: spacy nlp model
   :type nlp: spacy nlp model
   :returns: nlp with custom tokenizer
   :rtype: nlp
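To illustrate the pattern-generation workflow documented above, the sketch below re-implements the documented behavior of ``generatePattern`` and ``generatePatternList`` in plain Python, assuming the returned dicts follow the layout consumed by spaCy's EntityRuler (``label``/``pattern``/``id`` keys with per-token ``LOWER`` or ``LEMMA`` attributes). The exact dict layout and the helper names are assumptions, not DACKAR's actual implementation.

.. code-block:: python

   # Hypothetical sketch of generatePattern / generatePatternList,
   # assuming EntityRuler-style pattern dicts (layout is an assumption).

   def generate_pattern(form, label, pid, attr="LOWER"):
       """Build one entity-pattern dict from a string or list of lemmas."""
       if isinstance(form, str):
           tokens = form.lower().split()
       else:
           tokens = list(form)
       # One {attr: token} dict per token, e.g. {"LOWER": "pump"}
       return {"label": label, "pattern": [{attr: tok} for tok in tokens], "id": pid}

   def generate_pattern_list(ent_list, label, pid, attr="LOWER"):
       """Build a list of entity-pattern dicts, one per entity."""
       return [generate_pattern(ent, label, pid, attr=attr) for ent in ent_list]

   patterns = generate_pattern_list(["centrifugal pump", "valve"], "comp", "ssc-1")

With spaCy installed, a list like ``patterns`` would typically be handed to an ``entity_ruler`` pipe via ``ruler.add_patterns(patterns)``; the functions documented in this module wrap that step for DACKAR's pipelines.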