src.dackar.utils.nlp.nlp_utils
==============================

.. py:module:: src.dackar.utils.nlp.nlp_utils

.. autoapi-nested-parse::

   Created on March, 2022

   @author: wangc, mandd


Attributes
----------

.. autoapisummary::

   src.dackar.utils.nlp.nlp_utils.logger


Functions
---------

.. autoapisummary::

   src.dackar.utils.nlp.nlp_utils.displayNER
   src.dackar.utils.nlp.nlp_utils.resetPipeline
   src.dackar.utils.nlp.nlp_utils.printDepTree
   src.dackar.utils.nlp.nlp_utils.plotDAG
   src.dackar.utils.nlp.nlp_utils.extractLemma
   src.dackar.utils.nlp.nlp_utils.generatePattern
   src.dackar.utils.nlp.nlp_utils.generatePatternList
   src.dackar.utils.nlp.nlp_utils.extendEnt
   src.dackar.utils.nlp.nlp_utils.customTokenizer


Module Contents
---------------

.. py:data:: logger

.. py:function:: displayNER(doc, includePunct=False)

   Generate a data frame for visualization of a spaCy doc with custom attributes.

   :param doc: spacy.tokens.doc.Doc, the processed document using nlp pipelines
   :param includePunct: bool, True if punctuation is included
   :returns: pandas.DataFrame, data frame containing attributes of tokens
   :rtype: df

.. py:function:: resetPipeline(nlp, pipes)

   Remove all custom pipes and add the new pipes.

   :param nlp: spacy.Language object, contains all components and data needed to process text
   :param pipes: list, list of pipes that will be added to the nlp pipeline
   :returns: spacy.Language object, contains updated components and data needed to process text
   :rtype: nlp

.. py:function:: printDepTree(doc, skipPunct=True)

   Utility function to pretty print the dependency tree.

   :param doc: spacy.tokens.doc.Doc, the processed document using nlp pipelines
   :param skipPunct: bool, True if punctuation is skipped
   :returns: None

.. py:function:: plotDAG(edges, colors='k')

   Plot the directed graph defined by the given edges.

   :param edges: list of tuples, [(subj, conj), (..,..)] or [(subj, conj, {"color":"blue"}), (..,..)]
   :param colors: str or list, list of colors

.. py:function:: extractLemma(var, nlp)

   Lemmatize the given variable string.

   :param var: str, string
   :param nlp: object, preloaded nlp model
   :returns: list, list of lemmatized variables
   :rtype: lemVar

.. py:function:: generatePattern(form, label, id, attr='LOWER')

   Generate an entity pattern.

   :param form: str or list, the given str or list of lemmas that will be used to generate the pattern
   :param label: str, the label name for the pattern
   :param id: str, the id name for the pattern
   :param attr: str, attribute used for the pattern, either "LOWER" or "LEMMA"
   :returns: dict, pattern that will be used by the entity matcher
   :rtype: pattern

.. py:function:: generatePatternList(entList, label, id, nlp, attr='LOWER')

   Generate a list of entity patterns.

   :param entList: list, list of entities
   :param label: str, the label name for the pattern
   :param id: str, the id name for the pattern
   :param attr: str, attribute used for the pattern, either "LOWER" or "LEMMA"
   :returns: ptnList, list, list of patterns that will be used by the entity matcher

.. py:function:: extendEnt(matcher, doc, i, matches)

   Extend the doc's entity.

   :param matcher: spacy.Matcher, the spacy matcher instance
   :param doc: spacy.tokens.doc.Doc, the processed document using nlp pipelines
   :param i: int, index of the current match (matches[i])
   :param matches: List[Tuple[int, int, int]], a list of (match_id, start, end) tuples describing the matches; a match tuple describes a span doc[start:end]

.. py:function:: customTokenizer(nlp)

   Custom tokenizer that keeps hyphens between letters and digits.

   :param nlp: spacy nlp model
   :type nlp: spacy nlp model
   :returns: nlp with custom tokenizer
   :rtype: nlp
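
Example usage
-------------

The sketch below shows one way these helpers can be chained together. It is a minimal,
illustrative example only: the spaCy model name ``en_core_web_sm``, the entity label
``comp``, the id ``ssc``, the sample sentence, and the use of a spaCy ``EntityRuler`` to
consume the generated patterns are assumptions for illustration, not part of this module.

.. code-block:: python

   # Minimal usage sketch (illustrative; assumes spaCy and the "en_core_web_sm"
   # model are installed, and that the generated patterns are consumed by a
   # spaCy EntityRuler -- the label "comp", id "ssc", and sample text are
   # hypothetical).
   import spacy

   from src.dackar.utils.nlp.nlp_utils import (
       customTokenizer, displayNER, generatePatternList, printDepTree)

   nlp = spacy.load("en_core_web_sm")
   nlp = customTokenizer(nlp)  # keep hyphens between letters and digits

   # Build entity patterns for a small list of component names
   patterns = generatePatternList(
       ["centrifugal pump", "motor"], label="comp", id="ssc", nlp=nlp, attr="LOWER")

   # Feed the patterns to an EntityRuler placed before the statistical NER
   ruler = nlp.add_pipe("entity_ruler", before="ner")
   ruler.add_patterns(patterns)

   doc = nlp("The centrifugal pump was isolated and the motor was replaced.")
   print(displayNER(doc))  # pandas DataFrame of token attributes
   printDepTree(doc)       # pretty-printed dependency tree

The ``entity_ruler`` here is only one possible consumer of the generated patterns; per the
docstrings above, they are intended for an entity matcher component.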
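
A similarly hedged sketch of the lemma and graph helpers follows; the phrase, the edge
tuples, and the color value are hypothetical examples, and the model name is an assumption.

.. code-block:: python

   # Illustrative sketch of extractLemma and plotDAG; inputs are hypothetical.
   import spacy

   from src.dackar.utils.nlp.nlp_utils import extractLemma, plotDAG

   nlp = spacy.load("en_core_web_sm")

   # Lemmatize a short phrase; returns a list of lemmas
   lemmas = extractLemma("pumps leaking", nlp)

   # Draw a small directed graph from (subj, conj) edge tuples
   plotDAG([("pump", "leak"), ("motor", "vibration")], colors="k")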