src.dackar.similarity.simUtils¶
Functions¶
- sentenceSimilarity: Compute sentence similarity using both semantic and word order similarity
- wordOrderSimilaritySentences: Compute sentence similarity using word order similarity
- constructWordOrderVector: Construct the word order vector
- semanticSimilaritySentences: Compute sentence similarity using semantic similarity
- constructSemanticVector: Construct the semantic vector
- brownInfo: Compute the word dict and total word count for the NLTK Brown corpus
- content: Use statistics from the Brown corpus to compute the information content of a given word
- identifyBestSimilarWordFromWordSet: Identify the most similar word in a word set for a given word
- semanticSimilarityWords: Compute the similarity between two words using semantic analysis
- identifyBestSimilarSynsetPair: Identify the best synset pair for two given words using WordNet similarity analysis
- identifyNounAndVerbForComparison: Extract nouns and verbs from a sentence for word-based comparison
- sentenceSenseDisambiguation: Resolve word sense ambiguity using the sentence context
- wordsSimilarity: General method for computing word similarity
- wordSenseDisambiguation: Resolve word sense ambiguity using the sentence context
- sentenceSenseDisambiguationPyWSD: Wrapper for the sentence sense disambiguation methods from pywsd
- sentenceSimilarityWithDisambiguation: Compute semantic similarity for two sentences, with word sense disambiguation performed first
- convertSentsToSynsetsWithDisambiguation: Use the sentence itself to identify the best synsets
- convertToSynsets: Convert a list/set of words into a list of synsets
- identifyBestSynset: Identify the best synset for a given word using provided additional information (i.e., jointWordList)
- convertSentsToSynsets: Use the sentence itself to identify the best synsets
Module Contents¶
- src.dackar.similarity.simUtils.sentenceSimilarity(sentenceA, sentenceB, infoContentNorm=False, delta=0.85)[source]¶
Compute sentence similarity using both semantic and word order similarity. The semantic similarity is based on the maximum word similarity between each word and the other sentence.
- Parameters:
sentenceA – str, first sentence used to compute sentence similarity
sentenceB – str, second sentence used to compute sentence similarity
infoContentNorm – bool, True if statistics corpus is used to weight similarity vectors
delta – float, [0,1], the similarity contribution from semantic similarity; 1-delta is the contribution from word order similarity
- Returns:
float, [0, 1], the computed similarity for given two sentences
- Return type:
similarity
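The blending role of the delta parameter can be sketched as follows; this is a minimal illustration of the stated formula (delta times semantic similarity plus 1-delta times word order similarity), with a hypothetical helper name, not the dackar implementation:

```python
def blend_similarity(semantic_sim, word_order_sim, delta=0.85):
    """Combine semantic and word order similarity into one score in [0, 1]."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    # delta weights the semantic part; the remainder weights word order
    return delta * semantic_sim + (1.0 - delta) * word_order_sim
```

With the default delta=0.85, two sentences with identical meaning but fully different word order still score 0.85.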
- src.dackar.similarity.simUtils.wordOrderSimilaritySentences(sentenceA, sentenceB)[source]¶
Compute sentence similarity using word order similarity
- Parameters:
sentenceA – str, first sentence used to compute sentence similarity
sentenceB – str, second sentence used to compute sentence similarity
- Returns:
float, [0, 1], the computed word order similarity for given two sentences
- Return type:
similarity
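A common way to turn two word order vectors into a similarity score (used in the sentence-similarity literature this module follows) is the normalized vector difference S_r = 1 - ||r1 - r2|| / ||r1 + r2||. The sketch below illustrates that metric under the assumption that simUtils uses this form; the function name is hypothetical:

```python
import math

def word_order_similarity(r1, r2):
    """S_r = 1 - ||r1 - r2|| / ||r1 + r2|| for two word order vectors."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    # identical vectors give 1.0; all-zero vectors are treated as identical
    return 1.0 - diff / total if total else 1.0
```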
- src.dackar.similarity.simUtils.constructWordOrderVector(words, jointWords, index)[source]¶
Construct word order vector
- Parameters:
words – set of words, a set of words for one sentence
jointWords – set of joint words, a set of joint words for both sentences
index – dict, word index in the joint set of words
- Returns:
numpy.array, the word order vector
- Return type:
vector
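The idea behind a word order vector can be sketched as follows: for each word in the joint word set, record its 1-based position in the sentence, or 0 when the sentence does not contain it. This is an illustrative helper with an assumed signature, not the documented one (which takes a precomputed index dict):

```python
def build_word_order_vector(sentence_words, joint_words):
    """Map each joint word to its 1-based position in the sentence (0 if absent)."""
    position = {w: i + 1 for i, w in enumerate(sentence_words)}
    return [position.get(w, 0) for w in joint_words]
```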
- src.dackar.similarity.simUtils.semanticSimilaritySentences(sentenceA, sentenceB, infoContentNorm)[source]¶
Compute sentence similarity using semantic similarity. The semantic similarity is based on the maximum word similarity between each word and the other sentence.
- Parameters:
sentenceA – str, first sentence used to compute sentence similarity
sentenceB – str, second sentence used to compute sentence similarity
infoContentNorm – bool, True if statistics corpus is used to weight similarity vectors
- Returns:
float, [0, 1], the computed similarity for given two sentences
- Return type:
semSimilarity
- src.dackar.similarity.simUtils.constructSemanticVector(words, jointWords, infoContentNorm)[source]¶
Construct semantic vector
- Parameters:
words – set of words, a set of words for one sentence
jointWords – set of joint words, a set of joint words for both sentences
infoContentNorm – bool, consider word statistics in Brown Corpus if True
- Returns:
numpy.array, the semantic vector
- Return type:
vector
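The semantic vector concept can be sketched as follows: each cell corresponds to one joint word and holds the best similarity between that word and any word of the sentence (1.0 for an exact match). Here `word_sim` is a stand-in for the module's WordNet-based word similarity, and the helper name and signature are assumptions for illustration:

```python
def build_semantic_vector(sentence_words, joint_words, word_sim):
    """One cell per joint word: max similarity to any word in the sentence."""
    vector = []
    for jw in joint_words:
        if jw in sentence_words:
            vector.append(1.0)  # exact match
        else:
            vector.append(max((word_sim(jw, w) for w in sentence_words), default=0.0))
    return vector
```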
- src.dackar.similarity.simUtils.brownInfo()[source]¶
Compute word dict and word numbers in NLTK brown corpus
- Parameters:
None
- Returns:
wordCount: int, the total number of words in the Brown corpus; brownDict: dict, the Brown word dict, {word: count}
- Return type:
wordCount, brownDict
- src.dackar.similarity.simUtils.content(wordData, wordCount=0, brownDict=None)[source]¶
Employ statistics from the Brown Corpus to compute the information content of a given word in the corpus (ref: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1644735). The information content is I(w) = 1 - log(n+1)/log(N+1). The significance of a word is weighted by its information content, under the assumption that words that occur with higher frequency in the corpus carry less information than those that occur with lower frequency.
- Parameters:
wordData – string, a given word
wordCount – int, the total number of words in brown corpus
brownDict – dict, the brown word dict, {word:count}
- Returns:
float, [0, 1], the information content of a word in the corpus
- Return type:
content
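The formula stated above, I(w) = 1 - log(n+1)/log(N+1), can be written directly; n is the word's count in the corpus and N the total word count. A minimal sketch with a hypothetical function name:

```python
import math

def information_content(word_count_in_corpus, total_words):
    """I(w) = 1 - log(n+1)/log(N+1); unseen words get 1.0, ubiquitous ones approach 0."""
    return 1.0 - math.log(word_count_in_corpus + 1) / math.log(total_words + 1)
```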
- src.dackar.similarity.simUtils.identifyBestSimilarWordFromWordSet(wordA, wordSet)[source]¶
Identify the best similar word in a word set for a given word
- Parameters:
wordA – str, the given word for which the most similar word in the word set is sought
wordSet – set/list, a pool of words
- Returns:
word: str, the most similar word in the word set for the given word; similarity: float, [0, 1], the similarity score between the best pair of words
- Return type:
word, similarity
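The best-match search amounts to scanning the word set and keeping the candidate with the highest similarity score. A sketch under the assumption that exact matches score 1.0; `word_sim` stands in for the module's semantic word similarity:

```python
def best_similar_word(word, word_set, word_sim):
    """Return (best_word, best_score) over the candidate word set."""
    best_word, best_score = None, 0.0
    for candidate in word_set:
        score = 1.0 if candidate == word else word_sim(word, candidate)
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word, best_score
```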
- src.dackar.similarity.simUtils.semanticSimilarityWords(wordA, wordB)[source]¶
Compute the similarity between two words using semantic analysis. First identify the best similar synset pair using WordNet similarity, then compute the similarity using both path length and depth information in WordNet.
- Parameters:
wordA – str, the first word
wordB – str, the second word
- Returns:
float, [0, 1], the similarity score
- Return type:
similarity
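The paper referenced in the content entry above (Li et al. 2006) combines the two WordNet signals mentioned here as sim(w1, w2) = exp(-alpha * l) * tanh(beta * h), where l is the shortest path length between the best synsets and h is the depth of their lowest common subsumer. The sketch below uses the paper's suggested alpha = 0.2 and beta = 0.45; whether simUtils uses exactly these constants is an assumption:

```python
import math

def li_word_similarity(path_length, subsumer_depth, alpha=0.2, beta=0.45):
    """Li et al. (2006) word similarity: exp(-alpha*l) * tanh(beta*h)."""
    # shorter paths and deeper common subsumers both raise the score
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)
```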
- src.dackar.similarity.simUtils.identifyBestSimilarSynsetPair(wordA, wordB)[source]¶
Identify the best synset pair for given two words using wordnet similarity analysis
- Parameters:
wordA – str, the first word
wordB – str, the second word
- Returns:
tuple, (first synset, second synset), identified best synset pair using wordnet similarity
- Return type:
bestPair
- src.dackar.similarity.simUtils.identifyNounAndVerbForComparison(sentence)[source]¶
Extract nouns and verbs from a sentence for word-based comparison
- Parameters:
sentence – string, sentence string
- Returns:
list, list of dict {token/word:pos_tag}
- Return type:
pos
- src.dackar.similarity.simUtils.sentenceSenseDisambiguation(sentence, method='simple_lesk')[source]¶
Resolve word sense ambiguity using the sentence context
- Parameters:
sentence – str, sentence string
method – str, the disambiguation method; only ‘simple_lesk’ is supported
- Returns:
set, set of wordnet.Synset for the estimated best sense
- Return type:
sense
- src.dackar.similarity.simUtils.wordsSimilarity(wordA, wordB, method='semantic_similarity_synsets')[source]¶
General method for computing word similarity
- Parameters:
wordA – str, the first word
wordB – str, the second word
method – str, the method used to compute word similarity
- Returns:
float, [0, 1], the similarity score
- Return type:
similarity
- src.dackar.similarity.simUtils.wordSenseDisambiguation(word, sentence, senseMethod='simple_lesk', simMethod='path')[source]¶
Resolve word sense ambiguity using the sentence context
- Parameters:
word – str/list/set, given word or set of words
sentence – str, sentence that will be used to disambiguate the given word
senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]
simMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]
- Returns:
str/list/set (same type as the given word), the identified best sense for the given word, with disambiguation performed using the given sentence
- Return type:
sense
- src.dackar.similarity.simUtils.sentenceSenseDisambiguationPyWSD(sentence, senseMethod='simple_lesk', simMethod='path')[source]¶
Wrapper for the sentence sense disambiguation methods from pywsd (https://github.com/alvations/pywsd)
- Parameters:
sentence – str, given sentence
senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]
simMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]
- Returns:
wordList: list, list of words from the sentence that have an identified synset in WordNet; synsetList: list, the corresponding synsets for wordList
- Return type:
wordList, synsetList
- src.dackar.similarity.simUtils.sentenceSimilarityWithDisambiguation(sentenceA, sentenceB, senseMethod='simple_lesk', simMethod='semantic_similarity_synsets', disambiguationSimMethod='path', delta=0.85)[source]¶
Compute semantic similarity for two given sentences, with word sense disambiguation performed first
- Parameters:
sentenceA – str, first sentence
sentenceB – str, second sentence
senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]
simMethod – str, method for similarity analysis in the construction of semantic vectors, one of [‘semantic_similarity_synsets’, ‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]
disambiguationSimMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]
delta – float, [0,1], the similarity contribution from semantic similarity; 1-delta is the contribution from word order similarity
- Returns:
float, [0, 1], the computed similarity for given two sentences
- Return type:
similarity
- src.dackar.similarity.simUtils.convertSentsToSynsetsWithDisambiguation(sentList)[source]¶
Use the sentence itself to identify the best synsets
- Parameters:
sentList – list of sentences
- Returns:
list of synsets for corresponding sentences
- Return type:
sentSynsets
- src.dackar.similarity.simUtils.convertToSynsets(wordSet)[source]¶
Convert a list/set of words into a list of synsets
- Parameters:
wordSet – list/set of words
- Returns:
wordList: list, list of words without duplications; synsets: list, list of synsets corresponding to wordList
- Return type:
wordList, synsets
- src.dackar.similarity.simUtils.identifyBestSynset(word, jointWordList, jointSynsetList)[source]¶
Identify the best synset for given word with provided additional information (i.e., jointWordList)
- Parameters:
word – str, a single word
jointWordList – list, a list of words without duplications
jointSynsetList – list, a list of synsets corresponding to jointWordList
- Returns:
wn.synset, identified synset for given word
- Return type:
bestSyn
- src.dackar.similarity.simUtils.convertSentsToSynsets(sentList, info=None)[source]¶
Use the sentence itself to identify the best synsets
- Parameters:
sentList – list, list of sentences
info – list, additional list of words that will be used to determine the synset
- Returns:
list, list of synsets corresponding to the provided sentList
- Return type:
sentSynsets