src.dackar.similarity.simUtils

Functions

sentenceSimilarity(sentenceA, sentenceB[, ...])

Compute sentence similarity using both semantic and word order similarity

wordOrderSimilaritySentences(sentenceA, sentenceB)

Compute sentence similarity using word order similarity

constructWordOrderVector(words, jointWords, index)

Construct word order vector

semanticSimilaritySentences(sentenceA, sentenceB, ...)

Compute sentence similarity using semantic similarity

constructSemanticVector(words, jointWords, infoContentNorm)

Construct semantic vector

brownInfo()

Compute the word dict and total word count from the NLTK Brown corpus

content(wordData[, wordCount, brownDict])

Employ statistics from the Brown Corpus to compute the information content of a given word in the corpus

identifyBestSimilarWordFromWordSet(wordA, wordSet)

Identify the best similar word in a word set for a given word

semanticSimilarityWords(wordA, wordB)

Compute the similarity between two words using semantic analysis

identifyBestSimilarSynsetPair(wordA, wordB)

Identify the best synset pair for two given words using wordnet similarity analysis

identifyNounAndVerbForComparison(sentence)

Extract nouns and verbs for word-based comparison

sentenceSenseDisambiguation(sentence[, method])

Remove ambiguity by using the sentence context

wordsSimilarity(wordA, wordB[, method])

General method for computing word similarity

wordSenseDisambiguation(word, sentence[, senseMethod, ...])

Remove ambiguity by using the sentence context

sentenceSenseDisambiguationPyWSD(sentence[, ...])

Wrap for sentence sense disambiguation method from pywsd

sentenceSimilarityWithDisambiguation(sentenceA, sentenceB)

Compute semantic similarity for two given sentences, with word sense disambiguation performed

convertSentsToSynsetsWithDisambiguation(sentList)

Use sentence itself to identify the best synset

convertToSynsets(wordSet)

Convert a list/set of words into a list of synsets

identifyBestSynset(word, jointWordList, jointSynsetList)

Identify the best synset for given word with provided additional information (i.e., jointWordList)

convertSentsToSynsets(sentList[, info])

Use sentence itself to identify the best synset

Module Contents

src.dackar.similarity.simUtils.sentenceSimilarity(sentenceA, sentenceB, infoContentNorm=False, delta=0.85)[source]

Compute sentence similarity using both semantic and word order similarity. The semantic similarity is based on the maximum word similarity between each word of one sentence and the words of the other sentence.

Parameters:
  • sentenceA – str, first sentence used to compute sentence similarity

  • sentenceB – str, second sentence used to compute sentence similarity

  • infoContentNorm – bool, True if statistics corpus is used to weight similarity vectors

  • delta – float, [0,1], contribution of semantic similarity to the overall score; 1-delta is the contribution from word order similarity

Returns:

float, [0, 1], the computed similarity for given two sentences

Return type:

similarity
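
The docstring above implies that the final score is a convex combination of the two component similarities. A minimal sketch of that combination (illustrative only; combineSimilarity is not part of simUtils):

```python
def combineSimilarity(semSim, orderSim, delta=0.85):
    # delta weights the semantic similarity; 1 - delta weights
    # the word order similarity
    return delta * semSim + (1.0 - delta) * orderSim

score = combineSimilarity(0.9, 0.5)  # 0.85*0.9 + 0.15*0.5
```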

src.dackar.similarity.simUtils.wordOrderSimilaritySentences(sentenceA, sentenceB)[source]

Compute sentence similarity using word order similarity

Parameters:
  • sentenceA – str, first sentence used to compute sentence similarity

  • sentenceB – str, second sentence used to compute sentence similarity

Returns:

float, [0, 1], the computed word order similarity for given two sentences

Return type:

similarity
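
Word order similarity over two word order vectors r1 and r2 is typically computed as 1 - ||r1 - r2|| / ||r1 + r2|| (following the Li et al. reference cited under content() below). A sketch under that assumption (wordOrderSim is illustrative, not the library function):

```python
import math

def wordOrderSim(r1, r2):
    # 1 - ||r1 - r2|| / ||r1 + r2|| over the two word order vectors
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0

wordOrderSim([1, 2, 3], [1, 2, 3])  # identical orderings -> 1.0
```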

src.dackar.similarity.simUtils.constructWordOrderVector(words, jointWords, index)[source]

Construct word order vector

Parameters:
  • words – set of words, a set of words for one sentence

  • jointWords – set of joint words, a set of joint words for both sentences

  • index – dict, word index in the joint set of words

Returns:

numpy.array, the word order vector

Return type:

vector

src.dackar.similarity.simUtils.semanticSimilaritySentences(sentenceA, sentenceB, infoContentNorm)[source]

Compute sentence similarity using semantic similarity. The semantic similarity is based on the maximum word similarity between each word of one sentence and the words of the other sentence.

Parameters:
  • sentenceA – str, first sentence used to compute sentence similarity

  • sentenceB – str, second sentence used to compute sentence similarity

  • infoContentNorm – bool, True if statistics corpus is used to weight similarity vectors

Returns:

float, [0, 1], the computed similarity for given two sentences

Return type:

semSimilarity

src.dackar.similarity.simUtils.constructSemanticVector(words, jointWords, infoContentNorm)[source]

Construct semantic vector

Parameters:
  • words – set of words, a set of words for one sentence

  • jointWords – set of joint words, a set of joint words for both sentences

  • infoContentNorm – bool, consider word statistics in Brown Corpus if True

Returns:

numpy.array, the semantic vector

Return type:

vector

src.dackar.similarity.simUtils.brownInfo()[source]

Compute the word dict and total word count from the NLTK Brown corpus

Parameters:

None

Returns:

wordCount: int, the total number of words in the Brown corpus; brownDict: dict, the Brown word dict, {word: count}

Return type:

wordCount, brownDict

src.dackar.similarity.simUtils.content(wordData, wordCount=0, brownDict=None)[source]

Employ statistics from the Brown Corpus to compute the information content of a given word in the corpus (ref: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1644735). Information content: I(w) = 1 - log(n+1)/log(N+1). The significance of a word is weighted by its information content, under the assumption that words occurring with higher frequency in the corpus carry less information than those occurring with lower frequency.

Parameters:
  • wordData – string, a given word

  • wordCount – int, the total number of words in brown corpus

  • brownDict – dict, the brown word dict, {word:count}

Returns:

float, [0, 1], the information content of a word in the corpus

Return type:

content

src.dackar.similarity.simUtils.identifyBestSimilarWordFromWordSet(wordA, wordSet)[source]

Identify the best similar word in a word set for a given word

Parameters:
  • wordA – str, the word for which to find the best similar word in the word set

  • wordSet – set/list, a pool of words

Returns:

word: str, the best similar word in the word set for the given word; similarity: float, [0, 1], similarity score between the best pair of words

Return type:

word, similarity

src.dackar.similarity.simUtils.semanticSimilarityWords(wordA, wordB)[source]

Compute the similarity between two words using semantic analysis. First identify the best similar synset pair using wordnet similarity, then compute the similarity using both path length and depth information in wordnet.

Parameters:
  • wordA – str, the first word

  • wordB – str, the second word

Returns:

float, [0, 1], the similarity score

Return type:

similarity

src.dackar.similarity.simUtils.identifyBestSimilarSynsetPair(wordA, wordB)[source]

Identify the best synset pair for two given words using wordnet similarity analysis

Parameters:
  • wordA – str, the first word

  • wordB – str, the second word

Returns:

tuple, (first synset, second synset), identified best synset pair using wordnet similarity

Return type:

bestPair

src.dackar.similarity.simUtils.identifyNounAndVerbForComparison(sentence)[source]

Extract nouns and verbs for word-based comparison

Parameters:

sentence – str, the sentence string

Returns:

list, list of dict {token/word:pos_tag}

Return type:

pos

src.dackar.similarity.simUtils.sentenceSenseDisambiguation(sentence, method='simple_lesk')[source]

Remove ambiguity by using the sentence context

Parameters:
  • sentence – str, sentence string

  • method – str, the method for disambiguation; only the simple_lesk method is currently supported

Returns:

set, set of wordnet.Synset for the estimated best sense

Return type:

sense

src.dackar.similarity.simUtils.wordsSimilarity(wordA, wordB, method='semantic_similarity_synsets')[source]

General method for computing word similarity

Parameters:
  • wordA – str, the first word

  • wordB – str, the second word

  • method – str, the method used to compute word similarity

Returns:

float, [0, 1], the similarity score

Return type:

similarity

src.dackar.similarity.simUtils.wordSenseDisambiguation(word, sentence, senseMethod='simple_lesk', simMethod='path')[source]

Remove ambiguity by using the sentence context

Parameters:
  • word – str/list/set, given word or set of words

  • sentence – str, sentence that will be used to disambiguate the given word

  • senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]

  • simMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

Returns:

str/list/set (matching the type of the given word), the identified best sense for the given word, with disambiguation performed using the given sentence

Return type:

sense

src.dackar.similarity.simUtils.sentenceSenseDisambiguationPyWSD(sentence, senseMethod='simple_lesk', simMethod='path')[source]

Wrap for sentence sense disambiguation method from pywsd https://github.com/alvations/pywsd

Parameters:
  • sentence – str, given sentence

  • senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]

  • simMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

Returns:

wordList: list, list of words from the sentence that have an identified synset from wordnet; synsetList: list, list of corresponding synsets for wordList

Return type:

wordList, synsetList

src.dackar.similarity.simUtils.sentenceSimilarityWithDisambiguation(sentenceA, sentenceB, senseMethod='simple_lesk', simMethod='semantic_similarity_synsets', disambiguationSimMethod='path', delta=0.85)[source]

Compute semantic similarity for two given sentences, with word sense disambiguation performed

Parameters:
  • sentenceA – str, first sentence

  • sentenceB – str, second sentence

  • senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]

  • simMethod – str, method for similarity analysis in the construction of semantic vectors, one of [‘semantic_similarity_synsets’, ‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

  • disambiguationSimMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

  • delta – float, [0,1], contribution of semantic similarity to the overall score; 1-delta is the contribution from word order similarity

Returns:

float, [0, 1], the computed similarity for given two sentences

Return type:

similarity

src.dackar.similarity.simUtils.convertSentsToSynsetsWithDisambiguation(sentList)[source]

Use sentence itself to identify the best synset

Parameters:

sentList – list of sentences

Returns:

list of synsets for corresponding sentences

Return type:

sentSynsets

src.dackar.similarity.simUtils.convertToSynsets(wordSet)[source]

Convert a list/set of words into a list of synsets

Parameters:

wordSet – list/set of words

Returns:

wordList: list, list of words without duplications; synsets: list, list of synsets corresponding to wordList

Return type:

wordList, synsets

src.dackar.similarity.simUtils.identifyBestSynset(word, jointWordList, jointSynsetList)[source]

Identify the best synset for given word with provided additional information (i.e., jointWordList)

Parameters:
  • word – str, a single word

  • jointWordList – list, a list of words without duplications

  • jointSynsetList – list, a list of synsets corresponding to jointWordList

Returns:

wn.synset, identified synset for given word

Return type:

bestSyn

src.dackar.similarity.simUtils.convertSentsToSynsets(sentList, info=None)[source]

Use sentence itself to identify the best synset

Parameters:
  • sentList – list, list of sentences

  • info – list, additional list of words that will be used to determine the synset

Returns:

list, list of synsets corresponding to the provided sentList

Return type:

sentSynsets