src.dackar.similarity.simUtils

Functions

sentenceSimilarity(sentenceA, sentenceB[, ...])

Compute sentence similarity using both semantic and word order similarity

wordOrderSimilaritySentences(sentenceA, sentenceB)

Compute sentence similarity using word order similarity

constructWordOrderVector(words, jointWords, index)

Construct word order vector

semanticSimilaritySentences(sentenceA, sentenceB, ...)

Compute sentence similarity using semantic similarity

constructSemanticVector(words, jointWords, infoContentNorm)

Construct semantic vector

brownInfo()

Compute the word dict and total word count from the NLTK Brown corpus

content(wordData[, wordCount, brownDict])

Employ statistics from the Brown Corpus to compute the information content of a given word in the corpus

identifyBestSimilarWordFromWordSet(wordA, wordSet)

Identify the best similar word in a word set for a given word

semanticSimilarityWords(wordA, wordB)

Compute the similarity between two words using semantic analysis

identifyBestSimilarSynsetPair(wordA, wordB)

Identify the best synset pair for two given words using wordnet similarity analysis

identifyNounAndVerbForComparison(sentence)

Extract nouns and verbs for word-based comparison

sentenceSenseDisambiguation(sentence[, method])

Remove ambiguity by using the sentence context

wordsSimilarity(wordA, wordB[, method])

General method for computing word similarity

wordSenseDisambiguation(word, sentence[, senseMethod, ...])

Remove ambiguity by using the sentence context

sentenceSenseDisambiguationPyWSD(sentence[, ...])

Wrap for sentence sense disambiguation method from pywsd

sentenceSimilarityWithDisambiguation(sentenceA, sentenceB)

Compute semantic similarity for two given sentences, with word sense disambiguation performed

convertSentsToSynsetsWithDisambiguation(sentList)

Use sentence itself to identify the best synset

convertToSynsets(wordSet)

Convert a list/set of words into a list of synsets

identifyBestSynset(word, jointWordList, jointSynsetList)

Identify the best synset for given word with provided additional information (i.e., jointWordList)

convertSentsToSynsets(sentList[, info])

Use sentence itself to identify the best synset

Module Contents

src.dackar.similarity.simUtils.sentenceSimilarity(sentenceA, sentenceB, infoContentNorm=False, delta=0.85)[source]

Compute sentence similarity using both semantic and word order similarity. The semantic similarity is based on the maximum word similarity between each word of one sentence and the words of the other sentence.

Parameters:
  • sentenceA – str, first sentence used to compute sentence similarity

  • sentenceB – str, second sentence used to compute sentence similarity

  • infoContentNorm – bool, True if statistics corpus is used to weight similarity vectors

  • delta – float, [0,1], contribution of semantic similarity to the overall score; 1-delta is the contribution from word order similarity

Returns:

float, [0, 1], the computed similarity for given two sentences

Return type:

similarity
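
The docstring above implies that the final score is a convex combination of the two component similarities. A minimal sketch of that combination (illustrative only; combineSimilarity is not part of simUtils):

```python
def combineSimilarity(semSim, orderSim, delta=0.85):
    # delta weights the semantic similarity; 1 - delta weights
    # the word order similarity
    return delta * semSim + (1.0 - delta) * orderSim

score = combineSimilarity(0.9, 0.5)  # 0.85*0.9 + 0.15*0.5
```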

src.dackar.similarity.simUtils.wordOrderSimilaritySentences(sentenceA, sentenceB)[source]

Compute sentence similarity using word order similarity

Parameters:
  • sentenceA – str, first sentence used to compute sentence similarity

  • sentenceB – str, second sentence used to compute sentence similarity

Returns:

float, [0, 1], the computed word order similarity for given two sentences

Return type:

similarity
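
Word order similarity over two word order vectors r1 and r2 is typically computed as 1 - ||r1 - r2|| / ||r1 + r2|| (following the Li et al. reference cited under content() below). A sketch under that assumption (wordOrderSim is illustrative, not the library function):

```python
import math

def wordOrderSim(r1, r2):
    # 1 - ||r1 - r2|| / ||r1 + r2|| over the two word order vectors
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0

wordOrderSim([1, 2, 3], [1, 2, 3])  # identical orderings -> 1.0
```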

src.dackar.similarity.simUtils.constructWordOrderVector(words, jointWords, index)[source]

Construct word order vector

Parameters:
  • words – set of words, a set of words for one sentence

  • jointWords – set of joint words, a set of joint words for both sentences

  • index – dict, word index in the joint set of words

Returns:

numpy.array, the word order vector

Return type:

vector

src.dackar.similarity.simUtils.semanticSimilaritySentences(sentenceA, sentenceB, infoContentNorm)[source]

Compute sentence similarity using semantic similarity. The semantic similarity is based on the maximum word similarity between each word of one sentence and the words of the other sentence.

Parameters:
  • sentenceA – str, first sentence used to compute sentence similarity

  • sentenceB – str, second sentence used to compute sentence similarity

  • infoContentNorm – bool, True if statistics corpus is used to weight similarity vectors

Returns:

float, [0, 1], the computed similarity for given two sentences

Return type:

semSimilarity

src.dackar.similarity.simUtils.constructSemanticVector(words, jointWords, infoContentNorm)[source]

Construct semantic vector

Parameters:
  • words – set of words, a set of words for one sentence

  • jointWords – set of joint words, a set of joint words for both sentences

  • infoContentNorm – bool, consider word statistics in Brown Corpus if True

Returns:

numpy.array, the semantic vector

Return type:

vector

src.dackar.similarity.simUtils.brownInfo()[source]

Compute the word dict and total word count from the NLTK Brown corpus

Parameters:

None

Returns:

wordCount: int, the total number of words in the Brown corpus; brownDict: dict, the Brown word dict, {word: count}

Return type:

wordCount, brownDict

src.dackar.similarity.simUtils.content(wordData, wordCount=0, brownDict=None)[source]

Employ statistics from the Brown Corpus to compute the information content of a given word in the corpus (ref: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1644735). Information content: I(w) = 1 - log(n+1)/log(N+1). The significance of a word is weighted by its information content, under the assumption that words occurring with higher frequency in the corpus carry less information than those occurring with lower frequency.

Parameters:
  • wordData – string, a given word

  • wordCount – int, the total number of words in brown corpus

  • brownDict – dict, the brown word dict, {word:count}

Returns:

float, [0, 1], the information content of a word in the corpus

Return type:

content

src.dackar.similarity.simUtils.identifyBestSimilarWordFromWordSet(wordA, wordSet)[source]

Identify the best similar word in a word set for a given word

Parameters:
  • wordA – str, the word for which to find the best similar word in the word set

  • wordSet – set/list, a pool of words

Returns:

word: str, the best similar word in the word set for the given word; similarity: float, [0, 1], similarity score between the best pair of words

Return type:

word, similarity

src.dackar.similarity.simUtils.semanticSimilarityWords(wordA, wordB)[source]

Compute the similarity between two words using semantic analysis. First identify the best similar synset pair using wordnet similarity, then compute the similarity using both path length and depth information in wordnet.

Parameters:
  • wordA – str, the first word

  • wordB – str, the second word

Returns:

float, [0, 1], the similarity score

Return type:

similarity

src.dackar.similarity.simUtils.identifyBestSimilarSynsetPair(wordA, wordB)[source]

Identify the best synset pair for two given words using wordnet similarity analysis

Parameters:
  • wordA – str, the first word

  • wordB – str, the second word

Returns:

tuple, (first synset, second synset), identified best synset pair using wordnet similarity

Return type:

bestPair

src.dackar.similarity.simUtils.identifyNounAndVerbForComparison(sentence)[source]

Extract nouns and verbs for word-based comparison

Parameters:

sentence – str, the sentence string

Returns:

list, list of dict {token/word:pos_tag}

Return type:

pos

src.dackar.similarity.simUtils.sentenceSenseDisambiguation(sentence, method='simple_lesk')[source]

Remove ambiguity by using the sentence context

Parameters:
  • sentence – str, sentence string

  • method – str, the method for disambiguation; only the simple_lesk method is currently supported

Returns:

set, set of wordnet.Synset for the estimated best sense

Return type:

sense

src.dackar.similarity.simUtils.wordsSimilarity(wordA, wordB, method='semantic_similarity_synsets')[source]

General method for computing word similarity

Parameters:
  • wordA – str, the first word

  • wordB – str, the second word

  • method – str, the method used to compute word similarity

Returns:

float, [0, 1], the similarity score

Return type:

similarity

src.dackar.similarity.simUtils.wordSenseDisambiguation(word, sentence, senseMethod='simple_lesk', simMethod='path')[source]

Remove ambiguity by using the sentence context

Parameters:
  • word – str/list/set, given word or set of words

  • sentence – str, sentence that will be used to disambiguate the given word

  • senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]

  • simMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

Returns:

str/list/set (matching the type of the given word), the identified best sense for the given word, with disambiguation performed using the given sentence

Return type:

sense

src.dackar.similarity.simUtils.sentenceSenseDisambiguationPyWSD(sentence, senseMethod='simple_lesk', simMethod='path')[source]

Wrap for sentence sense disambiguation method from pywsd https://github.com/alvations/pywsd

Parameters:
  • sentence – str, given sentence

  • senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]

  • simMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

Returns:

wordList: list, list of words from the sentence that have an identified synset from wordnet; synsetList: list, list of corresponding synsets for wordList

Return type:

wordList, synsetList

src.dackar.similarity.simUtils.sentenceSimilarityWithDisambiguation(sentenceA, sentenceB, senseMethod='simple_lesk', simMethod='semantic_similarity_synsets', disambiguationSimMethod='path', delta=0.85)[source]

Compute semantic similarity for two given sentences, with word sense disambiguation performed

Parameters:
  • sentenceA – str, first sentence

  • sentenceB – str, second sentence

  • senseMethod – str, method for disambiguation, one of [‘simple_lesk’, ‘original_lesk’, ‘cosine_lesk’, ‘adapted_lesk’, ‘max_similarity’]

  • simMethod – str, method for similarity analysis in the construction of semantic vectors, one of [‘semantic_similarity_synsets’, ‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

  • disambiguationSimMethod – str, method for similarity analysis when ‘max_similarity’ is used, one of [‘path’, ‘wup’, ‘lch’, ‘res’, ‘jcn’, ‘lin’]

  • delta – float, [0,1], contribution of semantic similarity to the overall score; 1-delta is the contribution from word order similarity

Returns:

float, [0, 1], the computed similarity for given two sentences

Return type:

similarity

src.dackar.similarity.simUtils.convertSentsToSynsetsWithDisambiguation(sentList)[source]

Use sentence itself to identify the best synset

Parameters:

sentList – list of sentences

Returns:

list of synsets for corresponding sentences

Return type:

sentSynsets

src.dackar.similarity.simUtils.convertToSynsets(wordSet)[source]

Convert a list/set of words into a list of synsets

Parameters:

wordSet – list/set of words

Returns:

wordList: list, list of words without duplications; synsets: list, list of synsets corresponding to wordList

Return type:

wordList, synsets

src.dackar.similarity.simUtils.identifyBestSynset(word, jointWordList, jointSynsetList)[source]

Identify the best synset for given word with provided additional information (i.e., jointWordList)

Parameters:
  • word – str, a single word

  • jointWordList – list, a list of words without duplications

  • jointSynsetList – list, a list of synsets corresponding to jointWordList

Returns:

wn.synset, identified synset for given word

Return type:

bestSyn

src.dackar.similarity.simUtils.convertSentsToSynsets(sentList, info=None)[source]

Use sentence itself to identify the best synset

Parameters:
  • sentList – list, list of sentences

  • info – list, additional list of words that will be used to determine the synset

Returns:

list, list of synsets corresponding to the provided sentList

Return type:

sentSynsets