Semantic Similarity Analysis¶
Leveraging WordNet for semantic similarity calculations with word sense disambiguation. The semantic similarity between two sentences is based on the maximum word similarity between each word in one sentence and the words of the other sentence. In addition, word order similarity can also be considered in the calculation.
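The "maximum word similarity" aggregation described above can be sketched in a few lines. Note that `wordSim` below is a toy stand-in for a WordNet-based word similarity measure (it is not DACKAR's implementation); only the aggregation pattern is the point.

```python
def wordSim(w1, w2):
    """Toy word similarity: 1.0 for identical words, 0.5 for a shared
    4-letter prefix, else 0.0 (stand-in for a WordNet path measure)."""
    if w1 == w2:
        return 1.0
    if w1[:4] == w2[:4]:
        return 0.5
    return 0.0

def maxWordSim(word, sentence):
    """Best similarity between `word` and any word of `sentence`."""
    return max(wordSim(word, w) for w in sentence.split())

def semanticSim(sentA, sentB):
    """Average, over the words of sentA, of each word's best match in sentB."""
    words = sentA.split()
    return sum(maxWordSim(w, sentB) for w in words) / len(words)

print(semanticSim('workers at the plant', 'the plant bearing flowers'))  # 0.5
```

As written the score is asymmetric (`semanticSim(a, b)` need not equal `semanticSim(b, a)`); in practice such a measure is usually symmetrized by averaging both directions.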

Set up the path and load DACKAR modules for similarity analysis¶
[1]:
import os
import sys
cwd = os.getcwd()
frameworkDir = os.path.abspath(os.path.join(cwd, os.pardir, 'src'))
sys.path.append(frameworkDir)
import time
from dackar.similarity import synsetUtils as SU
from dackar.similarity import simUtils
Example¶
[2]:
sents = ['The workers at the industrial plant were overworked',
         'The plant was no longer bearing flowers']
Compute sentences similarity without disambiguation¶
delta \in [0, 1] weights the contributions of semantic similarity and word order similarity to the overall score
[3]:
similarity = simUtils.sentenceSimilarity(sents[0], sents[1], delta=.8)
print('Similarity Score: ',similarity)
Similarity Score: 0.597345002737055
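A common way to blend the two components (this formulation is assumed here, following Li et al.'s widely used sentence similarity measure, rather than taken from DACKAR's source) is `S = delta * S_semantic + (1 - delta) * S_order`, where the word order score compares rank vectors built over the joint word set of the two sentences:

```python
import math

def wordOrderSim(sentA, sentB):
    """Word order similarity from rank vectors over the joint word set
    (rank = 1-based position in the sentence, 0 when the word is absent)."""
    wordsA, wordsB = sentA.split(), sentB.split()
    joint = list(dict.fromkeys(wordsA + wordsB))  # unique, order-preserving
    def ranks(words):
        return [words.index(w) + 1 if w in words else 0 for w in joint]
    r1, r2 = ranks(wordsA), ranks(wordsB)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total

def combinedSim(semantic, order, delta=0.8):
    """Weighted blend of semantic and word order similarity."""
    return delta * semantic + (1 - delta) * order

print(wordOrderSim('the plant was overworked', 'the plant was overworked'))  # 1.0
```

With delta = 0.8 the score is dominated by the semantic component, which matches the calls in this notebook.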
Compute sentences similarity with disambiguation¶
[4]:
similarity = simUtils.sentenceSimilarityWithDisambiguation(sents[0], sents[1], delta=.8)
print('Similarity Score: ',similarity)
Warming up PyWSD (takes ~10 secs)...
Similarity Score: 0.05641469403833227
took 0.9650449752807617 secs.
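Disambiguation changes which synset each word maps to, which is why the score shifts so much: "plant" means different things in the two sentences. A minimal Lesk-style sketch of the idea follows; the glosses are hand-written stand-ins for WordNet definitions, and pywsd's `simple_lesk` is considerably more thorough (tokenization, stemming, extended glosses).

```python
# Hand-written glosses standing in for WordNet sense definitions.
GLOSSES = {
    'plant': {
        'plant.n.01': 'buildings for carrying on industrial labor',
        'plant.n.02': 'a living organism lacking locomotion typically bearing leaves or flowers',
    }
}

def leskSense(word, sentence):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(sentence.lower().split())
    best, bestOverlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > bestOverlap:
            best, bestOverlap = sense, overlap
    return best

print(leskSense('plant', 'The workers at the industrial plant were overworked'))  # plant.n.01
print(leskSense('plant', 'The plant was no longer bearing flowers'))              # plant.n.02
```

Because the two occurrences of "plant" resolve to unrelated synsets, the disambiguated sentence similarity drops sharply, as seen above.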
Convert sentences into synsets list, and then compute similarity¶
[5]:
sentSynsets = simUtils.convertSentsToSynsets(sents)
similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)
print('Similarity Score: ',similarity)
Similarity Score: 0.43946127500409304
Using the disambiguation method to create synsets¶
[6]:
sentSynsets = simUtils.convertSentsToSynsetsWithDisambiguation(sents)
similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)
print('Similarity Score: ',similarity)
Similarity Score: 0.31713942870949496
Timing for performance¶
[7]:
st = time.time()
for i in range(100):
    sentSynsets = simUtils.convertSentsToSynsets(sents)
print('%s seconds' % (time.time() - st))
4.767474889755249 seconds
[8]:
st = time.time()
for i in range(1000):
    similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)
print('%s seconds' % (time.time() - st))
1.6129579544067383 seconds
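Comparing the two timings above, converting sentences to synsets costs far more per call than comparing the resulting synset lists, so when one sentence is compared against many others it pays to convert each sentence once and reuse the result. A generic memoization sketch of that pattern, where `convert` is a hypothetical stand-in for `simUtils.convertSentsToSynsets`:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def convert(sentence):
    # Stand-in for the expensive sentence -> synset-list conversion;
    # lru_cache returns the stored result on repeated identical inputs.
    return tuple(sentence.lower().split())

synsetsA = convert('The plant was no longer bearing flowers')
synsetsB = convert('The plant was no longer bearing flowers')  # served from cache
print(convert.cache_info().hits)  # 1
```

`lru_cache` requires hashable arguments, which is why the sketch keys on the sentence string and returns a tuple rather than a list.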
[9]:
st = time.time()
sentSynsets = []
for i in range(1000):
    for j in range(len(sents)):
        _, synsetsA = simUtils.sentenceSenseDisambiguationPyWSD(sents[j], senseMethod='simple_lesk', simMethod='path')
        sentSynsets.append(synsetsA)
print('%s seconds' % (time.time() - st))
0.8050062656402588 seconds