{ "cells": [ { "cell_type": "markdown", "id": "af2a9522-ded9-4b7b-9c75-1935b12578fe", "metadata": {}, "source": [ "## Semantic Similarity Analysis \n", "\n", "Leveraging WordNet for semantic similarity calculations with word disambiguation. The semantic similarity is based on maximum word similarity between one word and another sentence. In addition, word order similarity can be also considered in the semantic similarity calculations.\n", "\n", "![similarity](./images/similarity.png)" ] }, { "cell_type": "markdown", "id": "ed0ab04f", "metadata": {}, "source": [ "#### Setup the path and load DACKAR modules for similarity analysis" ] }, { "cell_type": "code", "execution_count": 1, "id": "b460f1e1-f63d-4199-aa75-513c97e40e1d", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "\n", "cwd = os.getcwd()\n", "frameworkDir = os.path.abspath(os.path.join(cwd, os.pardir, 'src'))\n", "sys.path.append(frameworkDir)\n", "\n", "import time\n", "from dackar.similarity import synsetUtils as SU\n", "from dackar.similarity import simUtils" ] }, { "cell_type": "markdown", "id": "db7abfdd", "metadata": {}, "source": [ "#### Example" ] }, { "cell_type": "code", "execution_count": 2, "id": "a50d7391", "metadata": {}, "outputs": [], "source": [ "sents = ['The workers at the industrial plant were overworked',\n", " 'The plant was no longer bearing flowers']" ] }, { "cell_type": "markdown", "id": "8835ccfb", "metadata": {}, "source": [ "#### Compute sentences similarity without disambiguation\n", "\n", "- delta $\\in$ [0, 1] is used to control the similarity contribution from semantic and word order similarity" ] }, { "cell_type": "code", "execution_count": 3, "id": "26873c79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.597345002737055\n" ] } ], "source": [ "similarity = simUtils.sentenceSimilarity(sents[0], sents[1], delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "6f494fe6", "metadata": {}, "source": [ "#### Compute sentences similarity with disambiguation" ] }, { "cell_type": "code", "execution_count": 4, "id": "536e7062", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warming up PyWSD (takes ~10 secs)... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.05641469403833227\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "took 0.9650449752807617 secs.\n" ] } ], "source": [ "similarity = simUtils.sentenceSimilarityWithDisambiguation(sents[0], sents[1], delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "5f7b2b20-b751-4d06-8eb6-8b43aacde955", "metadata": {}, "source": [ "#### Convert sentences into synsets list, and then compute similarity" ] }, { "cell_type": "code", "execution_count": 5, "id": "1a168728-3b3b-44d2-88ac-a36618191b90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.43946127500409304\n" ] } ], "source": [ "sentSynsets = simUtils.convertSentsToSynsets(sents)\n", "similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "f8a81268-29ea-4ef4-9a36-53088374d54d", "metadata": {}, "source": [ " #### Using disambiguation method to create synsets " ] }, { "cell_type": "code", "execution_count": 6, "id": "b349155c-6e69-42c5-bc84-a128ea7d8cb8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.31713942870949496\n" ] } ], "source": [ "sentSynsets = simUtils.convertSentsToSynsetsWithDisambiguation(sents)\n", "similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "7d868f88-fed9-400d-a2d0-9cf77a811c35", "metadata": {}, "source": [ "### Timing for performance\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "1c8b368e-e76b-4f4b-94c7-b1c72a490492", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.767474889755249 second\n" ] } ], "source": [ "st = time.time()\n", "for i in range(100):\n", " sentSynsets = simUtils.convertSentsToSynsets(sents)\n", "print('%s second'% (time.time()-st))" ] }, { "cell_type": "code", "execution_count": 8, "id": "71a8acbf-d911-48a6-a971-5e9185cad99e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.6129579544067383 second\n" ] } ], "source": [ "st = time.time()\n", "for i in range(1000):\n", " similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)\n", "print('%s second'% (time.time()-st))" ] }, { "cell_type": "code", "execution_count": 9, "id": "523fef2d-9490-4595-a9d9-44d7fd9b0a4a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8050062656402588 second\n" ] } ], "source": [ "st = time.time()\n", "sentSynsets = []\n", "for i in range(1000):\n", " for j in range(len(sents)):\n", " _, synsetsA = simUtils.sentenceSenseDisambiguationPyWSD(sents[j], senseMethod='simple_lesk', simMethod='path')\n", " sentSynsets.append(synsetsA)\n", "print('%s second'% (time.time()-st))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }