{ "cells": [ { "cell_type": "markdown", "id": "af2a9522-ded9-4b7b-9c75-1935b12578fe", "metadata": {}, "source": [ "## Semantic Similarity Analysis \n", "\n", "Leveraging WordNet for semantic similarity calculations with word disambiguation. The semantic similarity is based on maximum word similarity between one word and another sentence. In addition, word order similarity can be also considered in the semantic similarity calculations.\n", "\n", "![similarity](./images/similarity.png)" ] }, { "cell_type": "markdown", "id": "ed0ab04f", "metadata": {}, "source": [ "#### Setup the path and load DACKAR modules for similarity analysis" ] }, { "cell_type": "code", "execution_count": 1, "id": "b460f1e1-f63d-4199-aa75-513c97e40e1d", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "\n", "cwd = os.getcwd()\n", "frameworkDir = os.path.abspath(os.path.join(cwd, os.pardir, 'src'))\n", "sys.path.append(frameworkDir)\n", "\n", "import time\n", "from dackar.similarity import synsetUtils as SU\n", "from dackar.similarity import simUtils" ] }, { "cell_type": "markdown", "id": "db7abfdd", "metadata": {}, "source": [ "#### Example" ] }, { "cell_type": "code", "execution_count": 2, "id": "a50d7391", "metadata": {}, "outputs": [], "source": [ "# sents = ['The workers at the industrial plant were overworked',\n", "# 'The plant was no longer bearing flowers']\n", "sents = ['RAM keeps things being worked with', 'The CPU uses RAM as a short-term memory store']\n", "# sents = ['The pump is not working due to leakage', 'There is a leakage in the pump']" ] }, { "cell_type": "markdown", "id": "8835ccfb", "metadata": {}, "source": [ "#### Compute sentences similarity without disambiguation\n", "\n", "- delta $\\in$ [0, 1] is used to control the similarity contribution from semantic and word order similarity" ] }, { "cell_type": "code", "execution_count": 3, "id": "26873c79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.5366827832678444\n" ] } ], "source": [ "similarity = simUtils.sentenceSimilarity(sents[0], sents[1], infoContentNorm=True, delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "code", "execution_count": 4, "id": "60754485", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.45519466650531865\n" ] } ], "source": [ "similarity = simUtils.sentenceSimilarity(sents[0], sents[1], infoContentNorm=False, delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "6f494fe6", "metadata": {}, "source": [ "#### Compute sentences similarity with disambiguation" ] }, { "cell_type": "code", "execution_count": 5, "id": "536e7062", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warming up PyWSD (takes ~10 secs)... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.4897634312885952\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "took 1.7747421264648438 secs.\n" ] } ], "source": [ "similarity = simUtils.sentenceSimilarityWithDisambiguation(sents[0], sents[1], delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "5d8aa47c", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "5f7b2b20-b751-4d06-8eb6-8b43aacde955", "metadata": {}, "source": [ "#### Convert sentences into synsets list, and then compute similarity" ] }, { "cell_type": "code", "execution_count": 6, "id": "1a168728-3b3b-44d2-88ac-a36618191b90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.2612493301234213\n" ] } ], "source": [ "sentSynsets = simUtils.convertSentsToSynsets(sents)\n", "similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "f8a81268-29ea-4ef4-9a36-53088374d54d", "metadata": {}, "source": [ " #### Using disambiguation method to create synsets " ] }, { "cell_type": "code", "execution_count": 7, "id": "b349155c-6e69-42c5-bc84-a128ea7d8cb8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity Score: 0.4992052446929406\n" ] } ], "source": [ "sentSynsets = simUtils.convertSentsToSynsetsWithDisambiguation(sents)\n", "\n", "similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=0.8)\n", "print('Similarity Score: ',similarity)" ] }, { "cell_type": "markdown", "id": "7d868f88-fed9-400d-a2d0-9cf77a811c35", "metadata": {}, "source": [ "### Timing for performance\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "1c8b368e-e76b-4f4b-94c7-b1c72a490492", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14.189666271209717 second\n" ] } ], "source": [ "st = time.time()\n", "for i in range(100):\n", " sentSynsets = simUtils.convertSentsToSynsets(sents)\n", "print('%s second'% (time.time()-st))" ] }, { "cell_type": "code", "execution_count": 9, "id": "71a8acbf-d911-48a6-a971-5e9185cad99e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.876173257827759 second\n" ] } ], "source": [ "st = time.time()\n", "for i in range(1000):\n", " similarity = SU.synsetListSimilarity(sentSynsets[0], sentSynsets[1], delta=.8)\n", "print('%s second'% (time.time()-st))" ] }, { "cell_type": "code", "execution_count": 10, "id": "523fef2d-9490-4595-a9d9-44d7fd9b0a4a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.1626660823822021 second\n" ] } ], "source": [ "st = time.time()\n", "sentSynsets = []\n", "for i in range(1000):\n", " for j in range(len(sents)):\n", " _, synsetsA = simUtils.sentenceSenseDisambiguationPyWSD(sents[j], senseMethod='simple_lesk', simMethod='path')\n", " sentSynsets.append(synsetsA)\n", "print('%s second'% (time.time()-st))" ] }, { "cell_type": "code", "execution_count": null, "id": "c4858fa4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }