Knowledge Graph Construction Through Neo4j

Install Additional Required Libraries

In order to communicate with neo4j Desktop graph database, the neo4j_python_driver need to be installed:

pip install neo4j

Connect to Neo4j via DACKAR Py2Neo Module

  • Set up Neo4j import folder path This directory is used to store data files intended for bulk import operations. The location of this directory can be configured in the Neo4j configuration file (neo4j.conf). By default, the import directory is located at: $NEO4J_HOME/import. Here, $NEO4J_HOME represents the root directory of your Neo4j installation. To change the import directory, you can modify the dbms.directories.import setting in the neo4j.conf file. For example:

    dbms.directories.import=/path/to/your/import/directory
    

    Make sure to replace /path/to/your/import/directory with the actual path you want to use for your import files. After making changes to the neo4j.conf file, you will need to restart the Neo4j server for the changes to take effect.

  • Alternative way to set up Neo4j import folder path Open the settings and replace dbms.directories.import with the actual path you want to use for your import files.

    Set up Neo4j import folder path
  • Start Neo4j desktop DBMS, set password if needed. This password will be used in the Python driver to connect to the DBMS database.

    Neo4j Desktop Graph Database Management System
  • Load DACKAR Py2Neo module to load csv files into Neo4j DBMS.

    from dackar.knowledge_graph.py2neo import Py2Neo
    
    # Create python to neo4j driver
    uri = "neo4j://localhost:7687"  # for a cluster
    pwd = "123456789"  # user need to provide the DBMS database password
    py2neo = Py2Neo(uri=uri, user='neo4j', pwd=pwd)
    
    # clean up the DBMS
    py2neo.reset()
    
    # Load node data
    file_path = './test_nodes.csv'
    
    # node label
    label = 'MBSE'
    
    # node attributes: keys, values are corresponding to neo4j node attributes and csv column names, respectively
    attribute = {'nodeId': 'nodeId', 'label': 'label', 'ID': 'ID', 'type': 'type'}
    
    # API to load csv files to create nodes in neo4j
    py2neo.load_csv_for_nodes(file_path, label, attribute)
    
    # Load relationship data
    file_path = 'test_edges.csv'
    l1 = 'MBSE'  # label for source node
    
    # node attributes: keys, values are corresponding to neo4j node attributes and csv column names, respectively
    p1 = {'nodeId': 'sourceNodeId'}
    
    # label for second node
    l2 = 'MBSE'
    
    # node attributes: keys, values are corresponding to neo4j node attributes and csv column names, respectively
    p2 = {'nodeId': 'targetNodeId'}
    
    # label for neo4j edge
    lr = 'MBSE_link'
    
    # edge attributes: keys, values are corresponding to neo4j node attributes and csv column names, respectively
    pr = {'prop': 'type'}
    
    # API to load csv files to create relations in neo4j
    py2neo.load_csv_for_relations(file_path, l1, p1, l2, p2, lr, pr)
    

Knowledge Graph Construction Through Schemas

  • Knowledge Graph Schemas The objective of a schema is to provide precise guidelines on the construction of knowledge graphs. A schema in a knowledge graph defines the structure (i.e., what types of nodes and edges are allowed), semantics, and constraints of the data (i.e., data type formats) represented in the graph. It formalizes the meaning of entities (nodes) and relationships (edges), ensuring that data is interpreted consistently across systems and users. In addition, it provides a common framework to unify disparate data sources under a shared vocabulary, which is especially valuable in enterprise environments where data silos are common. By encoding relationships and constraints, schemas enable advanced reasoning (such as deducing new facts from existing ones) using logic-based AI/ML techniques. In AI contexts, schemas help large language models and other algorithms understand the structure and semantics of the data, enabling more accurate parsing, querying, and text generation. In DACKAR a schema is defined through a .toml file where the allowed set of nodes and relations are defined. For each of these two entities, a textual description is provided along with the specifics of the allowed properties (i.e., name, data-type, required/optional setting, property description). An example, the following schema defines two nodes (i.e., cause and effect nodes) and a relation between them

    Example of knowledge graph schema
     1title = "sampleSchema"
     2version = 1.0
     3
     4# ====================
     5# Nodes 
     6# ====================
     7[node.cause]
     8node_description = """causal node"""
     9node_properties = [{name = "prop1", type = "string" , optional = false, description = 'first causal property'},
    10                   {name = "prop2", type = "integer", optional = false, description = 'second causal property'}]
    11
    12[node.consequence]
    13node_description = """consequence node"""
    14node_properties = [{name = "ID_A", type = "boolean" , optional = false, description = 'first consequence property'},
    15                   {name = "ID_B", type = "floating", optional = false, description = 'second consequence property'}]
    16
    17# ====================
    18# Relations
    19# ====================
    20[relation.causality]
    21relation_description = """Indicates cause-effect relation"""
    22from_entity = "cause"
    23to_entity = "consequence"
    24relation_properties = [{name = "intensity", type = "floating", optional = false, description = 'Strength of the cause-effect relation'}]
    

    In this example, each node is characterized by two properties (prop1 and prop2 for the cause node, ID_A and ID_B for the consequence node). The defined relation represents a cause-effect relation that links a “cause” node to a “consequence” node where one single property is defined (i.e., “intensity”). In DACKAR the user can define multiple schemas (each described in its own .toml file) depending on the specific application. The following checks for each provided schema are performed: - Validation against main schema - Data type check for each property - Check for redundancy against other nodes/relations defined in other schemas

    In this context, knowledge graphs are constructed from pandas dataframes. In addition, the user is required to provide a construction schema that defines the nodes and relations to be populated in the knowledge graph. A construction schema is defined in the form of a python dictionary. The following example indicates that two nodes are created: - Node with label nodeLabel1 is characterized by two properties (i.e., property1 and property2) whose values are indicated in the columns colA and colB of the provided dataframe - Node with label nodeLabel2 is characterized by one property (i.e., property3) whose values are indicated in column colC of the provided dataframe and one relation is also created: - Relation of type relationType links a node with label nodeLabel1 (with property1 values contained in col1 of the provided dataframe) to a node with label nodeLabel2 (with property3 values contained in col2 of the provided dataframe)

    constructionSchema = {
        'nodes': {
            'nodeLabel1': {
                'property1': 'dataframe.colA',
                'property2': 'dataframe.colB'
            },
            'nodeLabel2': {
                'property3': 'dataframe.colC'
            }
        },
        'relations': {
            'causality': {
                'source': {'cause.prop1': 'A'},
                'target': {'consequence.ID_B': 'D'},
                'properties': {'intensity': 'prop_rel'}
            }
        }
    }