This notebook demonstrates how to use Runway's EDA module with Neo4j's example dataset containing information on Stack Overflow.

import os

from neo4j_runway.database.neo4j import Neo4jGraph
from neo4j_runway.graph_eda import GraphEDA

Create a Neo4j Instance

# Connection details are read from environment variables.
g = Neo4jGraph(
    uri=os.environ.get("NEO4J_URI"),
    username=os.environ.get("NEO4J_USERNAME"),
    password=os.environ.get("NEO4J_PASSWORD"),
    database="stackoverflow",
)

GraphEDA

eda = GraphEDA(g)

We can run analytical queries individually via the GraphEDA class. For example, let's retrieve information on the database constraints.

eda.database_constraints()
id name type entityType labelsOrTypes properties ownedIndex propertyType
0 20 constraint_32ea8862 UNIQUENESS NODE [Comment] [uuid] constraint_32ea8862 None
1 18 constraint_401df8db UNIQUENESS NODE [Question] [uuid] constraint_401df8db None
2 22 constraint_64b1b1cf UNIQUENESS NODE [Tag] [name] constraint_64b1b1cf None
3 19 constraint_7e29bbac UNIQUENESS NODE [Answer] [uuid] constraint_7e29bbac None
4 21 constraint_b13a3b7d UNIQUENESS NODE [User] [uuid] constraint_b13a3b7d None

When we run a query method, the results are stored in an internal cache. By default, later calls return the cached content, but we can refresh the cache by passing refresh=True.
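The caching behavior described above can be illustrated with a small, self-contained sketch. This is not Runway's implementation: QueryCache, its get method, and fake_query are hypothetical names used only to show the "return cached result unless refresh=True" pattern.

```python
from typing import Any, Callable, Dict


class QueryCache:
    """Hypothetical stand-in for a cache-backed query runner (not Runway's code)."""

    def __init__(self) -> None:
        self._cache: Dict[str, Any] = {}
        self.query_count = 0  # number of times the "database" was actually queried

    def get(self, key: str, query: Callable[[], Any], refresh: bool = False) -> Any:
        # Run the query only if the result is missing or a refresh is requested.
        if refresh or key not in self._cache:
            self.query_count += 1
            self._cache[key] = query()
        return self._cache[key]


def fake_query() -> list:
    # Placeholder for a real database call.
    return ["constraint_32ea8862"]


cache = QueryCache()
cache.get("database_constraints", fake_query)                # runs the query
cache.get("database_constraints", fake_query)                # served from the cache
cache.get("database_constraints", fake_query, refresh=True)  # runs the query again
print(cache.query_count)  # 2
```

With GraphEDA itself, the text above suggests the equivalent would be a call such as eda.database_constraints(refresh=True).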

Collecting Insights

We can run all the analytical queries in the GraphEDA class by calling the run method.

This can be computationally intensive!

WARNING: The methods in this module can be computationally expensive. It is not recommended to use this module on massive Neo4j databases (i.e., those with nodes and relationships in the hundreds of millions).

%%capture
eda.run()

Now that we have our cache filled, let’s see if there are any isolated nodes in the database.

eda.disconnected_node_count()
0

Having no disconnected nodes is a good sign!
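To make the idea of a disconnected node concrete, here is a minimal sketch on toy data. It assumes a node is "disconnected" when it takes part in no relationship at all; the function name and the in-memory edge list are illustrative and are not how Runway queries the database.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def disconnected_nodes(
    node_ids: Iterable[Hashable],
    relationships: Iterable[Tuple[Hashable, Hashable]],
) -> List[Hashable]:
    """Return the nodes that appear in no relationship, in input order."""
    connected: Set[Hashable] = set()
    for start, end in relationships:
        connected.add(start)
        connected.add(end)
    return [n for n in node_ids if n not in connected]


# Toy data: node 4 has no relationships.
nodes = [1, 2, 3, 4]
rels = [(1, 2), (2, 3)]
isolated = disconnected_nodes(nodes, rels)
print(len(isolated), isolated)  # 1 [4]
```

In this sketch a node with only a self-loop counts as connected, which may or may not match the definition used by the library.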

Reports

We can generate a report containing all the information we’ve gathered from our queries by calling create_eda_report.

Some of the sections can become quite lengthy, so there are arguments to control the data that is returned.

%%capture
eda.create_eda_report(
    include_disconnected_node_ids=True,
    include_unlabeled_node_ids=True,
    include_node_degrees=True,
    view_report=False,
)
eda.view_report(notebook=True)

Runway EDA Report

Database Information

  databaseName databaseVersion databaseEdition APOCVersion GDSVersion
0 stackoverflow 5.15.0 enterprise 5.15.1 not installed

Counts

  nodeCount unlabeledNodeCount disconnectedNodeCount relationshipCount
0 6193 0 0 11540

Indexes

  id name state populationPercent type entityType labelsOrTypes properties indexProvider owningConstraint lastRead readCount
0 17 constraint_32ea8862 ONLINE 100 RANGE NODE ['Comment'] ['uuid'] range-1.0 constraint_32ea8862
1 13 constraint_401df8db ONLINE 100 RANGE NODE ['Question'] ['uuid'] range-1.0 constraint_401df8db
2 14 constraint_64b1b1cf ONLINE 100 RANGE NODE ['Tag'] ['name'] range-1.0 constraint_64b1b1cf
3 16 constraint_7e29bbac ONLINE 100 RANGE NODE ['Answer'] ['uuid'] range-1.0 constraint_7e29bbac
4 15 constraint_b13a3b7d ONLINE 100 RANGE NODE ['User'] ['uuid'] range-1.0 constraint_b13a3b7d
5 1 index_343aff4e ONLINE 100 LOOKUP NODE     token-lookup-1.0      
6 2 index_f7700477 ONLINE 100 LOOKUP RELATIONSHIP     token-lookup-1.0      

Constraints

  id name type entityType labelsOrTypes properties ownedIndex propertyType
0 20 constraint_32ea8862 UNIQUENESS NODE ['Comment'] ['uuid'] constraint_32ea8862
1 18 constraint_401df8db UNIQUENESS NODE ['Question'] ['uuid'] constraint_401df8db
2 22 constraint_64b1b1cf UNIQUENESS NODE ['Tag'] ['name'] constraint_64b1b1cf
3 19 constraint_7e29bbac UNIQUENESS NODE ['Answer'] ['uuid'] constraint_7e29bbac
4 21 constraint_b13a3b7d UNIQUENESS NODE ['User'] ['uuid'] constraint_b13a3b7d

Nodes Overview

Label Counts

  label count
0 Question 1589
1 Comment 1396
2 Answer 1367
3 User 1365
4 Tag 476

Properties

  nodeLabels propertyName propertyTypes mandatory
0 ['User'] uuid ['Long', 'String'] True
1 ['User'] display_name ['String'] True
2 ['Tag'] name ['String'] True
3 ['Tag'] link ['String'] True
4 ['Answer'] uuid ['Long'] True
5 ['Answer'] title ['String'] True
6 ['Answer'] link ['String'] True
7 ['Answer'] is_accepted ['Boolean'] True
8 ['Answer'] body_markdown ['String'] True
9 ['Answer'] score ['Long'] True
10 ['Comment'] uuid ['Long'] True
11 ['Comment'] link ['String'] True
12 ['Comment'] score ['Long'] True
13 ['Question'] uuid ['Long'] True
14 ['Question'] title ['String'] True
15 ['Question'] creation_date ['Long'] True
16 ['Question'] accepted_answer_id ['Long'] False
17 ['Question'] link ['String'] True
18 ['Question'] view_count ['Long'] True
19 ['Question'] answer_count ['Long'] True
20 ['Question'] body_markdown ['String'] True

Relationships Overview

Type Counts

  relType count
0 TAGGED 4425
1 ASKED 1589
2 COMMENTED_ON 1396
3 COMMENTED 1396
4 ANSWERED 1367
5 PROVIDED 1367

Properties

no relationship properties

Unlabeled Nodes

no unlabeled nodes data in cache

Disconnected Nodes

no disconnected nodes data in cache

Node Degrees

  • Top 5 Ordered By outDegree
  nodeId nodeLabel inDegree outDegree
0 5620 ['User'] 0 318
1 2441 ['User'] 0 193
2 2452 ['User'] 0 178
3 2485 ['User'] 0 144
4 2445 ['User'] 0 138

Runway v0.12.0

Report Generated @ 2024-10-25 10:53:46.134954
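The Node Degrees section of the report ranks nodes by outDegree. As a rough illustration of what that ranking means, the sketch below computes in- and out-degrees from a toy list of directed relationships and keeps the top rows. The function name, the tie-breaking rule, and the data are assumptions made for the example; Runway computes these values with database queries.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def top_by_out_degree(
    relationships: Iterable[Tuple[Hashable, Hashable]], top_k: int = 5
) -> List[Tuple[Hashable, int, int]]:
    """Return (node, in_degree, out_degree) rows, highest out-degree first."""
    rels = list(relationships)
    out_degree = Counter(start for start, _ in rels)
    in_degree = Counter(end for _, end in rels)
    nodes = set(out_degree) | set(in_degree)
    rows = [(n, in_degree[n], out_degree[n]) for n in nodes]
    # Sort by out-degree descending; break ties by the node's string form
    # so the output is deterministic.
    rows.sort(key=lambda row: (-row[2], str(row[0])))
    return rows[:top_k]


# Toy directed relationships as (start, end) pairs.
toy_rels = [("u1", "q1"), ("u1", "q2"), ("u1", "a1"), ("u2", "q3"), ("a1", "q1")]
top = top_by_out_degree(toy_rels, top_k=2)
print(top)  # [('u1', 0, 3), ('a1', 1, 1)]
```

In the report above, the top five nodes are all User nodes with an inDegree of 0, meaning every relationship attached to them is outgoing.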

We can also save the report to a Markdown file.

eda.save_report(file_name="outputs/stackoverflow_runway_report.md")
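Whether save_report creates missing directories is not stated here, so a cautious approach is to make sure the target folder exists before saving. The helper below is a hypothetical convenience function, not part of Runway.

```python
from pathlib import Path


def ensure_parent_dir(file_name: str) -> Path:
    """Create the parent directory of file_name if needed and return the path."""
    path = Path(file_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


report_path = ensure_parent_dir("outputs/stackoverflow_runway_report.md")
# In the notebook, the report could then be saved with:
# eda.save_report(file_name=str(report_path))
```

The call is safe to repeat, since exist_ok=True makes directory creation a no-op when the folder is already there.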