Stackoverflow Exploratory Data Analysis
This notebooks demonstrates how to use Runway’s EDA module with Neo4j’s example dataset containing information on Stackoverflow.
import os
from neo4j_runway.database.neo4j import Neo4jGraph
from neo4j_runway.graph_eda import GraphEDA
Create a Neo4j Instance
g = Neo4jGraph(uri=os.environ.get("NEO4J_URI"), username=os.environ.get("NEO4J_USERNAME"), password=os.environ.get("NEO4J_PASSWORD"), database="stackoverflow")
GraphEDA
eda = GraphEDA(g)
We can run analytical queries individually via the GraphEDA class. For example let’s retrieve information on the data constraints.
eda.database_constraints()
| id | name | type | entityType | labelsOrTypes | properties | ownedIndex | propertyType | |
|---|---|---|---|---|---|---|---|---|
| 0 | 20 | constraint_32ea8862 | UNIQUENESS | NODE | [Comment] | [uuid] | constraint_32ea8862 | None |
| 1 | 18 | constraint_401df8db | UNIQUENESS | NODE | [Question] | [uuid] | constraint_401df8db | None |
| 2 | 22 | constraint_64b1b1cf | UNIQUENESS | NODE | [Tag] | [name] | constraint_64b1b1cf | None |
| 3 | 19 | constraint_7e29bbac | UNIQUENESS | NODE | [Answer] | [uuid] | constraint_7e29bbac | None |
| 4 | 21 | constraint_b13a3b7d | UNIQUENESS | NODE | [User] | [uuid] | constraint_b13a3b7d | None |
When we run a quering method, the results are appended to an internal cache. By default we return the stored content, but we can choose to refresh the cache by providing refresh=True.
Collecting Insights
We can run all the analytical queries in the GraphEDA class by calling the run method.
This can be computationally intensive!
WARNING: The methods in this module can be computationally expensive. It is not recommended to use this module on massive Neo4j databases (i.e., nodes and relationships in the hundreds of millions)
%%capture
eda.run()
Now that we have our cache filled, let’s see if there are any isolated nodes in the database.
eda.disconnected_node_count()
0
No disconnected nodes is a good sign!
Reports
We can generate a report containing all the information we’ve gathered from our queries by calling create_eda_report.
Some of the sections can become quite lengthy, so there are arguments to control the data that is returned.
%%capture
eda.create_eda_report(include_disconnected_node_ids=True, include_unlabeled_node_ids=True, include_node_degrees=True, view_report=False)
eda.view_report(notebook=True)
Runway EDA Report
Database Information
| databaseName | databaseVersion | databaseEdition | APOCVersion | GDSVersion | |
|---|---|---|---|---|---|
| 0 | stackoverflow | 5.15.0 | enterprise | 5.15.1 | not installed |
Counts
| nodeCount | unlabeledNodeCount | disconnectedNodeCount | relationshipCount | |
|---|---|---|---|---|
| 0 | 6193 | 0 | 0 | 11540 |
Indexes
| id | name | state | populationPercent | type | entityType | labelsOrTypes | properties | indexProvider | owningConstraint | lastRead | readCount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | constraint_32ea8862 | ONLINE | 100 | RANGE | NODE | [‘Comment’] | [‘uuid’] | range-1.0 | constraint_32ea8862 | ||
| 1 | 13 | constraint_401df8db | ONLINE | 100 | RANGE | NODE | [‘Question’] | [‘uuid’] | range-1.0 | constraint_401df8db | ||
| 2 | 14 | constraint_64b1b1cf | ONLINE | 100 | RANGE | NODE | [‘Tag’] | [‘name’] | range-1.0 | constraint_64b1b1cf | ||
| 3 | 16 | constraint_7e29bbac | ONLINE | 100 | RANGE | NODE | [‘Answer’] | [‘uuid’] | range-1.0 | constraint_7e29bbac | ||
| 4 | 15 | constraint_b13a3b7d | ONLINE | 100 | RANGE | NODE | [‘User’] | [‘uuid’] | range-1.0 | constraint_b13a3b7d | ||
| 5 | 1 | index_343aff4e | ONLINE | 100 | LOOKUP | NODE | token-lookup-1.0 | |||||
| 6 | 2 | index_f7700477 | ONLINE | 100 | LOOKUP | RELATIONSHIP | token-lookup-1.0 |
Constraints
| id | name | type | entityType | labelsOrTypes | properties | ownedIndex | propertyType | |
|---|---|---|---|---|---|---|---|---|
| 0 | 20 | constraint_32ea8862 | UNIQUENESS | NODE | [‘Comment’] | [‘uuid’] | constraint_32ea8862 | |
| 1 | 18 | constraint_401df8db | UNIQUENESS | NODE | [‘Question’] | [‘uuid’] | constraint_401df8db | |
| 2 | 22 | constraint_64b1b1cf | UNIQUENESS | NODE | [‘Tag’] | [‘name’] | constraint_64b1b1cf | |
| 3 | 19 | constraint_7e29bbac | UNIQUENESS | NODE | [‘Answer’] | [‘uuid’] | constraint_7e29bbac | |
| 4 | 21 | constraint_b13a3b7d | UNIQUENESS | NODE | [‘User’] | [‘uuid’] | constraint_b13a3b7d |
Nodes Overview
Label Counts
| label | count | |
|---|---|---|
| 0 | Question | 1589 |
| 1 | Comment | 1396 |
| 2 | Answer | 1367 |
| 3 | User | 1365 |
| 4 | Tag | 476 |
Properties
| nodeLabels | propertyName | propertyTypes | mandatory | |
|---|---|---|---|---|
| 0 | [‘User’] | uuid | [‘Long’, ‘String’] | True |
| 1 | [‘User’] | display_name | [‘String’] | True |
| 2 | [‘Tag’] | name | [‘String’] | True |
| 3 | [‘Tag’] | link | [‘String’] | True |
| 4 | [‘Answer’] | uuid | [‘Long’] | True |
| 5 | [‘Answer’] | title | [‘String’] | True |
| 6 | [‘Answer’] | link | [‘String’] | True |
| 7 | [‘Answer’] | is_accepted | [‘Boolean’] | True |
| 8 | [‘Answer’] | body_markdown | [‘String’] | True |
| 9 | [‘Answer’] | score | [‘Long’] | True |
| 10 | [‘Comment’] | uuid | [‘Long’] | True |
| 11 | [‘Comment’] | link | [‘String’] | True |
| 12 | [‘Comment’] | score | [‘Long’] | True |
| 13 | [‘Question’] | uuid | [‘Long’] | True |
| 14 | [‘Question’] | title | [‘String’] | True |
| 15 | [‘Question’] | creation_date | [‘Long’] | True |
| 16 | [‘Question’] | accepted_answer_id | [‘Long’] | False |
| 17 | [‘Question’] | link | [‘String’] | True |
| 18 | [‘Question’] | view_count | [‘Long’] | True |
| 19 | [‘Question’] | answer_count | [‘Long’] | True |
| 20 | [‘Question’] | body_markdown | [‘String’] | True |
Relationships Overview
Type Counts
| relType | count | |
|---|---|---|
| 0 | TAGGED | 4425 |
| 1 | ASKED | 1589 |
| 2 | COMMENTED_ON | 1396 |
| 3 | COMMENTED | 1396 |
| 4 | ANSWERED | 1367 |
| 5 | PROVIDED | 1367 |
Properties
no relationship properties
Unlabeled Nodes
no unlabeled nodes data in cache
Disconnected Nodes
no disconnected nodes data in cache
Node Degrees
- Top 5 Ordered By outDegree
| nodeId | nodeLabel | inDegree | outDegree | |
|---|---|---|---|---|
| 0 | 5620 | [‘User’] | 0 | 318 |
| 1 | 2441 | [‘User’] | 0 | 193 |
| 2 | 2452 | [‘User’] | 0 | 178 |
| 3 | 2485 | [‘User’] | 0 | 144 |
| 4 | 2445 | [‘User’] | 0 | 138 |
Runway v0.12.0
Report Generated @ 2024-10-25 10:53:46.134954
We can also save the report to a Markdown file.
eda.save_report(file_name="outputs/stackoverflow_runway_report.md")