End To End Demo
This notebooks demonstrates the data flow of generating a graph from a CSV file using Runway v0.9.0
import os
import pandas as pd
from dotenv import load_dotenv
from neo4j_runway import Discovery, GraphDataModeler, PyIngest, UserInput
from neo4j_runway.code_generation import PyIngestConfigGenerator
from neo4j_runway.llm.openai import OpenAIDataModelingLLM, OpenAIDiscoveryLLM
from neo4j_runway.utils import test_database_connection
load_dotenv()
True
Load and Describe Data
The USER_GENERATED_INPUT variable contains a general discription and feature descriptions for each feature we’d like to use in our graph.
USER_GENERATED_INPUT = UserInput(
general_description='This is data on different countries.',
column_descriptions={
'id': 'unique id for a country.',
'name': 'the country name.',
'phone_code': 'country area code.',
'capital': 'the capital of the country.',
'currency_name': "name of the country's currency.",
'region': 'primary region of the country.',
'subregion': 'subregion location of the country.',
'timezones': 'timezones contained within the country borders.',
'latitude': 'the latitude coordinate of the country center.',
'longitude': 'the longitude coordinate of the country center.'
},
use_cases=[
"Which region contains the most subregions?",
"What currencies are most popular?",
"Which countries share timezones?"
]
)
data = pd.read_csv("data/csv/countries.csv")
data.head()
id | name | iso3 | iso2 | numeric_code | phone_code | capital | currency | currency_name | currency_symbol | tld | native | region | subregion | timezones | latitude | longitude | emoji | emojiU | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Afghanistan | AFG | AF | 4 | 93 | Kabul | AFN | Afghan afghani | ؋ | .af | افغانستان | Asia | Southern Asia | [{zoneName:'Asia\/Kabul',gmtOffset:16200,gmtOf... | 33.000000 | 65.0 | 🇦🇫 | U+1F1E6 U+1F1EB |
1 | 2 | Aland Islands | ALA | AX | 248 | +358-18 | Mariehamn | EUR | Euro | € | .ax | Åland | Europe | Northern Europe | [{zoneName:'Europe\/Mariehamn',gmtOffset:7200,... | 60.116667 | 19.9 | 🇦🇽 | U+1F1E6 U+1F1FD |
2 | 3 | Albania | ALB | AL | 8 | 355 | Tirana | ALL | Albanian lek | Lek | .al | Shqipëria | Europe | Southern Europe | [{zoneName:'Europe\/Tirane',gmtOffset:3600,gmt... | 41.000000 | 20.0 | 🇦🇱 | U+1F1E6 U+1F1F1 |
3 | 4 | Algeria | DZA | DZ | 12 | 213 | Algiers | DZD | Algerian dinar | دج | .dz | الجزائر | Africa | Northern Africa | [{zoneName:'Africa\/Algiers',gmtOffset:3600,gm... | 28.000000 | 3.0 | 🇩🇿 | U+1F1E9 U+1F1FF |
4 | 5 | American Samoa | ASM | AS | 16 | +1-684 | Pago Pago | USD | US Dollar | $ | .as | American Samoa | Oceania | Polynesia | [{zoneName:'Pacific\/Pago_Pago',gmtOffset:-396... | -14.333333 | -170.0 | 🇦🇸 | U+1F1E6 U+1F1F8 |
Initialize LLMs
We now initialize the LLMs to use in data discovery and data model creation.
llm_disc = OpenAIDiscoveryLLM(model_name='gpt-4o-2024-05-13', model_params={"temperature": 0})
llm_dm = OpenAIDataModelingLLM(model_name='gpt-4o-2024-05-13', model_params={"temperature": 0.5})
Discovery
We now load the above data into a Discovery object.
disc = Discovery(llm=llm_disc, user_input=USER_GENERATED_INPUT, data=data)
disc.run(show_result=True, notebook=True)
Preliminary Analysis of Country Data
Overall Details
- Data Completeness:
- The dataset contains 250 entries with 10 features.
- Most features are complete, but there are some missing values:
capital
: 5 missing valuesregion
: 2 missing valuessubregion
: 3 missing values
- Data Types:
id
: Integername
,phone_code
,capital
,currency_name
,region
,subregion
,timezones
: String (Object)latitude
,longitude
: Float
- Unique Values:
name
: 250 unique values (each country is unique)phone_code
: 235 unique values (some countries share phone codes)capital
: 244 unique values (one capital is shared by two countries)currency_name
: 161 unique values (some currencies are shared by multiple countries)region
: 6 unique valuessubregion
: 22 unique valuestimezones
: 245 unique values (some timezones are shared by multiple countries)
Use Case Analysis
- Which region contains the most subregions?
- The
region
andsubregion
columns are crucial for this analysis. - We need to count the number of unique subregions within each region.
- The
- What currencies are most popular?
- The
currency_name
column is essential for this analysis. - We need to count the frequency of each currency to determine the most popular ones.
- The
- Which countries share timezones?
- The
timezones
column is key for this analysis. - We need to identify countries that have overlapping timezones.
- The
Important Features
- Region and Subregion:
- These features help in understanding the geographical and political divisions of the countries.
- They are essential for analyzing the distribution of subregions within regions.
- Currency Name:
- This feature is important for economic analysis and understanding the monetary systems of different countries.
- It helps in identifying the most commonly used currencies.
- Timezones:
- This feature is crucial for understanding the temporal aspects and time-related relationships between countries.
- It helps in identifying countries that share the same timezones, which can be important for various applications like business operations, travel, and communication.
- Capital:
- Although not directly related to the use cases, the capital city is an important feature for general knowledge and can be useful in various analyses.
- Latitude and Longitude:
- These features provide geographical coordinates, which can be useful for spatial analysis and mapping.
Summary
- The dataset is relatively complete with a few missing values in the
capital
,region
, andsubregion
columns. - The most important features for the given use cases are
region
,subregion
,currency_name
, andtimezones
. - Additional features like
latitude
,longitude
, andcapital
can provide valuable context and support further analyses.
Next steps would involve:
- Cleaning the data to handle missing values.
- Aggregating and analyzing the data to answer the specific use cases.
- Considering how to represent these relationships in a graph data model for Neo4j.
Data Modeling
We can now use our Discovery object to provide context to the LLM for data model generation. We don’t need the discovery information for this step to work, but it provides much better models.
gdm = GraphDataModeler(
llm=llm_dm,
discovery=disc
)
We now generate our first pass data model.
gdm.create_initial_model()
validation failed
performing chain of thought process for error fix recommendations...
recieved a valid response
DataModel(nodes=[Node(label='Country', properties=[Property(name='id', type='int', csv_mapping='id', csv_mapping_other=None, is_unique=True, part_of_key=False), Property(name='name', type='str', csv_mapping='name', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='phoneCode', type='str', csv_mapping='phone_code', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='capital', type='str', csv_mapping='capital', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='latitude', type='float', csv_mapping='latitude', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='longitude', type='float', csv_mapping='longitude', csv_mapping_other=None, is_unique=False, part_of_key=False)], csv_name=''), Node(label='Region', properties=[Property(name='region', type='str', csv_mapping='region', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Subregion', properties=[Property(name='subregion', type='str', csv_mapping='subregion', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Currency', properties=[Property(name='currencyName', type='str', csv_mapping='currency_name', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Timezone', properties=[Property(name='timezones', type='str', csv_mapping='timezones', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name='')], relationships=[Relationship(type='BELONGS_TO_REGION', properties=[], source='Country', target='Region', csv_name=''), Relationship(type='BELONGS_TO_SUBREGION', properties=[], source='Country', target='Subregion', csv_name=''), Relationship(type='USES_CURRENCY', properties=[], source='Country', target='Currency', csv_name=''), Relationship(type='HAS_TIMEZONE', properties=[], source='Country', target='Timezone', csv_name=''), Relationship(type='HAS_SUBREGION', properties=[], source='Region', target='Subregion', csv_name='')], metadata=None)
gdm.current_model
DataModel(nodes=[Node(label='Country', properties=[Property(name='id', type='int', csv_mapping='id', csv_mapping_other=None, is_unique=True, part_of_key=False), Property(name='name', type='str', csv_mapping='name', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='phoneCode', type='str', csv_mapping='phone_code', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='capital', type='str', csv_mapping='capital', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='latitude', type='float', csv_mapping='latitude', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='longitude', type='float', csv_mapping='longitude', csv_mapping_other=None, is_unique=False, part_of_key=False)], csv_name=''), Node(label='Region', properties=[Property(name='region', type='str', csv_mapping='region', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Subregion', properties=[Property(name='subregion', type='str', csv_mapping='subregion', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Currency', properties=[Property(name='currencyName', type='str', csv_mapping='currency_name', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Timezone', properties=[Property(name='timezones', type='str', csv_mapping='timezones', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name='')], relationships=[Relationship(type='BELONGS_TO_REGION', properties=[], source='Country', target='Region', csv_name=''), Relationship(type='BELONGS_TO_SUBREGION', properties=[], source='Country', target='Subregion', csv_name=''), Relationship(type='USES_CURRENCY', properties=[], source='Country', target='Currency', csv_name=''), Relationship(type='HAS_TIMEZONE', properties=[], source='Country', target='Timezone', csv_name=''), Relationship(type='HAS_SUBREGION', properties=[], source='Region', target='Subregion', csv_name='')], metadata=None)
gdm.current_model.visualize()
We can clean this up a bit, so let’s prompt the LLM to make some corrections.
gdm.iterate_model(user_corrections="""Remove The relationship between Country and Region.""")
recieved a valid response
DataModel(nodes=[Node(label='Country', properties=[Property(name='id', type='int', csv_mapping='id', csv_mapping_other=None, is_unique=True, part_of_key=False), Property(name='name', type='str', csv_mapping='name', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='phoneCode', type='str', csv_mapping='phone_code', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='capital', type='str', csv_mapping='capital', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='latitude', type='float', csv_mapping='latitude', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='longitude', type='float', csv_mapping='longitude', csv_mapping_other=None, is_unique=False, part_of_key=False)], csv_name=''), Node(label='Subregion', properties=[Property(name='subregion', type='str', csv_mapping='subregion', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Currency', properties=[Property(name='currencyName', type='str', csv_mapping='currency_name', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Timezone', properties=[Property(name='timezones', type='str', csv_mapping='timezones', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Region', properties=[Property(name='region', type='str', csv_mapping='region', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name='')], relationships=[Relationship(type='BELONGS_TO_SUBREGION', properties=[], source='Country', target='Subregion', csv_name=''), Relationship(type='USES_CURRENCY', properties=[], source='Country', target='Currency', csv_name=''), Relationship(type='HAS_TIMEZONE', properties=[], source='Country', target='Timezone', csv_name=''), Relationship(type='HAS_SUBREGION', properties=[], source='Region', target='Subregion', csv_name='')], metadata=None)
gdm.current_model
DataModel(nodes=[Node(label='Country', properties=[Property(name='id', type='int', csv_mapping='id', csv_mapping_other=None, is_unique=True, part_of_key=False), Property(name='name', type='str', csv_mapping='name', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='phoneCode', type='str', csv_mapping='phone_code', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='capital', type='str', csv_mapping='capital', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='latitude', type='float', csv_mapping='latitude', csv_mapping_other=None, is_unique=False, part_of_key=False), Property(name='longitude', type='float', csv_mapping='longitude', csv_mapping_other=None, is_unique=False, part_of_key=False)], csv_name=''), Node(label='Subregion', properties=[Property(name='subregion', type='str', csv_mapping='subregion', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Currency', properties=[Property(name='currencyName', type='str', csv_mapping='currency_name', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Timezone', properties=[Property(name='timezones', type='str', csv_mapping='timezones', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name=''), Node(label='Region', properties=[Property(name='region', type='str', csv_mapping='region', csv_mapping_other=None, is_unique=True, part_of_key=False)], csv_name='')], relationships=[Relationship(type='BELONGS_TO_SUBREGION', properties=[], source='Country', target='Subregion', csv_name=''), Relationship(type='USES_CURRENCY', properties=[], source='Country', target='Currency', csv_name=''), Relationship(type='HAS_TIMEZONE', properties=[], source='Country', target='Timezone', csv_name=''), Relationship(type='HAS_SUBREGION', properties=[], source='Region', target='Subregion', csv_name='')], metadata=None)
gdm.current_model.visualize()
We can now create some ingestion code to get our data into our database.
Ingestion Code Generation
We can provide our credentials here in this step if we plan on using PyIngest to load our data. This will inject our credentials into the generated YAML file. If we leave the credential fields blank, then we can just fill in the blanks in the generated YAML file later.
gen = PyIngestConfigGenerator(data_model=gdm.current_model,
username=os.environ.get("NEO4J_USERNAME"),
password=os.environ.get("NEO4J_PASSWORD"),
uri=os.environ.get("NEO4J_URI"),
database=os.environ.get("NEO4J_DATABASE"),
file_directory="data/csv/", csv_name="countries.csv")
pyingest_yaml = gen.generate_config_string()
print(pyingest_yaml)
server_uri: bolt://localhost:7687
admin_user: neo4j
admin_pass: password
database: neo4j
basepath: ./
pre_ingest:
- CREATE CONSTRAINT country_id IF NOT EXISTS FOR (n:Country) REQUIRE n.id IS UNIQUE;
- CREATE CONSTRAINT subregion_subregion IF NOT EXISTS FOR (n:Subregion) REQUIRE n.subregion IS UNIQUE;
- CREATE CONSTRAINT currency_currencyname IF NOT EXISTS FOR (n:Currency) REQUIRE n.currencyName IS UNIQUE;
- CREATE CONSTRAINT timezone_timezones IF NOT EXISTS FOR (n:Timezone) REQUIRE n.timezones IS UNIQUE;
- CREATE CONSTRAINT region_region IF NOT EXISTS FOR (n:Region) REQUIRE n.region IS UNIQUE;
files:
- chunk_size: 100
cql: |-
WITH $dict.rows AS rows
UNWIND rows AS row
MERGE (n:Country {id: toIntegerOrNull(row.id)})
SET n.name = row.name, n.phoneCode = row.phone_code, n.capital = row.capital, n.latitude = toFloatOrNull(row.latitude), n.longitude = toFloatOrNull(row.longitude)
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows AS row
MERGE (n:Subregion {subregion: row.subregion})
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows AS row
MERGE (n:Currency {currencyName: row.currency_name})
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows AS row
MERGE (n:Timezone {timezones: row.timezones})
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows AS row
MERGE (n:Region {region: row.region})
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows as row
MATCH (source:Country {id: toIntegerOrNull(row.id)})
MATCH (target:Subregion {subregion: row.subregion})
MERGE (source)-[n:BELONGS_TO_SUBREGION]->(target)
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows as row
MATCH (source:Country {id: toIntegerOrNull(row.id)})
MATCH (target:Currency {currencyName: row.currency_name})
MERGE (source)-[n:USES_CURRENCY]->(target)
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows as row
MATCH (source:Country {id: toIntegerOrNull(row.id)})
MATCH (target:Timezone {timezones: row.timezones})
MERGE (source)-[n:HAS_TIMEZONE]->(target)
url: $BASE/data/csv/countries.csv
- chunk_size: 100
cql: |
WITH $dict.rows AS rows
UNWIND rows as row
MATCH (source:Region {region: row.region})
MATCH (target:Subregion {subregion: row.subregion})
MERGE (source)-[n:HAS_SUBREGION]->(target)
url: $BASE/data/csv/countries.csv
Ingest Data
We can use the generated yaml string above to orchestrate the data loading via a modified PyIngest function. First let’s confirm our connection though.
test_database_connection(credentials={"username": os.environ.get("NEO4J_USERNAME"), "password": os.environ.get("NEO4J_PASSWORD"), "uri": os.environ.get("NEO4J_URI")})
{'valid': True, 'message': 'Connection and Auth Verified!'}
PyIngest(config=pyingest_yaml, dataframe=data)
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:47.801049
loading... 1 2024-08-08 13:30:47.880631
{} : Completed file 2024-08-08 13:30:47.903632
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:47.904284
loading... 1 2024-08-08 13:30:47.943986
{} : Completed file 2024-08-08 13:30:47.957237
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:47.957757
loading... 1 2024-08-08 13:30:47.987587
{} : Completed file 2024-08-08 13:30:48.007453
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:48.007837
loading... 1 2024-08-08 13:30:48.038595
{} : Completed file 2024-08-08 13:30:48.059046
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:48.059566
loading... 1 2024-08-08 13:30:48.087154
{} : Completed file 2024-08-08 13:30:48.099672
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:48.100013
loading... 1 2024-08-08 13:30:48.134636
{} : Completed file 2024-08-08 13:30:48.158952
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:48.159475
loading... 1 2024-08-08 13:30:48.194488
{} : Completed file 2024-08-08 13:30:48.214928
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:48.215646
loading... 1 2024-08-08 13:30:48.266017
{} : Completed file 2024-08-08 13:30:48.285937
File {} .//data/csv/countries.csv
loading... 0 2024-08-08 13:30:48.286428
loading... 1 2024-08-08 13:30:48.321062
{} : Completed file 2024-08-08 13:30:48.333758
If we check our database we can see that we’ve ingested our CSV according to the data model we’ve created!