Neo4j Runway

Neo4j Runway is a Python library that simplifies the process of migrating your relational data into a graph. It provides tools that abstract communication with OpenAI to run discovery on your data and generate a data model, as well as tools to generate ingestion code and load your data into a Neo4j instance.

Key Features

Data Discovery: Harness OpenAI LLMs to provide valuable insights from your data
Graph Data Modeling: Utilize OpenAI and the Instructor Python library to create valid graph data models
Code Generation: Generate ingestion code to easily load your data
Data Ingestion: Load your data using Runway’s built-in implementation of PyIngest, Neo4j’s popular ingestion tool

Requirements

Runway uses Graphviz to visualize data models. To use this feature, install Graphviz.
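One quick way to confirm that Graphviz is visible to Python is to look for its `dot` executable on the PATH. This is a minimal sketch using only the standard library; `graphviz_available` is a hypothetical helper name and is not part of Runway.

```python
import shutil


def graphviz_available(executable="dot"):
    """Return True if the Graphviz layout executable is found on the PATH."""
    return shutil.which(executable) is not None


if graphviz_available():
    print("Graphviz found: data model visualization should work.")
else:
    print("Graphviz not found: install it before visualizing data models.")
```

If the check fails right after installing Graphviz, restarting the shell or notebook kernel so that the updated PATH is picked up is usually worth trying first.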

You’ll need a Neo4j instance to fully utilize Runway. Start a free cloud-hosted Aura instance or download the Neo4j Desktop app.

Get Running in Minutes

Follow the steps below or check out the Neo4j Runway end-to-end examples.

pip install neo4j-runway

Now let’s walk through a basic example.

Here we import the modules we’ll be using.

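import os
from neo4j_runway import Discovery, GraphDataModeler, PyIngest, UserInput
from neo4j_runway.code_generation import PyIngestConfigGenerator
from neo4j_runway.llm.openai import OpenAIDiscoveryLLM, OpenAIDataModelingLLM
# load_local_files is used below; this import path is an assumption, so check it against your installed version
from neo4j_runway.utils.data import load_local_files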

Discovery

Now we…

Define a general description of our data
Provide brief descriptions of the columns of interest
Provide any use cases we’d like our data model to address
Load our CSV file via Runway’s load_local_files function

data_directory = "../../../data/countries/"

data_dictionary = {
    'id': 'unique id for a country.',
    'name': 'the country name.',
    'phone_code': 'country area code.',
    'capital': 'the capital of the country.',
    'currency_name': "name of the country's currency.",
    'region': 'primary region of the country.',
    'subregion': 'subregion location of the country.',
    'timezones': 'timezones contained within the country borders.',
    'latitude': 'the latitude coordinate of the country center.',
    'longitude': 'the longitude coordinate of the country center.'
}

use_cases = [
    "Which region contains the most subregions?",
    "What currencies are most popular?",
    "Which countries share timezones?"
]

data = load_local_files(data_directory=data_directory,
                        data_dictionary=data_dictionary,
                        general_description="This is data on countries and their attributes.",
                        use_cases=use_cases,
                        include_files=["countries.csv"])
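Since the data dictionary keys are meant to name columns in the CSV, it can help to compare the two before loading. The sketch below is not part of Runway: `compare_columns` is a hypothetical helper, and the in-memory sample stands in for countries.csv so the example runs on its own.

```python
import csv
import io


def compare_columns(csv_file, data_dictionary):
    """Compare a CSV header row with the keys of a data dictionary.

    Returns a tuple (missing, undescribed):
      missing      - dictionary keys that are not columns in the file
      undescribed  - file columns that have no dictionary entry
    """
    header = next(csv.reader(csv_file))
    columns = set(header)
    keys = set(data_dictionary)
    return sorted(keys - columns), sorted(columns - keys)


# Small stand-in for countries.csv, using column names from the example above.
sample_csv = io.StringIO("id,name,capital\n1,Afghanistan,Kabul\n")
sample_dictionary = {
    "id": "unique id for a country.",
    "name": "the country name.",
    "region": "primary region of the country.",
}

missing, undescribed = compare_columns(sample_csv, sample_dictionary)
print("described but not in file:", missing)      # ['region']
print("in file but not described:", undescribed)  # ['capital']
```

With a real file, the same helper can be called on a handle opened with `open(path, newline="")`. Columns reported as undescribed are not necessarily a problem, since the dictionary only needs to cover the columns of interest.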

We can also preview our CSV data before running any processes.

data.tables[0].dataframe.head()

	id	name	phone_code	capital	currency_name	region	subregion	timezones	latitude	longitude
0	1	Afghanistan	93	Kabul	Afghan afghani	Asia	Southern Asia	[{zoneName:'Asia\/Kabul',gmtOffset:16200,gmtOf...	33.000000	65.0
1	2	Aland Islands	+358-18	Mariehamn	Euro	Europe	Northern Europe	[{zoneName:'Europe\/Mariehamn',gmtOffset:7200,...	60.116667	19.9
2	3	Albania	355	Tirana	Albanian lek	Europe	Southern Europe	[{zoneName:'Europe\/Tirane',gmtOffset:3600,gmt...	41.000000	20.0
3	4	Algeria	213	Algiers	Algerian dinar	Africa	Northern Africa	[{zoneName:'Africa\/Algiers',gmtOffset:3600,gm...	28.000000	3.0
4	5	American Samoa	+1-684	Pago Pago	US Dollar	Oceania	Polynesia	[{zoneName:'Pacific\/Pago_Pago',gmtOffset:-396...	-14.333333	-170.0
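The timezones column in this preview holds a serialized list rather than a plain value, so answering "Which countries share timezones?" outside the graph would first require extracting the zone names. The following is a rough sketch under the assumption that every entry follows the `zoneName:'...'` pattern visible above; `zone_names` is a hypothetical helper, and the sample strings are the truncated preview values used as they appear.

```python
import re
from collections import defaultdict

# Matches the zoneName:'...' fragments visible in the preview output.
ZONE_NAME = re.compile(r"zoneName:'([^']*)'")


def zone_names(raw):
    """Extract zone names from a raw timezones cell and unescape the slashes."""
    return [name.replace("\\/", "/") for name in ZONE_NAME.findall(raw)]


# Truncated cell values copied from the preview; the pattern does not need
# the full string in order to find the zone name.
preview = {
    "Afghanistan": r"[{zoneName:'Asia\/Kabul',gmtOffset:16200,gmtOf",
    "Albania": r"[{zoneName:'Europe\/Tirane',gmtOffset:3600,gmt",
}

countries_by_zone = defaultdict(list)
for country, raw in preview.items():
    for zone in zone_names(raw):
        countries_by_zone[zone].append(country)

print(dict(countries_by_zone))
```

Any zone that maps to more than one country in `countries_by_zone` is a shared timezone.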

Next, we initialize our discovery and data modeling LLMs. By default we use GPT-4o; this example passes explicit model names, GPT-4o mini for discovery and GPT-4o for data modeling. The OpenAI API key is defined in an environment variable.

llm_disc = OpenAIDiscoveryLLM(model_name='gpt-4o-mini-2024-07-18', model_params={"temperature": 0})
llm_dm = OpenAIDataModelingLLM(model_name='gpt-4o-2024-05-13', model_params={"temperature": 0.5})

And we run discovery on our data.

disc = Discovery(llm=llm_disc, data=data)
disc.run(show_result=True, notebook=True)

Discovery Output

Preliminary Analysis of Country Data

Overall Data Characteristics:

Data Size: The dataset contains 250 entries (countries) and 10 attributes.
Data Types: The attributes include integers, floats, and objects (strings). The presence of both numerical and categorical data allows for diverse analyses.
Missing Values:
- capital: 5 missing values (2% of the data)
- region: 2 missing values (0.8% of the data)
- subregion: 3 missing values (1.2% of the data)
- Other columns have no missing values.

Important Features:

id: Unique identifier for each country. It is uniformly distributed from 1 to 250.
name: Each country has a unique name, which is crucial for identification.
phone_code: There are 235 unique phone codes, indicating that some countries share the same code. This could be relevant for understanding regional telecommunications.
capital: The capital city is a significant attribute, but with 5 missing values, it may require attention during analysis.
currency_name: There are 161 unique currencies, with the Euro being the most common (35 occurrences). This suggests a potential clustering of countries using the same currency, which could be relevant for economic analyses.
region: There are 6 unique regions, with Africa having the highest frequency (60 countries). This could indicate a need to explore regional characteristics further.
subregion: 22 unique subregions exist, with the Caribbean being the most frequent (28 occurrences). This suggests that some regions have more subdivisions than others.
timezones: The dataset contains 245 unique timezones, indicating that many countries share timezones. This could be useful for understanding global time coordination.

Use Case Insights:

Regions and Subregions: To determine which region contains the most subregions, we can analyze the region and subregion columns. The region with the highest number of unique subregions will be identified.
Popular Currencies: The currency_name column can be analyzed to find the most frequently occurring currencies, highlighting economic ties between countries.
Shared Timezones: The timezones column can be examined to identify countries that share the same timezone, which may have implications for trade, communication, and travel.

Conclusion:

The dataset provides a rich source of information about countries, their geographical locations, and economic attributes. The most important features for analysis include region, subregion, currency_name, and timezones, as they directly relate to the use cases outlined. Addressing the missing values in capital, region, and subregion will also be essential for a comprehensive analysis.
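The use case insights in the discovery output can be sanity-checked with a few lines of plain Python. This is only an illustrative sketch, not something Runway produces: it uses the five rows from the data preview shown earlier, so its results describe that sample and not the full 250-row dataset.

```python
from collections import Counter, defaultdict

# The five preview rows: (name, currency_name, region, subregion).
rows = [
    ("Afghanistan", "Afghan afghani", "Asia", "Southern Asia"),
    ("Aland Islands", "Euro", "Europe", "Northern Europe"),
    ("Albania", "Albanian lek", "Europe", "Southern Europe"),
    ("Algeria", "Algerian dinar", "Africa", "Northern Africa"),
    ("American Samoa", "US Dollar", "Oceania", "Polynesia"),
]

# Use case 1: count distinct subregions per region.
subregions_by_region = defaultdict(set)
for _name, _currency, region, subregion in rows:
    subregions_by_region[region].add(subregion)

top_region = max(subregions_by_region, key=lambda r: len(subregions_by_region[r]))

# Use case 2: count how often each currency occurs.
currency_counts = Counter(currency for _name, currency, _region, _subregion in rows)

print(top_region, len(subregions_by_region[top_region]))  # Europe 2
print(currency_counts.most_common(3))
```

On the full file, the same logic applies once the rows are read from countries.csv; rows with missing region or subregion values would need to be skipped or handled explicitly.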

Data Modeling

We can now use our Discovery object to provide context to the LLM for data model generation. Note that we don’t pass our actual data to the modeler, only the insights gathered during discovery.

gdm = GraphDataModeler(llm=llm_dm, discovery=disc)

We may now generate our first graph data model.

gdm.create_initial_model()

If we have graphviz installed, we can take a look at our model.

gdm.current_model.visualize()

Our data model seems to address the three use cases we’d like answered:

Which region contains the most subregions?
What currencies are most popular?
Which countries share timezones?

If we would like the data model modified, we can ask the LLM to make changes.

gdm.iterate_model(corrections="Create a Capital node from the capital property.")
gdm.current_model.visualize()

Code Generation

We can now use our data model to generate some ingestion code.

gen = PyIngestConfigGenerator(data_model=gdm.current_model,
                              username=os.environ.get("NEO4J_USERNAME"),
                              password=os.environ.get("NEO4J_PASSWORD"),
                              uri=os.environ.get("NEO4J_URI"),
                              database=os.environ.get("NEO4J_DATABASE"),
                              file_directory=data_directory,
                              source_name="countries.csv")

pyingest_yaml = gen.generate_config_string()
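Because `os.environ.get` returns None for an unset variable, a missing credential may only surface later, at ingestion time. A small guard such as the one sketched here can fail early instead. `neo4j_settings` is a hypothetical helper and not part of Runway, and the values in `example_env` are placeholders for illustration only.

```python
import os

REQUIRED_VARS = ("NEO4J_USERNAME", "NEO4J_PASSWORD", "NEO4J_URI", "NEO4J_DATABASE")


def neo4j_settings(env=None):
    """Collect Neo4j connection settings, raising if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "username": env["NEO4J_USERNAME"],
        "password": env["NEO4J_PASSWORD"],
        "uri": env["NEO4J_URI"],
        "database": env["NEO4J_DATABASE"],
    }


# Placeholder values so the sketch runs without touching the real environment.
example_env = {
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "example-password",
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_DATABASE": "neo4j",
}
settings = neo4j_settings(example_env)
print(sorted(settings))
```

The dictionary keys mirror the keyword arguments used in the PyIngestConfigGenerator call above, so the result of `neo4j_settings()` could be unpacked into that call with `**settings`.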

Ingestion

We will use the generated PyIngest YAML config to ingest our data into our Neo4j instance.

PyIngest(config=pyingest_yaml, verbose=False)

We can also save the config as a .yaml file and use it with the original PyIngest.

gen.generate_config_yaml(file_name="countries.yaml")

Here’s a snapshot of our new graph!

Limitations

Runway is currently in beta and under rapid development. Please raise GitHub issues and provide feedback on any features you’d like. The following are some of the current limitations:

Nodes may only have a single label
Only uniqueness and key constraints are supported
Only OpenAI models may be used at this time
Runway only supports ingesting local files, though it supports code generation for other ingest methods