Discovery

from neo4j_runway import Discovery

The Discovery module that handles summarization and discovery generation via Pandas and an optional LLM.

Attributes
----------
llm : BaseDiscoveryLLM
    The LLM instance used to generate data discovery.
user_input : UserInput
    User provided descriptions of the data.
    A class containing user provided information about
    the data.
data : TableCollection
    The data contained in a TableCollection.
    All data provided to the Discovery constructor is
    converted to a Table and placed in a TableCollection
    class.

Class Methods

init

The Discovery module that handles summarization and discovery generation via Pandas and an optional LLM.

Parameters
----------
data : Union[pd.DataFrame, Table, TableCollection]
    The data to run discovery on. Can be either a Pandas
    DataFrame, Runway Table or Runway TableCollection.
    Multi file inputs should be provided via the
    TableCollection class.
    Single file inputs may be provided as a DataFrame or
    Runway Table. They will be placed in a
    TableCollection class upon initialization of the
    Discovery class.
llm : LLM, optional
    The LLM instance used to generate data discovery.
    If running discovery for multiple files,
    it is recommended to use an async compatible LLM and
    use the `run_async` method.
    Not required if only interested in generating Pandas
    summaries. By default None.
user_input : Union[Dict[str, str], UserInput]
    User provided descriptions of the data.
    If a dictionary, then should contain the keys
    "general_description" and all desired columns.
    This is only necessary if providing a Pandas
    DataFrame as data input. Otherwise it will be
    ignored. By default = dict()

run

Run the discovery process on the provided data. This method is compatible with non-async LLM classes. If using an async LLM, please use the run_async method instead. Access generated discovery with the .view_discovery() method of the Discovery class.

If running multi-file discovery, the parameter priority
    is as follows:
1. custom_batches
2. bulk_process
3. num_calls
4. batch_size

If more than one of the above is provided, the highest
    priority will overwrite any others.

Parameters
----------
show_result : bool, optional
    Whether to print the final generated discovery upon
    retrieval. By default True
notebook : bool, optional
    Whether code is executed in a notebook. Affects the
    result print formatting. By default True
ignore_files : List[str], optional
    A list of files to ignore. For multi-file input. By
    default list()
batch_size : int, optional
    The number of files to include in a discovery call.
    For multi-file input. By default 1
bulk_process : bool, optional
    Whether to include all files in a single batch. For
    multi-file input. By default False
num_calls : Optional[int], optional
    The max number of LLM calls to make during the
    discovery process. For multi-file input. By default
    None
custom_batches : Optional[List[List[str]]], optional
    A list of custom batches to run discovery on. For
    multi-file input. By default None
pandas_only : bool, optional
    Whether to only run Pandas summary generation and
    skip LLM calls. By default False

Raises
------
RuntimeError
    If an async LLM is provided to the Discovery
    constructor.
PandasDataSummariesNotGeneratedError
    If Pandas summaries are unable to be generated.

run_async

Run the discovery process on the provided data asynchronously. This method is compatible with async LLM classes. If using a non async LLM, please use the run method instead. Access generated discovery with the .view_discovery() method of the Discovery class.

If running multi-file discovery, the parameter priority
    is as follows:
1. custom_batches
2. bulk_process
3. num_calls
4. batch_size

If more than one of the above is provided, the highest
    priority will overwrite any others.

Parameters
----------
show_result : bool, optional
    Whether to print the final generated discovery upon
    retrieval. By default True
notebook : bool, optional
    Whether code is executed in a notebook. Affects the
    result print formatting. By default True
ignore_files : List[str], optional
    A list of files to ignore. For multi-file input. By
    default list()
batch_size : int, optional
    The number of files to include in a discovery call.
    For multi-file input. By default 1
bulk_process : bool, optional
    Whether to include all files in a single batch. For
    multi-file input. By default False
num_calls : Optional[int], optional
    The max number of LLM calls to make during the
    discovery process. For multi-file input. By default
    None
custom_batches : Optional[List[List[str]]], optional
    A list of custom batches to run discovery on. For
    multi-file input. By default None

Raises
------
RuntimeError
    If a non async LLM is provided to the Discovery
    constructor.

to_markdown

Output findings to a .md file.

Parameters
----------
file_dir : str, optional
    The directory to save files to, by default "./"
file_name : str, optional
    'all' to export all data, 'final' to export only
    final discovery result, file name to export the
    desired file only, by default "all"
include_pandas : bool, optional
    Whether to include the Pandas summaries, by default
    True

to_txt

Output findings to a .txt file.

Parameters
----------
file_dir : str, optional
    The directory to save files to, by default "./"
file_name : str, optional
    'all' to export all data, 'final' to export only
    final discovery result, file name to export the
    desired file only, by default "all"
include_pandas : bool, optional
    Whether to include the Pandas summaries, by default
    True

view_discovery

Print the discovery information of the provided file. If no file_name is provided, then displays the summarized final discovery.

Parameters
----------
file_name : str, optional
    The file to display discovery. If not provided, then
    displays the summarized final discovery. By default
    = None
notebook : bool, optional
    Whether executing in a notebook, by default True

Class Properties

discovery

The final generated discovery for the data.

Returns
-------
str
    The `discovery` attribute of the `data` attribute.

is_multifile

Whether data is multi-file or not.

Returns
-------
bool
    True if multi-file detected, else False