Version Data

The log_data_version and log_s3_data_version helpers log the data location and a hash of the data to Neptune. Both values are stored as experiment properties and can be viewed in the Details section of an experiment and as columns in the experiment dashboard.

Check this example project to see more.

Prerequisites

Initialize Neptune

[ ]:
import neptune
neptune.init('USER_NAME/PROJECT_NAME')

File data version

[ ]:
from neptunecontrib.versioning.data import log_data_version

FILEPATH = '/path/to/data/my_data.csv'
with neptune.create_experiment():
    log_data_version(FILEPATH)
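
To give a feel for what a "data hash" of a file can look like, here is a minimal sketch that fingerprints a file's contents with MD5 from the standard library. The function name file_md5 and the chunk size are assumptions made for illustration; this is not necessarily how log_data_version computes its value.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's contents.

    Hypothetical helper for illustration only; the hash that
    log_data_version stores may be computed differently.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Any change to the file's bytes changes the digest, so two experiments that report the same value were most likely run on identical data.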

Folder data version

[ ]:
from neptunecontrib.versioning.data import log_data_version

DIRPATH = '/path/to/data/folder'
with neptune.create_experiment():
    log_data_version(DIRPATH)
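
A folder version has to summarize many files at once. The sketch below shows one possible way to do that: walk the tree in a deterministic order and feed every relative path and file content into a single hash. The name dir_md5 and the choice of MD5 are assumptions for illustration, not a description of what log_data_version does internally.

```python
import hashlib
import os


def dir_md5(dirpath, chunk_size=1 << 20):
    """Return one MD5 hex digest covering every file under dirpath.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.md5()
    for root, dirs, files in os.walk(dirpath):
        dirs.sort()  # sort in place so os.walk visits subfolders in a fixed order
        for name in sorted(files):
            filepath = os.path.join(root, name)
            # Include the relative path so renaming a file changes the version.
            relpath = os.path.relpath(filepath, dirpath).replace(os.sep, '/')
            digest.update(relpath.encode('utf-8'))
            with open(filepath, 'rb') as f:
                for chunk in iter(lambda: f.read(chunk_size), b''):
                    digest.update(chunk)
    return digest.hexdigest()
```

Sorting both the subfolders and the file names matters here: without it, the same folder could produce different digests on different machines.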

S3 bucket data version

We can log the version of a particular key, which is similar to file versioning.

[ ]:
from neptunecontrib.versioning.data import log_s3_data_version

BUCKET = 'my-bucket'
PATH = 'training_dataset.csv'  # a single key in the bucket
with neptune.create_experiment():
    log_s3_data_version(BUCKET, PATH)

We can also log a combined version of all the keys that start with a particular string, which is similar to versioning a directory.

[ ]:
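from neptunecontrib.versioning.data import log_s3_data_version

BUCKET = 'my-bucket'
PATH = 'train_dir/'  # every key starting with this string is included
with neptune.create_experiment():
    log_s3_data_version(BUCKET, PATH)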
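
The idea of a combined version for all keys under a prefix can be sketched without touching S3 at all. The example below assumes you already have a mapping from key to ETag (for instance, built from a bucket listing) and folds the matching entries into one digest. The function combined_version and the sample data are hypothetical; the scheme used by log_s3_data_version may differ.

```python
import hashlib


def combined_version(etags_by_key, prefix):
    """Fold the ETags of all keys starting with prefix into one MD5 digest.

    Hypothetical helper for illustration only; etags_by_key is assumed
    to map S3 key names to their ETag strings.
    """
    digest = hashlib.md5()
    for key in sorted(etags_by_key):  # fixed order gives a stable result
        if key.startswith(prefix):
            digest.update(key.encode('utf-8'))
            digest.update(etags_by_key[key].encode('utf-8'))
    return digest.hexdigest()


# Made-up listing used only to demonstrate the call.
listing = {
    'train_dir/part_0.csv': 'etag-a',
    'train_dir/part_1.csv': 'etag-b',
    'test_dir/part_0.csv': 'etag-c',
}
train_version = combined_version(listing, 'train_dir/')
```

Keys outside the prefix, such as test_dir/part_0.csv above, do not affect the result, so the version only changes when something under train_dir/ changes.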

Prefixing

If you want to track multiple data sources, give each one its own prefix when logging so that their properties get distinct names. For example:

[ ]:
from neptunecontrib.versioning.data import log_data_version

FILEPATH_TABLE_1 = '/path/to/data/my_table_1.csv'
FILEPATH_TABLE_2 = '/path/to/data/my_table_2.csv'

with neptune.create_experiment():
    log_data_version(FILEPATH_TABLE_1, prefix='table_1_')
    log_data_version(FILEPATH_TABLE_2, prefix='table_2_')
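
To see why the prefix matters, here is a small stand-in that stores properties in a plain dictionary. It assumes the helper writes two properties named data_path and data_version with the prefix prepended; those names, the function fake_log_data_version, and the version strings are assumptions for illustration, so check the properties of a real experiment for the exact names.

```python
def fake_log_data_version(properties, path, version, prefix=''):
    """Store a path and version under prefixed property names.

    Hypothetical stand-in for illustration; it writes to a dict
    instead of a Neptune experiment.
    """
    properties[prefix + 'data_path'] = path
    properties[prefix + 'data_version'] = version


# Without prefixes, the second call overwrites the first.
no_prefix = {}
fake_log_data_version(no_prefix, '/path/to/data/my_table_1.csv', 'v1')
fake_log_data_version(no_prefix, '/path/to/data/my_table_2.csv', 'v2')

# With prefixes, both data sources are kept side by side.
with_prefix = {}
fake_log_data_version(with_prefix, '/path/to/data/my_table_1.csv', 'v1', prefix='table_1_')
fake_log_data_version(with_prefix, '/path/to/data/my_table_2.csv', 'v2', prefix='table_2_')
```

In this sketch, no_prefix ends up with two entries describing only the second table, while with_prefix holds four entries, two per table.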