Validate the Dataset in Binder Service

Binder is an online service for building and running computational environments in the browser, which can be used to reproduce and demonstrate research results hosted by code repositories.

Source: Depositar Docs

Introduction

In addition to using the Data Validation Tool to check the quality of a dataset, users of the depositar can upload a modified validation program together with the dataset. Once the Binder service has been opened, this program can be run to validate the dataset.

Note

The validation program uploaded for use in the Binder service only validates the dataset; it does not generate or modify the dataset specification. Users need to prepare the dataset specification in advance and upload it to the dataset along with the data.
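Because the specification has to be uploaded alongside the data, a quick check for both files before running the validation can save a failed run. The following is a minimal sketch, not part of the Data Validation Tool: the helper name missing_inputs is hypothetical, and it assumes the naming used by the validation program on this page, where the specification for data.csv is stored next to it as data.csv.schema.yaml.

```python
from pathlib import Path


def missing_inputs(data_filename, directory='.'):
    """Return the required input files that are absent from `directory`.

    Assumption: the specification for `data.csv` is named
    `data.csv.schema.yaml` and sits in the same directory.
    """
    base = Path(directory)
    required = [data_filename, f'{data_filename}.schema.yaml']
    # Keep only the names that do not exist as regular files.
    return [name for name in required if not (base / name).is_file()]
```

Calling missing_inputs('data.csv') in the notebook returns an empty list when both the data file and its specification are present, and otherwise lists the files that still need to be uploaded.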

Implementation

The validation code of the Data Validation Tool is modified and rewritten as an .ipynb file. It prints the messages generated by the system for the user's reference and also appends each message to a validation result file whose name ends in _validate_report.txt (data.csv_validate_report.txt in the code below).

from frictionless import validate

data_filename = 'data.csv'
yaml_filename = f'{data_filename}.schema.yaml'
report_filename = f'{data_filename}_validate_report.txt'

def message(text):
    """Print a message and append it as a new line to the report file."""
    print(text)
    with open(report_filename, 'a', encoding='utf-8') as file:
        file.write(f'{text}\n')

def report_errors(report):
    """Output the title and message of each error in the first task."""
    for error in report.tasks[0].errors:
        message(f'{error.title}:\n{error.message}')

# Create the report file, or empty it if it already exists.
with open(report_filename, 'w', encoding='utf-8'):
    pass

# Validate the dataset specification first.
yaml_report = validate(yaml_filename)

if yaml_report.valid:
    # The specification is valid, so validate the data against it.
    data_report = validate(data_filename, schema=yaml_filename)
    error_num = data_report.stats['errors']
    if error_num == 0:
        message('There are no errors in the dataset.')
    elif error_num == 1:
        message('There is 1 error in the dataset:')
    else:
        message(f'There are {error_num} errors in the dataset:')
    report_errors(data_report)
else:
    message('The .yaml file is not valid:')
    report_errors(yaml_report)
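After the notebook has run, the result file can also be checked programmatically, for example in a later notebook cell. The sketch below is an illustration only: the helper name dataset_passed and the constant NO_ERRORS_MESSAGE are hypothetical, and the check assumes the report begins with the exact success message written by the validation program above.

```python
from pathlib import Path

# Assumption: this is the first message the validation program writes
# when the dataset has no errors.
NO_ERRORS_MESSAGE = 'There are no errors in the dataset.'


def dataset_passed(report_filename):
    """Return True if the report file records a dataset with no errors."""
    path = Path(report_filename)
    if not path.is_file():
        # No report yet: the validation program has not been run.
        return False
    text = path.read_text(encoding='utf-8')
    return text.startswith(NO_ERRORS_MESSAGE)
```

For the file names used above, dataset_passed('data.csv_validate_report.txt') is True only when the validation found no errors. An invalid specification or any dataset error yields False.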