Using datasets
Datasets are created, edited, and deleted on the Datasets page of the Data Quality application. You can also create a pipeline that uses a selected dataset on this page.
To add a dataset, you upload rows of data from a text or CSV file. Fields in the data may be delimited by commas (,), periods (.), pipes (|), semicolons (;), spaces ( ), or tabs. Text in a field that contains the delimiter may be qualified by single or double quotation marks. Line breaks may use the Unix (or OS X) LF character, the Windows CR and LF characters, or the Macintosh CR character. Field names may be defined by values in the first row of the data or specified manually after you create the dataset. Refer to the following procedure to create a dataset.
After you upload the dataset, you can view sample data and adjust data settings to account for the file characteristics: the character encoding, the field delimiter, the text qualifier, and the line separator. Line separator selections support Unix, Windows, and Macintosh line breaks. Refer to the following procedure to edit a dataset.
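As a rough illustration, these settings map onto the options of Python's standard `csv` module. This is a minimal sketch, not the application's implementation; the sample bytes, encoding, and settings below are assumptions chosen to show a semicolon delimiter with a single-quote text qualifier:

```python
import csv
import io

# Hypothetical semicolon-delimited data with a single-quote text qualifier.
# The third field contains the delimiter, so it must be qualified.
raw = b"name;city;note\n'Ada';'London';'likes; semicolons'\n"

# The character encoding is applied when the bytes are decoded;
# newline="" lets the csv module recognize LF, CR LF, or CR line breaks.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")

reader = csv.reader(
    text,
    delimiter=";",   # field delimiter: comma, period, pipe, semicolon, space, or tab
    quotechar="'",   # text qualifier: single or double quotation mark
)

rows = list(reader)
header, records = rows[0], rows[1:]
```

Because the qualifier protects the embedded semicolon, the parsed record keeps `likes; semicolons` as a single field rather than splitting it in two.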
Along with data settings, you can edit the data type for each column. Data types include Boolean, Float, Integer, Long, String, Date, Time, and DateTime. You can also specify the semantic type for a selected column; semantic types include email, name, and address options. Data Quality attempts to identify data types by sampling the data, and you can edit the data type formats to match the actual formats in the data.
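To show what identifying types by sampling can look like, here is a simplified sketch, not the application's actual logic: each candidate type is tried against the sampled values, from most specific to least, and String is the fallback. The function name and the date formats are assumptions for illustration:

```python
from datetime import datetime


def _is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def _parses(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_type(samples):
    """Guess a column's data type from sampled values.

    A hypothetical sketch of type inference by sampling: the first
    type that matches every sampled value wins; String is the fallback.
    """
    checks = [
        ("Boolean", lambda v: v.lower() in ("true", "false")),
        ("Integer", lambda v: v.lstrip("+-").isdigit()),
        ("Float", _is_float),
        ("Date", lambda v: _parses(v, "%Y-%m-%d")),
        ("DateTime", lambda v: _parses(v, "%Y-%m-%d %H:%M:%S")),
    ]
    for name, check in checks:
        if samples and all(check(v) for v in samples):
            return name
    return "String"  # no narrower type matched every sample
```

The ordering matters: every integer string also parses as a float, so Integer is checked first, and a mixed column such as `["1.5", "2"]` falls through to Float.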
Edit data type formats for a dataset
You can create a pipeline for a selected dataset on either the Datasets page or the Pipelines page. On the Datasets page, you find the dataset you want to use and then create the pipeline for it. On the Pipelines page, you select the dataset while you configure the pipeline. Use the following procedure to create a pipeline from the Datasets page.
Create a pipeline from a dataset
You can also delete any existing dataset as long as it is not used in a pipeline.