Defining a Data Warehouse Update Schedule

You can schedule Spectrum™ Technology Platform dataflows to extract data from the normalized structure in the data source and transform it into the star schema structure in the data warehouse. Scheduling dataflows is useful because most load operations require system resources that are not available during the business day.

When deciding on an update schedule, consider the following:

  • Frequency
  • Sequence
  • Dependencies

Frequency

You should schedule dataflows to execute based on the grain of the most detailed fact table. For example:

  • If the grain of the fact table is daily, schedule the fact table's population dataflow to run daily.
  • If the grain is monthly, schedule the population dataflow to run monthly and no more frequently, because users work only with data from completed months.

Most population dataflows process large amounts of data, so schedule them to run when use of the Spectrum™ Technology Platform server, the source and data warehouse databases, and the network is minimal.
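
The decision logic can be sketched in a few lines of Python. Spectrum™ Technology Platform schedules dataflows through its own tooling, so this is only an illustration of the grain-to-frequency rule and the off-peak restriction; the dataflow names and the off-peak window are assumptions, not part of the product.

    from datetime import datetime, time

    # Hypothetical registry: each population dataflow and the grain of the
    # fact table it loads.
    DATAFLOWS = {
        "PopulateSalesFact":   "daily",    # fact grain: one row per day
        "PopulateRevenueFact": "monthly",  # fact grain: one row per month
    }

    # Assumed low-usage window for the server, databases, and network.
    OFF_PEAK_START, OFF_PEAK_END = time(1, 0), time(5, 0)

    def is_off_peak(now: datetime) -> bool:
        return OFF_PEAK_START <= now.time() <= OFF_PEAK_END

    def is_due(grain: str, now: datetime) -> bool:
        """Match run frequency to the grain of the most detailed fact table."""
        if grain == "daily":
            return True                # run every off-peak window
        if grain == "monthly":
            return now.day == 1        # run once the prior month has completed
        return False

    def dataflows_to_run(now: datetime) -> list[str]:
        if not is_off_peak(now):
            return []                  # never compete with business-day workloads
        return [name for name, grain in DATAFLOWS.items() if is_due(grain, now)]

    print(dataflows_to_run(datetime(2024, 6, 1, 2, 30)))  # both dataflows are due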

Populate all dimension and fact tables during the initial load. After the initial load, refresh tables based on what was added or changed. Generally, fact tables are refreshed more frequently than dimension tables because:

  • Dimension tables are usually static unless an attribute in the source is changed or added.
  • Fact table data in a decision support database is typically historical and requires regular additions and updates to remain current. The initial load and most incremental loads affect fact tables.
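
The incremental refresh described above follows a high-water-mark pattern, sketched below. This is a generic illustration, not Spectrum-specific code; the sample rows, the in-memory fact table, and the function names are all hypothetical.

    from datetime import datetime

    # Hypothetical stand-ins for the source table and the warehouse fact table.
    SOURCE_ROWS = [
        {"order_id": 1, "amount": 100.0, "modified": datetime(2024, 1, 2)},
        {"order_id": 2, "amount": 250.0, "modified": datetime(2024, 2, 3)},
    ]
    FACT_TABLE = {}  # keyed by order_id

    def refresh_fact_table(last_load: datetime) -> datetime:
        """Apply only source rows added or changed since the previous load."""
        started = datetime.now()
        delta = [r for r in SOURCE_ROWS if r["modified"] > last_load]
        for row in delta:
            FACT_TABLE[row["order_id"]] = row  # upsert: insert new, replace changed
        return started                         # high-water mark for the next run

    mark = refresh_fact_table(datetime.min)    # initial load: everything
    mark = refresh_fact_table(mark)            # incremental: only the delta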

Sequence

There are dependencies among data in the data warehouse databases, so determine the sequence in which to run population dataflows before setting the execution schedule.

Populate dimension tables before fact tables because every dimension record and key must exist before a related fact table can be populated. This restriction is a function of the primary-foreign key relationship between dimension and fact tables in a star schema.

Refresh base-level tables before populating aggregate tables in your decision support database. This sequence ensures that base-level and aggregate tables remain synchronized.

The correct order in which to run population dataflows is (illustrated in the sketch after this list):

  1. Base-level dimension table dataflows
  2. Base-level fact table dataflows
  3. Aggregate dimension table dataflows
  4. Aggregate fact table dataflows
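
The following Python sketch shows how an external script might enforce this order. It is not a Spectrum API; run_dataflow() and the dataflow names are hypothetical stand-ins for however your environment launches a dataflow.

    # Hypothetical names; order mirrors the numbered list above.
    RUN_ORDER = [
        "BaseDimensionDataflow",      # 1. every dimension key must exist first
        "BaseFactDataflow",           # 2. facts reference those dimension keys
        "AggregateDimensionDataflow", # 3. aggregates are built from base tables
        "AggregateFactDataflow",      # 4. keeps aggregates in sync with base facts
    ]

    def run_dataflow(name: str) -> bool:
        """Stand-in launcher; returns True when the dataflow completes."""
        print(f"running {name}")
        return True

    def run_in_sequence() -> None:
        for name in RUN_ORDER:
            if not run_dataflow(name):
                # Loading facts against missing dimension keys would violate
                # the star schema's primary-foreign key relationship, so halt.
                raise RuntimeError(f"{name} failed; skipping downstream loads")

    run_in_sequence()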

Dependencies

You can create dataflow dependencies if several population dataflows must run in a specific order, or if dataflow run times are unpredictable. A dependent dataflow runs only when a condition on the preceding dataflow is met, such as the preceding dataflow completing successfully or failing.
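
In Spectrum this conditional chaining is expressed as a process flow (see below). Purely to illustrate the logic, here is a minimal sketch in which each step is gated on the outcome of the previously run dataflow; all names are hypothetical.

    from enum import Enum

    class Outcome(Enum):
        COMPLETED = "completed"
        FAILED = "failed"

    # Each step names a dataflow and the required outcome of the previously
    # run dataflow (None means no prerequisite).
    CHAIN = [
        ("PopulateDimensions", None),
        ("PopulateFacts", Outcome.COMPLETED),   # only after a clean dimension load
        ("ReportLoadFailure", Outcome.FAILED),  # only if the prior step failed
    ]

    def run(name: str) -> Outcome:
        """Stand-in for launching a dataflow and reporting how it ended."""
        print(f"running {name}")
        return Outcome.COMPLETED

    def run_chain() -> None:
        previous = None  # outcome of the most recently run dataflow
        for name, required in CHAIN:
            if required is None or previous == required:
                previous = run(name)
            else:
                print(f"skipping {name}: prerequisite not met")

    run_chain()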

To create dataflow dependencies, create a process flow in Enterprise Designer. For more information about process flows, see the Dataflow Designer's Guide.