1. Introduction

The largest and most complex phase of any data warehouse project is the Extract-Transform-Load process. This process involves the extraction of data from source systems, transformation of that data so it conforms to standardized values and conventions and loading of the data into the data warehouse. ActiveWarehouse ETL, a subproject of the ActiveWarehouse open source project, will help with the ETL development process.

Note: this documentation covers ActiveWarehouse ETL version 0.9. There is no guarantee that functionality described here will work the same or at all in previous versions.

2. Installing

To install ActiveWarehouse ETL you must first install Ruby and Rubygems. Once you have a functioning Ruby and Rubygems installation you can install by simply running gem install:

sudo gem install activewarehouse-etl

ActiveWarehouse ETL depends on FasterCSV, ActiveRecord and ActiveSupport. If these libraries are not already installed they should be installed with ActiveWarehouse ETL.

2.1. Setting Up

First create a database.yml in the directory of your ETL control files that contains information on how to connect to the etl_execution database. This database stores meta data about each ETL execution. Here is an example of the YAML file from the example application rails_warehouse in the ActiveWarehouse subversion repository:

etl_execution: database: rails_warehouse_etl_execution adapter: mysql hostname: localhost username: root password:

Next create the database defined above in your database server. The first time you run the etl script it will execute the necessary migrations to construct the ETL execution environment database.

2.2. Defining Data Connections

In addition to the etl_execution connection you will also define all of your connections to operational databases as well as the data warehouse environment in the database.yml file in the directory of your ETL control files. You will then reference these connections by name through your control files using the :target option.

3. Control Files

Each ETL process is defined using an ETL control file. The common naming convention for ETL control files in ActiveWarehouse is destination.ctl where destination is the destination table name. For example, if you are loading data into a store dimension called store_dimension then the control file would be called store_dimension.ctl. If you are including the ETL control files as part of your Rails application I recommend putting your etl directory right in the RAILS_ROOT directory.

3.1. How It Works

ETL control files define a pipeline for the ETL process. It is possible to both read from multiple sources and write to multiple destinations. Sometimes it is beneficial to write your ETL process to take advantage of this pipeline concept in order to reduce ETL processing time. This is especially true if you are dealing with source data that contains many records since you may be able to implement the ETL process as a single pass.

Data passes through the ETL pipeline as Hashes. Specifically, each row is a Hash where the key is the name of the field and the value is the current value of the field. All records are initially extracted as text and should be transformed using the type transform or date and time conversions if necessary. In many cases though you may just want to leave the rows as Hashes of symbols mapped to strings.

One of the most important things to keep in mind is that control files are Ruby code, therefore you can use normal Ruby expressions. This can even include importing other libraries as well as your own libraries and support code. So for example, you could put the following at the top of your control file:

require File.dirname(__FILE__) + "/support"

This would require the file support.rb in the same directory as the control file.

3.2. Batching

As of version 0.9, ActiveWarehouse ETL now includes a batch execution system. Batch files end with .ebf which stands for “ETL Batch File”. The batch file contains directives which are applicable to the entire process. Directives are executed in the order in which they appear in the batch file.

3.2.1. Run Directive

The run directive indicates an ETL control file to run. For example, to execute the person_dimension.ctl control file you would include the following in the batch file:

run "person_dimension.ctl"

3.2.2. Use Temp Tables

The use_temp_tables directive indicates that temporary tables should be used during ETL processing. If the ETL process fails during ETL execution then the temp tables will survive and the production tables will be untouched.

The temp tables directive will affect the following processors:

It also affects the database source and database destination implementations.

4. Extract

Extraction takes data out of your operational databases and files and brings it into the ETL pipeline for processing. ActiveWarehouse ETL supports three types of extraction methods: flat file, database and in-memory enumeration. When you define a source to extract data from you may provide a variety of options to control aspects of the extraction. Some options are usable across all sources while some are source dependent. The source options which are applicable for all sources are:

:store_locally
Option that sets whether or not the source data should be extracted to a local file first before processing. When this option is true any data will first be stored in the directory called source_data in the current working directory as a CSV (comma-delimited text file). The default value is true.

There are several benefits for storing the data locally before processing. For the database source there is the benefit that the connection to the database does not need to be maintained, which is good in that it frees up resources. Another benefit is that the memory needed for processing is reduced because the data will be loaded line-by-line from the CSV file as it is processed. Finally, by storing locally you have an audit trail and you can go back and look through the original source data used during any ETL process.

When store locally is used the data will be written to a time-stamped file in a directory that is based on the source location. For example, if the source is a database on localhost, the source data might be written to source_data/localhost/rails_production_database/people/20070308110032.csv. In addition to the source data an empty file ending in .trig will be written once the data download is complete. This is used by the ETL engine so that it knows that the source data was fully written.

The format for the source definition in a control file is:

source :name, configuration, definition

The configuration argument is always a Hash. The definition argument may be a Hash or an Array, depending on the source type.

4.1. Flat File Extraction

Flat file extraction pulls data from a flat text file. AW-ETL comes with a variety of parsers that can be used to parse different text formats. The following parsers are included with ActiveWarehouse ETL:

The definition argument will vary depending on the parser used. For example, the delimited parser expects an Array corresponding to the fields in each row in the order they will appear in the source file, whereas the fixed-width parser expects a hash where each name/value pair has the field name and a Hash of options for that field.

All parsers share the following common configuration elements:

:file
The file to read from. This may be an absolute or relative path. For relative paths it is relative to the control file, not the current working directory. This configuration element is required.
:parser
The parser name, either as a String or Symbol, a Class that extends from ETL::Parser::Parser or a Hash with two records: :name and :options. If a String or Symbol is given then it will be converted to camel case and Parser will be appended to it, so :delimited becomes DelimitedParser. If a Hash is provided the :name will correspond to the :parser name and the :options Hash will be passed to the parser when it is instantiated. This element is required.

4.1.1. Delimited Parser

The delimited parser will be used when the configuration Hash contains a :file element and a :parser element with a value of :delimited.

The delimited parser includes the following configuration options:

:skip_lines
The number of lines that should be skipped. This is useful for skipping header lines. This element is optional and will default to 0.

The definition when using the delimited parser is an Array of field names in the order which they appear in the source file.

For example, consider the following CSV file, called people.csv:

first_name,last_name,email Bob,Smith,bsmith@foo.com Jane,Doe,jdoe@bar.com

The source definition might look like this:

source, :in, { :file => "people.csv", :parser => :delimited, :skip_lines => 1 }, [ :first_name, :last_name, :email ]

You could also pass parameters to the parser. Consider if you have a data file with tabs as the separator:

first_name last_name email Bob Smith bsmith@foo.com Jane Doe jdoe@bar.com

The configuration would look like this:

source, :in, { :file => people.txt’, :parser => { :name => :delimited, :options => { :col_sep => "\t" } }, :skip_lines => 1 }, [ :first_name, :last_name, :email ]

4.1.2. Fixed-width Parser

The fixed-width parser will be used when the configuration Hash contains a :file element and a :parser element with a value of :fixed_width.

The fixed-width parser includes the following configuration options:

:skip_lines
The number of lines that should be skipped. This is useful for skipping header lines. This element is optional and will default to 0.

The definition when using the fixed-width parser is a Hash. Each element in the Hash maps the field name to a Hash of options. The Hash of options may include the following elements:

:start
The start position of the element. The first starting position is always 1.
:length
The length of the field.
:end
The end of the field.

Either :start and :length or :start and :end must be specified.

Consider the following source file:

first last email Bob Smith bsmith@foo.com Jane Doe jdoe@bar.com

The source would look like this:

source :in, { :file => "people.txt", :parser => :fixed_width }, { :first_name => { :start => 1, :length => 9 }, :last_name => { :start => 10, :length => 12 }, :ssn => { :start => 22, :length => 50 } }

Note that the fixed width parser will automatically trim each value.

4.1.3. SAX Parser

The SAX parser will be used when the configuration Hash contains a :file element and a :parser element with a value of :sax. While a DOM-based XML parser is included in the ActiveWarehouse ETL library, you should always favor the SAX implementation as it performs and scales in a way that is not possible with the DOM-based XML parser.

Unlike the other parsers, the SAX parser passes its options directly via the definition. The definition takes two elements:

:write_trigger
A path that defines the element that will cause the current row to be written.
:fields
A Hash of fields where each field name is mapped to the XPath-like path for finding the field in the XML stream.

Consider the following XML document:

<?xml version="1.0"?> <people> <person ssn="123456789"> <first_name>Bob</first_name> <last_name>Smith</last_name> <email>bsmith@foo.com</email> </person> <person ssn="111223333"> <first_name>Jane</first_name> <last_name>Doe</last_name> <email>jdoe@bar.com</email> </person> </people>

The source would look like this:

source :in, { :file => "people.xml", :parser => :sax }, { :write_trigger => "people/person", :fields => { :first_name => "people/person/first_name", :last_name => "people/person/last_name", :email => "people/person/email", :ssn => "people/person[ssn]" } }

The path parser currently only handles simple expressions, specifically the path to the element or attribute. The syntax is XPath-like but it is not XPath.

4.1.4. Apache Combined Log Parser

The Apache combined log parser will be used when the configuration Hash contains a :file element and a :parser element with a value of :apache_combined_log.

The Apache combined log parser does not include any configuration options.

The definition when using the Apache combined log parser is not required and will be ignored. The Apache combined log parser will always inject the following fields into the ETL pipeline:

Values in the above that are a dash will be converted to nil. The timestamp value will be converted to a Ruby Time instance.

The following elements will be parsed from the :referrer string, if possible, and injected into the ETL pipeline:

In addition, the following elements will be parsed from the :user_agent string if possible and injected into the ETL pipeline:

4.1.5. Custom Parser

You can also define your own parser by extending ETL::Parser::Parser. For example:

class MyParser < ETL::Parser::Parser def each [{:name => "Bob"},{:name => "Jane"},{:name => "Joe"}].each do |row| yield row end end end

source :in, { :file => "", :parser => MyParser }, [ :name ]

4.2. Database Extraction

Database extraction uses the ActiveRecord library to extract data from an operational database. As described in the beginning of this section, the data will be extracted to a local file first prior to being processed through the ETL pipeline.

Database connections always require the following configuration elements:

:database
The name of the database to extract from.
:target
The target database connection.
:table
The name of the table to extract from.

In addition, the following optional elements may be provided:

:join
Specify the join portion of the query that will be used to extract the data. The value must be a SQL fragment if specified.
:select
Specify the select portion of the query that will be used to extract the data. Defaults to an asterisk. The value must be a SQL fragment if specified.
:group
Specify the group by portion of the query that will be used to extract the data. The value must be a SQL fragment if specified.
:order
Specify the order by portion of the query that will be used to extract the data. The value must be a SQL fragment if specified.
:new_records_only
Specify the column to use when comparing timestamps against the last successful ETL job execution for the current control file.

Given the following database table:

create table people ( first_name varchar(50), last_name varchar(50), email varchar(150) );

The source might look like this:

source :in, { :database => "my_app", :target => :operational_database :table => "people" }, [ :first_name, :last_name, :email ]

In versions before 0.9, a common pattern is to grab the configuration YAML from ActiveRecord::Base”s configuration Hash and merge that with the other options to be passed to the database source:

operational_db = ActiveRecord::Base.configurations["operational"].symbolize_keys source :in, operational_db.merge({ :database => "my_app", :table => "people" }), [ :first_name, :last_name, :email ]

As of 0.9 this will no longer work. Rather you should use the :target option to select a named connection that is defined in database.yml as shown in the first example.

The options for :join, :select, :group and :order are useful for tailoring the query to get only the results you need to push through the ETL pipeline:

source :in, { :database => "my_app", :target => :operational_database :table => "people", :join => "addresses on people.address_id = addresses.id", :select => "people.email, addresses.city, addresses.state, people.created_at" :order => "people.created_at" }, [ :email, :city, :state ]

4.3. Enumerable Extraction

The enumerable source provides a means of injecting records into the ETL pipeline from any object that mixes in the Ruby Enumerable module. For example:

source :default, { :type => :enumerable, :enumerable => [ { :first_name => "Bob", :last_name => "Smith", :email => "bsmith@foo.com" }, { :first_name => "Jane", :last_name => "Doe", :email => "jdoe@bar.com" } ]}

This feature is often used to inject a default record into the head of the ETL pipeline when populating a dimension. For example, if you had records in your operational database tied to people and if it is possible for the record to have a null reference in it”s person_id foreign key, then you might want an “Unknown” person:

source :default, { :type => :enumerable, :enumerable => [ { :first_name => "Unknown", :last_name => "Unknown", :email => "Unknown" } ]}

4.4. Read Locally

When the read-locally option is specified on the command line during the etl execution the source will bypass querying the actual source and will instead use the last successfully downloaded source cache file in source_data.

5. Transform

The transformation portion of the ETL pipeline converts data in all kinds of ways to help standardize the way the data is recorded in the target system. ActiveWarehouse ETL supports two ways to transform, either by field transforms or by row processors. There are quite a few built in transforms and processors and making your own is a matter of using a block or creating a custom transform class or row processor class.

5.1. Field Transforms

Field-level transforms are only applied to a single field in each row. Any transform of this type must implement the method transform(name, value, row) where name is the field name, value is the current value of the field and row is the entire row (in case the transformation depends on other fields in the row). The transform must return the value for the field after the transform is complete

Field transforms are defined in one of two ways. For built-in transforms they are defined by name:

transform :ssn, :sha1, {}

The first argument is the field name, the second is the name of the transform and the third is an options Hash.

The second form uses a block or class:

transform(:ssh){|n,v,r| v.substring(0,24) }

The block takes three arguments, the name, the value and the row. The block must return a single value that is the new, transformed value.

Note that you can transform a field as many times as you want. Also note that the order in which transforms appear is the order in which they will be executed.

For example:

transform :ssh, :sha1 transform :ssh, :default, :default_value => "Unknown" transform(:ssh){|n,v,r| v.substring(0,24)}

5.1.1. Decode Transform

Uses an external decode file to convert from coded value into something that is consistent and understandable without needing to know how coded values are decoded.

For example:

transform :sex, :decode, :decode_table_path => "sex_decode.txt"

5.1.2. Default Transform

This transform will return the given default value if the value passed in is blank, according to the rules for blank?, otherwise it will just return the original value.

For example:

transform :name, :default, :default_value => "No Name"

If the value is either nil or an empty String then the row value would be changed to “No Name”, otherwise the original value would be returned.

5.1.3. Date-to-String Transform

This transform converts a date into a formatted String. Options:

:format
A format string using the built in Ruby date formatting syntax.

For example:

transform :created_at, :date_to_string, :format => "%m/%d/%Y"

5.1.4. String-to-Date Transform

This transform converts a String into a Ruby Date instance.

5.1.5. String-to-Time Transform

This transform converts a String into a Ruby Time instance.

5.1.6. Foreign Key Transform

Transform a natural key into its foreign key equivalent.

5.1.7. Hierarchy Lookup Transform

Resolve a field up its hierarchy until a non-null value is found. This transform expects the operational system have the notion of a parent_id field which can be used to walk up the tree.

Configuration options:

:target
The target connection (REQUIRED)
:table
The table to search (REQUIRED)
:parent_id_field
The parent ID field, defaults to :parent_id

For example:

transform :fearless_leader, :hierarchy_lookup, :target => :operational_database, :table => people’, :parent_id_field => :p_id

If :parent_id_field is not specified then :parent_id will be used.

5.1.8. Ordinalize Transform

Convert a number into its ordinalized form. For example:

transform :creation_order, :ordinalize

If creation_order is 12 then it would be changed to 12th.

5.1.9. SHA-1 Transform

Pass the value through a one-way SHA-1 Hash.

For example:

transform :ssn, :sha1

5.1.10. Trim Transform

Trim the value using Ruby’s trim method.

For example:

transform :name, :trim

5.1.11. Type Transform

Convert the value to a specific type.

For example:

transform :age, :type, :type => :number

The supported types are:

5.2. Row Processors

Row processors are like transforms except they work on an entire row rather than just a single field. Row processors accept the current row as an argument and may return 0 to N number of rows after the transformation. This means that row transforms can be used to filter out rows which should not be kept, modify multiple dependent fields in the row at once, add or remove fields from the row or even take a single row and return many rows (which is used when exploding hierarchies).

Row processors can be applied either directly after the data is read from the source by specifying the processor using after_read or directly before the data is written to the destination by using before_write.

5.2.1. Check Exist Processor

Row processor that checks to see whether or not the row already exists in the destination data store.

Options:

:table
The name of the table to check for existence of a row (REQUIRED).
:target
The target connection (REQUIRED).
:skip
A symbol or array of symbols that should not be included in the existence check. If this option is not specified then all of the columns will be included in the existence check (unless :columns is specified).
:columns
An array of symbols for columns that should be included in the query conditions. If this option is not specified then all of the columns in the row will be included in the conditions (unless :skip is specified).

If there are no rows in the target table then this processor will be bypassed. If there are rows then each row will be checked against the target database table. Rows included in the skip list will not be checked.

Note that this transform actually goes to the target database each time through so it will be a significant bottleneck in processing.

As an example, given a product dimension with the attributes Brand and Category:

before_write :check_exist_processor, :target => :data_warehouse, :table => "product_dimension"

This would check each row with a query:

SELECT * FROM product_dimension WHERE brand = “the_brand” AND category = “the_category”

Where the_brand and the_category are the brand and category, respectively, for each row.

If you decided that only the brand name is a natural key, then you could do:

before_write :check_exist_processor, :table => "product_dimension", :columns => [:brand]

Would result in the query:

SELECT * FROM product_dimension WHERE brand = “the_brand”

5.2.2. Check Unique Processor

Row processor that checks to see whether or not the row is unique. This processor stores a cache of compound keys and checks to see if the compound key exists in the cache. If it does then it will not return the row, thus removing duplicates.

Options:

:keys
An array of keys to use as the compound key.

For example:

before_write :check_unique, :keys => [:first_name, :last_name, :email]

If the row {:first_name => "John", :last_name => "Smith", :email => "jsmith@foo.com"} appeared twice during the ETL process only the first instance would be written to the destination.

5.2.3. Copy Field Processor

The copy field processor will copy the value of a field to another field.

For example:

after_read :copy_field, :source => "name", :dest => "full_name"

There is also a shortcut for this:

copy :name, :full_name

Note that the copy shortcut executes as an after_read processor. If you need to copy a field after applying other transforms then you will need to use the :copy_field processor explicitly.

5.2.4. Rename Field Processor

The rename field processor renames the specified field.

For example:

after_read :rename_field, :source => "name", :dest => "customer_name"

There is also a shortcut for this:

rename :name, :customer_name

Note that the rename shortcut executes as an after_read processor. If you need to rename a field after applying other transforms then you will need to use the :rename_field processor explicitly.

5.2.5. Hierarchy Exploder

This processor will convert a Rails-style acts_as_tree hierarchy into a dimensional bridge which is common to data warehouse systems.

Options:

:target
The target connection
:table
The database table to use
:id_field
The name of the id field, defaults to :id
:parent_id_field
The name of the parent id field, defaults to :parent_id

5.2.6. Print Row Processor

A debugging processor that will print the row.

5.2.7. Sequence Processor

Generate the next number in a context-sensitive sequence. The context can be specified as a configuration option, thus allowing independently incrementing sequences.

5.2.8. Surrogate Key Processor

The surrogate key processor is used to generate surrogate keys for dimension records.

:target
The target connection
:table
The table to find the initial surrogate key.
:column
The column name for find the initial surrogate key.
:destination
The field that will hold the surrogate key (defaults to :id)

5.2.9. Require Non Blank Processor

This processor requires that the specified fields are not blank, otherwise it will throw away the current row.

before_write :require_non_blank, :fields => [:first_name, :last_name]

6. Load

The final part of the ETL process is the loading of the data into the destination system. ActiveWarehouse ETL supports two types of destinations: file and database. In almost every case you should use the file destination to write out the processed data and then use a bulk loader via a post processor to load the data.

All destinations have some common options:

:buffer_size
The amount of rows to buffer before flushing. The default is 100.
:condition
A block that must return true in order for the row to be written. The block argument is the row itself.
:append_rows
Set to true to append to the output file rather than overwrite it.

6.1. File Destination

A file destination will be used if the destination configuration includes a :file element.

destination :out, { :file => people_dimension.txt }, { :order => [:id, :first_name, :last_name, :email] }

6.2. Database Destination

A database destination will be used if the destination configuration includes a :database element. The configuration options are:

:database
The database to connect to.
:target
The target connection.
:table
The table to write to.
:truncate
Set to true to truncate prior to storing the data.
:unique
Set to true to only store unique records.
:append_rows
Array of rows to append

6.3. Virtual Fields

Virtual fields can be added to the final output by specifying the :virtual Hash in the destination configuration. For example:

destination :out, { :file => people_dimension.txt }, { :order => [:id, :first_name, :last_name, :email, :loaded_at], :virtual => { :id => :surrogate_key, :loaded_at_ => Time.now }, }

Another example, this time specifying the query for the initial surrogate key:

:virtual => { :id => ETL::Generator::SurrogateKeyGenerator.new( :query => SELECT MAX FROM product_dimension ) }

6.4. Slowly Changing Dimensions

Slowly changing dimensions require special treatment in a data warehouse. There are 3 types of slowly changing dimensions. Type 1 replaces old values when an update to a record occurs. Type 2 creates a new row and changes the expiration_date for the original record to the date the change occurred. Type 3 slowly changing dimensions has one column for the original value and one column for the current value.

Currently ActiveWarehouse ETL supports type 1 and type 2 slowly changing dimension implementation. Note that type 2 slowly changing dimensions will be relatively slow because they must delete records during execution to avoid duplicates.

7. Pre Processors

Pre-processors are executed before the ETL pipeline begins.

7.1. Truncate Pre-Processor

Truncate the target table before the ETL process starts.

8. Post Processes

Post-processes execute after all of the data has been processed through the ETL pipeline and has been written to the destination(s).

8.1. Bulk Import

The bulk import post processor uses adapter extensions to load data using native bulk load tools. Currently MySQL, PostgreSQL and MSSQL Server are supported.

The following is an example of executing a bulk import:

post_process :bulk_import, { :file => person_dimension.txt’, :columns => [:first_name, :last_name, :email], :field_separator => ,’, :target => :data_warehouse, :table => person_dimension }

The common idiom of merging the database information is also often used here:

dbconfig = ActiveRecord::Base.configurations[‘production‘].symbolize_keys output_columns = [:first_name, :last_name, :email] post_process :bulk_import, { :file => person_dimension.txt’, :columns => output_columns, :field_separator => ,’, :target => :data_warehouse, :table => person_dimension }

8.2. Encode

The encode processor uses the Ruby Iconv library to convert a file from one encoding to another, line by line.

Here is an example:

post_process :encode, { :source_file => date_dimension.txt’, :source_encoding => utf-8’, :target_file => date_dimension_latin1.txt’, :target_encoding => latin1 }

9. Screens

Screens provide a means for checking data validity after data has passed through the ETL pipeline and before it is loaded into the production system. Screens can be configured to either report a warning message, raise an error that causes just the current job to terminate, or exit the system completely if the error is fatal.

Testing inside screen blocks is accomplished using the assertion methods available in Test::Unit::Assertions from the Ruby core library.

An example of a screen:

screen(:warning) { op_conn = ETL::Engine.connection(:operational_database) dw_conn = ETL::Engine.connection(:data_warehouse) assert_equal op_conn.select_value("SELECT count() FROM people").to_i + 1, dw_conn.select_value("SELECT count() FROM person_dimension").to_i }

In the example above, two connections are retrieved, one to the operational database and one to the data warehouse. Then a SQL query is executed against each connection and the results are tested for equality. Note that the result from the operational database is incremented by 1, which would be common if you include an “Unknown” person in the person_dimension to handle cases of people missing from the dimension but referenced in a fact record.

10. Execution Engine

The ETL process is controlled by the execution engine. The engine coordinates the execution of the control file and captures metadata about the execution. Some of the metadata in only known during the execution while other metadata is persisted beyond the current ETL execution.

10.1. Statistics

The ETL engine collects and reports some interesting statistics each time it runs. These statistics include:

These statistics are output each time the ETL process is executed and can provide valuable information on the location of bottlenecks in the ETL process.

Here is an example of output from an ETL process:

Using AdapterExtensions Starting ETL process Processing upload_fact.ctl Source: localhost/reports/products ..................................... Executing post processes Post-processing complete

Read 37705 lines from sources Wrote 37705 lines to destinations Completed upload_fact.ctl in 1 minutes, 35 seconds with 0 errors. Processing average: 424.78251320913 rows/sec) Avg after_reads: 69589.5558703258 rows/sec Avg before_writes: 65219.6208387736 rows/sec Avg transforms: 534.787733802788 rows/sec Avg writes: 10479.4134304288 rows/sec Avg time writing execution records: 0 ETL process complete

10.2. Persistent Job History

Batch and job data is persisted automatically to the ETL execution database.

11. Command-Line Tool

When you install the ActiveWarehouse ETL Gem the etl script will automatically be installed in your bin path. This script is used to execute ETL control files and has several options for customizing how the control file is executed.

11.1. Command-Line Options

—help or -h
Display a help message.
—config or -c
Specify the database.yml config file to use for the ETL execution engine’s persistent data. This will default to database.yml if it is not specified.
—limit N or -l N
Specify the limit. This is used to process only N records before exiting and is useful for debugging issues with a particular data source. Note that for database sources this will actually modify the SQL query so the limit is part of the query.
—offset N or -o N
Specify the offset. This is used to skip N records before beginning the processing and is useful when debugging issues with a particular data source. Note that for database sources this will actually modify the SQL query so it starts at the specified offset.
—newlog or -n
Force a new etl.log file to be created. If this option is not specified then log entries will be appended to etl.log.
—skip-bulk-import or -s
Do not execute any of bulk import processes. This is useful if you are debugging and do not want to actually load the data into the target database (assuming you are loading using a bulk loader).
—read-locally
Read from the last available locally stored source file rather than from the actual source.

11.2. Execution Example

An example of executing an ETL control file:

etl person_dimension.ctl

12. Patches

In order to get proper logging in ActiveSupport you may need to apply the active_support_logger.patch included in the release. This patch has been applied to the ActiveSupport library and will be included in the next release, however this patch is important for backwards compatibility.

13. License

Copyright (c) 2006-2007 Anthony Eden

Permission is hereby granted, free of charge, to any person obtaining a copy of this
software and associated documentation files (the "Software"), to deal in the Software 
without restriction, including without limitation the rights to use, copy, modify, merge, 
publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons 
to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or 
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR 
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE 
FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
DEALINGS IN THE SOFTWARE.