ETL definition
ETL is a three-step data integration process that extracts, transforms and loads raw data from one or more sources into a data warehouse, data mart, data lake or database. Through the ETL process, data is formatted, normalized and loaded into these data storage systems to create a single, unified data view.
An acronym for extract, transform, load, ETL is used as shorthand to describe the three stages of preparing data. It became a common method of data integration in the 1970s as a way for businesses to use data for business intelligence. Organizations today use ETL for the same reasons: to clean and organize data for business insights and analytics.
ETL is also used to describe the commercial software category that automates the three processes.
How ETL works
The ETL process is easiest to understand one step at a time: extract, then transform, then load.
Extract
Extraction is the first step in the ETL process. Raw structured or unstructured data is exported or copied from one or more data sources and stored temporarily in a staging area. Data sources can include but are not limited to:
- APIs
- SQL databases
- ERP and CRM systems
- CSV files
- Web pages
- XML
- JSON
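As a rough illustration of the extract step, the sketch below copies records from two of the source types listed above (CSV and JSON) into an in-memory staging list. The sample data, the field names and the function names (`extract_csv`, `extract_json`, `extract_all`) are hypothetical and chosen only for this example. A real pipeline would read from files, APIs or databases instead of inline strings.

```python
import csv
import io
import json

# Hypothetical sample sources; real sources would be files, APIs or databases.
CSV_SOURCE = "id,name,amount\n1,Alice,10.50\n2,Bob,7.25\n"
JSON_SOURCE = '[{"id": 3, "name": "Carol", "amount": 12.0}]'


def extract_csv(text):
    """Copy CSV rows out as dictionaries without changing any values."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Copy JSON records out as Python dictionaries."""
    return json.loads(text)


def extract_all():
    """Collect raw records from every source into one staging list."""
    staging = []
    staging.extend(extract_csv(CSV_SOURCE))
    staging.extend(extract_json(JSON_SOURCE))
    return staging


staging_rows = extract_all()
for row in staging_rows:
    print(row)
```

The records are deliberately left raw at this point: the CSV values are still strings while the JSON values are already numbers. Resolving that kind of inconsistency is the job of the transform step.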
Transform
Raw data is then transformed within the staging area. Processing data often involves some of the following functions:
- Filtering
- Cleansing
- Formatting
- Applying a schema
- Optimizing for quality
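Several of the functions above can be sketched in a few lines of Python. This is a minimal example, not a complete transformation layer: the `transform` function, the sample rows and the target schema (an integer `id`, a text `name` and a numeric `amount`) are assumptions made for illustration.

```python
def transform(rows):
    """Filter, cleanse and format staged rows into a fixed schema."""
    cleaned = []
    for row in rows:
        # Cleansing: strip stray whitespace from the name.
        name = str(row.get("name") or "").strip()
        # Filtering: drop rows that have no usable name.
        if not name:
            continue
        try:
            # Applying a schema: coerce each field to its target type.
            record = {
                "id": int(row["id"]),
                "name": name.title(),  # Formatting: consistent capitalization.
                "amount": round(float(row["amount"]), 2),
            }
        except (KeyError, TypeError, ValueError):
            # Quality check: skip rows that cannot be coerced to the schema.
            continue
        cleaned.append(record)
    return cleaned


# Hypothetical staged rows with typical quality problems.
raw_rows = [
    {"id": "1", "name": "  alice ", "amount": "10.50"},
    {"id": "2", "name": "", "amount": "7.25"},   # missing name
    {"id": "x", "name": "Bob", "amount": "3"},   # invalid id
    {"id": 3, "name": "carol", "amount": 12},
]

clean_rows = transform(raw_rows)
print(clean_rows)
```

In this sketch, rows that fail a check are dropped silently. A production pipeline would more likely log or quarantine them so the rejected data can be reviewed.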
Load
Once transformation is complete, data is loaded from the temporary staging area into the target data repository. Data is often loaded in batches, which makes the process easier to automate, reduces repetitive tasks and handles large amounts of data more efficiently.
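Batch loading might look like the following sketch, which uses an in-memory SQLite database from Python's standard library as a stand-in for the target repository. The `sales` table, the `batches` and `load` helpers and the batch size are hypothetical choices for this example. A real data warehouse would typically have its own bulk-load interface.

```python
import sqlite3


def batches(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def load(conn, rows, batch_size=2):
    """Insert rows into the target table one batch at a time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    loaded = 0
    for batch in batches(rows, batch_size):
        # Each batch is committed as its own transaction.
        with conn:
            conn.executemany(
                "INSERT INTO sales (id, name, amount) "
                "VALUES (:id, :name, :amount)",
                batch,
            )
        loaded += len(batch)
    return loaded


# Hypothetical transformed rows waiting in the staging area.
staged = [
    {"id": 1, "name": "Alice", "amount": 10.5},
    {"id": 2, "name": "Bob", "amount": 7.25},
    {"id": 3, "name": "Carol", "amount": 12.0},
]

conn = sqlite3.connect(":memory:")
count = load(conn, staged)
total = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count, total)
```

Committing once per batch rather than once per row keeps the number of transactions small, and a failed batch can be retried without reloading the batches that were already committed.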
Learn more about extract, load, transform (ELT) and the difference between ELT and ETL.