Automating Data Pipelines using Apache Airflow

7 min readFeb 25, 2022

Objective

There is a ride-hailing company which generates large scale of data daily. The company wants to design a system which can load the data in structured format and analyze through automated data pipelines. Our task is to enable an efficient OLAP (Online analytical processing) system on the trips and bookings data for the company. We need to design and schedule a data pipeline to generate the following insights:

Get the car types involved with the highest number of trips for each city
Get the throughput of the trips (Number of trips/Number of bookings) for each city

The data provided is in the form of the following two tables:

Booking
Trip

The schema for these tables can be seen below:

Bookings:

Trips:

Solution Approach:

Bring data from MySQL to HDFS via Sqoop
Create necessary directories in HDFS
Create Hive tables on the imported data
Construct partitions in the Hive table
Filter invalid records using Spark
Run the analysis to generate aggregated result using Spark
View the result

The DAG for the same can be seen below:

Prerequisites:

mysql.dump (MySQL queries for table creation and record )
etl_dag.py(the code for the DAG explained in the video)
filter_trip.py (Spark application)
generate_trip_throughput.py (Spark application)
filter_booking.py (Spark application)
generate_car_with_most_trips.py (Spark application)

Codes:

All the codes can be downloaded from here.

Creating Tables and inserting records:

We’ll use the file mysql.dump; this is the file that will create our tables and insert the records into them. It is available in the same folder.

DAG Construction:

There are multiple steps involved in the DAG. We’ll use the file etl_dag.py for this activity. Let’s describe the steps in two parts.

Part 1:

Below is the list which describes the tasks related to table extraction and location created in the DAG and their definition:

extract_booking_table: Extracts the booking table from MySql to HDFS using Sqoop
extract_trip_table: Extracts the trip table from MySql to HDFS using Sqoop
create_raw_booking_location: Creates a location /data/raw/booking using Bash Operator which stores the raw data (original data as it is) for booking table
create_filtered_booking_location: Creates a location /data/refined/booking using Bash Operator which stores the refined data for booking table
create_raw_trip_location: Creates a location /data/raw/trip using Bash Operator which stores the raw data (original data as it is) for trip table
create_filtered_trip_location: Creates a location /data/refined/trip using Bash Operator which stores the refined data for trip table
create_result_trip_throughput_location: Creates a location /data/output/trip_throughput using Bash Operator which stores the result of throughput
create_result_car_with_most_trips_location: Creates a location /data/output/car_with_most_trips using Bash Operator which stores the results of cars with most trips

Part 2:

create_hive_database: Creates a database events using Hive Operator
create_booking_raw_table: Creates a table booking_raw using Hive Operator
create_trip_raw_table: Creates a table trip_raw using Hive Operator
add_partitions_for_booking_table: Creates partition for booking_table on the booking_dt column
add_partitions_for_trip_table: Creates partition for trip_table on the trip_dt column
create_filter_booking_table: Creates a table booking for refined data using PARQUET format and location hdfs:///data/refined/booking
create_filter_trip_table: Creates a table trip for refined data using PARQUET format and location hdfs:///data/refined/trip
filter_booking_table: Filters invalid records from booking_table using SparkSubmitOperator
filter_trip_table: Filters invalid records from trip_table using SparkSubmitOperator
create_car_with_most_trips_table: Creates table for KPI with PARQUET format and location hdfs:///data/output/car_with_most_trips
generate_car_with_most_trips: Generates result for cars with most number of trips
create_trip_throughput_table: Creates table for KPI with PARQUET format and location hdfs:///data/output/trip_throughput
generate_trip_throughput: Generates result for cars with throughputs

Spark Applications

Spark application is used to filter booking table, trip table, generate trip throughput and trip frequency. We’ll use below 4 files available here:

filter_booking.py
filter_trip.py
generate_trip_throughput.py
generate_car_with_most_trips.py

Setting Task Dependencies

Now, we will work on task dependencies among the tasks that are defined in our DAG. Below is the glimpse of the dependencies:

Last part of the file etl_dag.py contains all the task dependencies.

Running DAG

1. Login to your EC2 instance.

2. Activate the Python virtual environment using the following command:

source /home/ec2-user/airflow/bin/activate

3. We’ll place the mysql.dump file in /tmp folder in EC2 machine using WinSCP tool.

4. Run the command to execute the SQL commands in /tmp/mysql.dump and create our tables :

mysql -u root p<password> /tmp/mysql.dump

5. Next, we need to set up the different connections using the Airflow UI. Create/Edit(if they already exist) the following connections:

Sqoop:

Conn Id: sqoop_default
Conn Type: Jdbc Connection (Select from the drop-down )
Connection URL: jdbc:mysql://localhost/events
Login: root
Password: ***

Hive:

Conn Id: hiveserver2_default
Conn Type: Hive Server 2 Thrift (Select from the drop-down )
Host: localhost
Schema: default
Login: root
Password: ***
Port: 10000

Spark:

Conn Id: spark_default
Conn Type: Spark (Select from the drop-down )
Host: localhost

6. Create a directory called ride in the airflow_codes using the following command:

mkdir /home/ec2-user/airflow_codes/ride

7. Now place the following file inside the ride directory we just created:

filter_trip.py
generate_trip_throughput.py
filter_booking.py
generate_car_with_most_trips.py

8. Now we need to place the etl_dag.py file in the /home/ec2-user/airflow/dags directory. (We can use WinSCP or create a new file and paste the code in that file).

9. To ensure that the file there are no issues/errors with the file is it considered good practice to compile the program using the following command:

python etl_dag.py

10. We can also use the following command to list the dags in our instance:

airflow list_dags

11. Once we have made sure that your dag file has no issues, we can go to the Airflow UI which is hosted in the URL: your_public_ip:8080/admin/

Note: You can find your_publin_ip in your AWS EC2 dashboard (IPv4 Public IP)

12. In case we are re-running this DAG, we will have to delete the target_dir of the sqoop task to avoid error. Also, clear the task/DAG before re-running it.

13. Switch ON the DAG (etl_dag) by hitting the button as mentioned in the below screenshot:

Note: The DAG might take a while to show up on the UI. Keep refreshing and wait Patiently.

14. Click on the etl_dag and go to the graph view. We will find the task is running. If it is not reflecting, wait for some time.

15. Keep refreshing the page, we’ll find that eventually, all tasks will have successfully completed.

16. Once the DAG execution gets completed, the output will be generated in the tables trip_troughput and car_with_most_trips inside the events database.

17. At the last, we can switch off your DAG if we don’t want it to run anymore.

Result

We can view the results in the CLI by following the steps below:

Command for trip_throughtput:

select * from events.trip_throughput;

As per the above screen, we can see below observations:

On 2020–09–16, Mumbai has 75%, Chennai 80% and Bangalore 80% of trip throughput.
On 2020–09–17, Mumbai has 75%, Chennai 80% and Bangalore 80% of trip throughput.

Command for car_with_most_trips:

select * from events.car_with_most_trips;