Create Connections and Variables in Apache Airflow
Traducciones al EspañolEstamos traduciendo nuestros guías y tutoriales al Español. Es posible que usted esté viendo una traducción generada automáticamente. Estamos trabajando con traductores profesionales para verificar las traducciones de nuestro sitio web. Este proyecto es un trabajo en curso.
What is Apache Airflow?
Airflow is an open source platform that you can use to automate, orchestrate, and monitor workflows and data pipelines. One of Airflow’s greatest features is that you can create and execute workflows with code. When you use workflows that are powered by code, you can version control, collaborate on, and debug your workflows.
Airflow refers to workflows as Directed Acyclic Graphs (DAGs). A DAG includes the sequence of tasks to execute along with the relationship between tasks and their dependencies. You can execute an ETL process (extract, load, and transform data) with Airflow and also automate emails with CSV attachments, and create Machine Learning (ML) workflows.
You can connect your Airflow data sources to a central data warehouse so your data analysts have access to all relevant data, which prevents data silos from developing across an organization. Similarly, transparent and reproducible code-driven workflows reduce bottlenecks, because anyone with access to the workflow’s code can debug it.
Airflow provides a Python application programming interface (API) that you can use to code your DAGs and call any connection scripts you create.
In this Guide
This tutorial provides an introduction with basic examples to two fundamental Airflow concepts, Variables and Connections. You can incorporate the ideas covered in this guide into more sophisticated Python scripts when creating your DAGs and data pipelines.
In this Apache Airflow tutorial, you learn how to:
- store Airflow values in variables using the Airflow command-line interface (CLI)
- automate connecting to your data sources using a simple script and the Airflow CLI
Airflow Variables and Connections
Airflow needs to access data from external sources, like databases, APIs, and servers. You use Airflow Connections to create connections to your data sources. Your connections form the building blocks of your Airflow DAGs, because they define your data’s sources, staging area, and destination.
You use Airflow variables to store reusable values, like URIs, database usernames, configurations, and any other values required by your DAGs. The variables are stored in Airflow’s metadata database.
The Airflow CLI
You can use the Airflow CLI to manage your DAGs and create, edit, and delete Airflow objects like connections and variables. You can incorporate CLI commands into scripts to automate your frequently used Airflow CLI commands. In this guide, you learn how to leverage the Airflow CLI to automate creating your Airflow Variables and Connections.
Automate Creating Airflow Variables and Connections
Create Your DAG Variables
Using a JSON file to load Airflow variables is a more reproducible and faster method than using the Airflow graphical user interface (GUI) to create variables. This section uses a simple example to demonstrate how to create and store Airflow variables using the Airflow CLI.
Using a text editor, create a new JSON file to store key-value pairs of any values you need to reuse in your DAGs. The example file includes connection information for a MySQL database.
- File: ~/example_vars.json
1 2 3 4 5 6 7
{ "my_prod_db": "dbname", "my_prod_db_user": "username", "my_prod_db_pass": "securepassword", "my_prod_db_uri": "mysql://192.0.2.0:3306/" }
Issue the following command to load all your variables. Replace the path with the location of your
example_vars.json
file.airflow variables --import /home/username/example_vars.json
To retrieve a variable value from Airflow, use the following command:
airflow variables -g my_prod_db
Airflow returns the value of the
my_prod_db
variable.dbname
Note
Airflow saves the passwords for connections and any variable values in plain text within the metadata database. See the A Recommended Workflow for Sensitive Variables section for ways to keep your variables secure.
Create Your Connection Script
The Airflow CLI can be used to create your Connections to any external system required by you DAGs. This section shows you how to create a simple connection with a reusable bash script that you can adopt for your own Airflow Connections. The example below includes a connection for a MySQL database.
Create a new file named
connection.sh
. Replace the values with your own values or expand on the script to create the Connections required by your DAGs.- File: connection.sh
1 2 3 4 5 6
#!/usr/bin/env bash airflow connections -d --conn_id db_conn airflow connections -a --conn_id db_conn --conn_type mysql --conn_host 'mysql://192.0.2.0:3306/' --conn_schema 'dbname' --conn_login 'username' --conn_port '3306' --conn_password 'securepassword'
The third line of the script deletes any connections that the script may have created previously to maintain idempotency. This means your script can be run as many times as desired with the same expected result.
Ensure that you can execute your Connections script:
chmod u+x /home/username/connection.sh
Load your connections by executing your completed script:
bash /home/username/connection.sh
Use the Airflow CLI to verify that your new Connection was created. Replace
db_conn
with the name of your Connection.airflow connections --list | grep 'db_conn'
A Recommended Workflow for Sensitive Variables
If you use a JSON file to store sensitive connection variables or if you use a script to automate your Airflow Connections, you should develop a workflow for encrypting and decrypting sensitive values. Airflow saves the passwords for connections in plain text within the metadata database. A workflow for your sensitive connection data ensures that these values are never exposed in a raw string format. The section below includes a sketch for a workflow you can consider to keep your sensitive variables secure.
Encrypt: You can use tools like Ansible Vault to encrypt sensitive values before storing them in a remote repository, like GitHub. Another popular tool for storing sensitive values is HashiCorp Vault. The Python Crypto package is another tool that you can use to enable encryption for passwords.
Decrypt: In order to run any automation scripts containing your encrypted variable values, you must include a decryption step before executing them. Airflow needs the decrypted values in order to run your DAGs. Both Ansible Vault and HaschiCorp Vault include mechanisms for providing decrypted variable values to Airflow.
Load: Once your values are decrypted, execute your scripts.
Encrypt: After your automated infrastructure loads both Airflow Variables and Connections, encrypt your sensitive values. This way sensitive data is only exposed through an encrypted Airflow Database that only Airflow can access.
Conclusion
Automating creating your Airflow Variables and Connections is a fundamental step towards transparent data for quick experimentation, prototyping, and analysis. Having all of your relevant data connections in a central and secure repository sets up your organization for collaboration. This allows you to spend less time on the extract and load steps of workflows and more time on the transformation step.
After completing this tutorial, learn how to build a data pipeline using Python using Airflow’s example Pipeline tutorial.
More Information
You may wish to consult the following resources for additional information on this topic. While these are provided in the hope that they will be useful, please note that we cannot vouch for the accuracy or timeliness of externally hosted materials.
This page was originally published on