I originally wrote this post for the .
In our previous post on Airflow, we mentioned how it has taken the data engineering ecosystem by storm. We also talked about how we’ve been using it to move data across our internal systems and explained the steps we took to create an internal workflow. The ETL workflow (e)xtracted PDFs from a website, (t)ransformed them into CSVs and (l)oaded the CSVs into a data store. We also touched briefly on the breadth of ETL use cases you can solve for using the Airflow platform. In this post, we will talk about how one of Airflow’s principles, being ‘Dynamic’, offers a powerful construct to automate workflow generation. We’ll also talk about how that helped us use Airflow to power DISHA, a national data platform where Indian Members of Parliament and district officials monitor the progress of 42 national-level schemes. At the end, we will briefly discuss some of our reflections from the project on the state of today’s public data technology.
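To ground the discussion, here is a minimal sketch of such a DAG file. The extract, transform and load callables, the endpoint URL and the DAG name are placeholders for illustration, not the actual code used in the project.

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract():
    # Placeholder endpoint and query parameter; the real web service differs
    response = requests.get("https://example.gov.in/api/scheme-data",
                            params={"format": "json"})
    response.raise_for_status()
    return response.json()


def transform():
    # Placeholder: turn the extracted payload into CSVs
    pass


def load():
    # Placeholder: load the CSVs into the data store
    pass


dag = DAG(dag_id="example_etl",
          start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

# Link the operators: extract -> transform -> load
extract_task.set_downstream(transform_task)
transform_task.set_downstream(load_task)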
As you can observe, the PythonOperator can be instantiated by specifying the name of the function containing your Python code via the python_callable keyword argument. Multiple instantiated operators can then be linked using the Airflow API’s set_downstream and set_upstream methods.
In the DAG file above, the extract function makes a GET request to a web service, with a query parameter. Web services can vary in their request limit (if they support multiple requests at the same time), query parameters, response format and so on. Since writing custom Python code for each web service would be a nightmare for anyone maintaining the code, we decided to build a Python library (we call it Magneton, since it is a magnet for data), which takes in a JSON configuration describing a particular web service as input and fetches the data using a set of pre-defined queries. But that solved only half of our problem.
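Magneton’s actual interface isn’t shown in this post, but the idea of a configuration-driven fetcher can be sketched roughly as follows. The configuration fields, the placeholder URL and the fetch_all helper are illustrative assumptions, not Magneton’s real API.

import requests

# An illustrative configuration describing one web service; in practice each
# configuration would live in its own JSON file.
SERVICE_CONFIG = {
    "base_url": "https://example.gov.in/api/scheme-data",   # placeholder URL
    "params": {"format": "json"},
    "paginated": True,
    "page_param": "page",
}


def fetch_all(config):
    # Fetch every page of data described by a web-service configuration
    results, page = [], 1
    while True:
        params = dict(config["params"])
        if config.get("paginated"):
            params[config["page_param"]] = page
        response = requests.get(config["base_url"], params=params)
        response.raise_for_status()
        payload = response.json()
        if not payload:
            break
        results.extend(payload)
        if not config.get("paginated"):
            break
        page += 1
    return results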
In our previous post and in the example DAG file above, we could link operators together by writing static calls to the set_downstream and set_upstream methods, since the workflows were pretty basic. But imagine a DAG file’s readability with 1,000 operators defined in it. You would have to be a savant to infer the relationships between operators. Moreover, not everyone on your team (including people who don’t work with Python as their primary language) would have the know-how to write a DAG file, and writing them manually would be repetitive and inefficient.
Describing each task and its dependencies in a YAML configuration makes it easy for us to write a single DAG file that can take in a bunch of these YAML configurations and build DAGs dynamically, by linking operators which have the same identifiers (in this example, we have used a number, 1, for the sake of simplicity). Moreover, anyone in your team who wants to create a workflow can just write a YAML, which makes it easy for a human to define a configuration that is machine-readable. Once you’ve figured out a way to create DAGs based on configurations, you can build an interface to let users build a DAG without writing configurations, making it easy for anyone looking to create a workflow!
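Neither the exact YAML schema nor the project’s DAG-builder code appears in this post, but the pattern might be sketched along these lines. The field names (workflow, order, type, params), the configs directory and the run_step dispatcher are assumptions made for illustration.

import glob
from collections import defaultdict
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# An illustrative YAML file describing one step of workflow 1:
#
#   workflow: 1
#   order: 1
#   type: extract
#   params:
#     url: https://example.gov.in/api/scheme-data


def run_step(step_type, params):
    # Placeholder: dispatch to the actual extract/transform/load logic here
    print("Running %s with params %s" % (step_type, params))


# Group the step configurations by their shared workflow identifier
workflows = defaultdict(list)
for path in glob.glob("configs/*.yaml"):
    with open(path) as config_file:
        step_config = yaml.safe_load(config_file)
        workflows[step_config["workflow"]].append(step_config)

for workflow_id, steps in workflows.items():
    dag = DAG(dag_id="workflow_%s" % workflow_id,
              start_date=datetime(2018, 1, 1),
              schedule_interval="@daily")

    previous = None
    for step in sorted(steps, key=lambda s: s["order"]):
        task = PythonOperator(task_id=step["type"],
                              python_callable=run_step,
                              op_kwargs={"step_type": step["type"],
                                         "params": step.get("params", {})},
                              dag=dag)
        if previous is not None:
            previous.set_downstream(task)
        previous = task

    # Airflow discovers DAG objects defined at the module's top level
    globals()[dag.dag_id] = dag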
For DISHA, we needed to (E)xtract scheme data from source systems via web services and then follow that with the T and L. At an atomic level, our workflows could be broken down into small, independent tasks, each handled by its own operator.

The Airflow web interface lets the project stakeholders manage complex workflows (like the ones described above) with ease, since they can check the workflow’s state, pinpoint the exact step where something failed, look at the logs for the failed task, resolve the issue and then resume the workflow by retrying the failed task. Making tasks idempotent is a good practice to deal with retries. (Note: retries can be automated within Airflow too.)
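For example, automated retries can be declared through a DAG’s default_args; the values and the DAG name below are illustrative, not the project’s actual settings.

from datetime import datetime, timedelta

from airflow import DAG

# Illustrative settings: every task in this DAG is retried up to three
# times, five minutes apart, before Airflow marks it as failed.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(dag_id="scheme_etl",            # hypothetical DAG name
          default_args=default_args,
          start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")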
The District Development Coordination and Monitoring Committee (DISHA) was formed to coordinate between the Central, State and Local Panchayat Governments for the successful and timely implementation of key schemes (such as the National Rural Livelihoods Mission, Pradhan Mantri Awaas Yojana and Swachh Bharat Mission). To monitor the schemes and make data-driven implementation decisions, stakeholders needed to get meaningful insights about the schemes. This required integrating the different systems containing the scheme data. Last year, we partnered with the Ministry of Rural Development (MoRD) and National Informatics Centre (NIC) to create the DISHA dashboard. The DISHA Dashboard helps Members of Parliament (MPs), Members of Legislative Assembly (MLAs) and District Officials track the performance of flagship schemes of different central ministries in their respective districts and constituencies.

‘DISHA is a crucial step towards good governance through which we will be able to monitor everything centrally. It will enable us to effectively monitor every village of the country.’ — Narendra Modi, Prime Minister of India
Back in October 2017, the dashboard had data for 6 schemes, and it was updated in August 2018 to show data for a total of 22 schemes. In its final phase, the dashboard will unify data from 42 flagship schemes to help stakeholders find the answer to life, the universe and everything. For the first time, data from 20 ministries will break silos to come together in one place, bringing accountability to a government budget of over Rs. 2 lakh crores!

The DISHA meetings are held regularly, where the committee members meet to ensure that all schemes are being implemented in accordance with the guidelines, look into irregularities in implementation and closely review the flow of allocated funds. Workflows like the ones shown above have automated the flow of data from scheme databases to the DISHA Dashboard, updating the dashboard regularly with the most recent data for each scheme. This is useful for the committee members since they can plan the meeting agenda by checking each scheme’s performance and identifying priorities and gap areas.
Watch the Prime Minister speak about how he uses the DISHA dashboard to monitor the progress of the Pradhan Mantri Awaas Yojana.
As we move towards a Digital India, we need a fundamental shift from the Excel-for-everything mindset and from how today’s public technology is set up. We need a standardized data infrastructure across public services that will help ministries and departments share data with each other quickly, and with the public. A new generation of public technologists with Silicon Valley–grade technical chops needs to be trained and hired. There’s already a Chief Economic Advisor to the government and a Chief Financial Officer for the RBI. It’s high time a Chief Data Officer is appointed for India!

‘A small team of purpose-driven public technologists, leveraging advances in low-cost device, data and decision-making and the right kind of support is all it takes to build and maintain public, digital infrastructures.’ — Varun Adibhatla