What is Luigi?
According to the Luigi documentation, it ‘helps you build complex pipelines of batch jobs’.
Picture a domino setup in which a single line splits into n branches at a certain point; once the falling tiles reach that point, the branches fall in parallel to each other. In much the same way, Luigi lets you parallelize the work that needs to be done in a pipeline. A pipeline is made up of tasks, which are the domino tiles, and each task has parameters, which are the inputs the task takes. When you create a function in Python, the function may require one or more arguments; I would liken Luigi parameters to these arguments. To explain how these parameters are assigned, it's important to know that there are typically three functions in a task/class: requires, output and run.
Pipeline
As explained, the pipeline is a set of tasks that have been stitched together using Luigi. The process starts by taking the csv file extracted from my data source as input, looks for the unique signup states, and creates a separate csv for each unique state identified; there are four in total. Each csv holds only the activity that occurred under that state. For the purposes of this assignment I named this task ‘separate_csv’, as it does what its name says.
Since I want to find out whether a given user moved from one state to another, the next task, ‘state_to_state_transitions’, depends on the output of the ‘separate_csv’ task. Because of this, the parameter value for separate_csv is assigned in state_to_state_transitions. If you recall, the ‘requires’ function runs first, so when state_to_state_transitions runs, it first runs its requires function, which assigns the original data csv to separate_csv. This logic is built into all my tasks. The purpose of state_to_state_transitions is to build the sequence of marketing sources a given user clicked before completing the signup process. Given the change in the objective of the model, this is more of a nice-to-have table showing the sequence of marketing channels engaged before signup.
The next task takes the 4 unique state files and, for each unique user, checks whether that user moved from the first state to the second state and so on through to the final state, returning boolean values depending on whether or not each transition occurred. The output is three files, one for each transition from one state to the next.
After this part of the pipeline I needed to get the probability distribution for each transition. For this I used a density estimator, which produces an estimate of the probability distribution of a given dataset. This task produces three pickle files, each holding the probability distribution for one transition.
Since I was dealing with a relatively large number of rows of user activity, I then had to draw samples from each pickle file and save each sample as a separate csv. From the samples produced in the sampling task, I created a visualization to check whether the distribution of transitions observed in these sample files matched the distribution in the population data. Since this step produces visual output, I did not include it in my pipeline.
import pickle
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Sample csv files and the pickled distributions they are compared against
files = ['Sessiontolead+sampleprobabs', 'leadtoopportunity+sampleprobabs', 'opportunitytocomplete+sampleprobabs']
pickles = ['Sessiontoleadprobabs', 'leadtoopportunityprobabs', 'opportunitytocompleteprobabs']


def func(sims, tag):
    # Load the sampled transitions for this stage
    file_path = 'C:\\Users\\User\\Documents\\GitHub\\Springboard-DSC\\AttributionModel\\Data\\ModelData\\original\\'
    sims = pd.read_csv(file_path + sims + '.csv')
    # Load the pickled probability distribution for the same stage
    path = 'C:\\Users\\User\\Documents\\GitHub\\Springboard-DSC\\AttributionModel\\Data\\ModelData\\pickles\\'
    actuals = pd.read_pickle(path + tag + '.pck')
    # Grid of points at which the pickled density is evaluated
    y = np.linspace(start=stats.norm.ppf(0.1), stop=stats.norm.ppf(0.99), num=100)
    fig, ax = plt.subplots()
    ax.plot(y, actuals.pdf(y), linestyle='dashed', c='red', lw=2, alpha=0.8)
    ax.set_title(tag + ' Stats v Actual comparison', fontsize=15)
    # Histogram of the sampled values, drawn on a second y-axis
    ax_two = ax.twinx()
    ax_two.hist(sims.iloc[:, 1])
    # Save one comparison plot per transition next to the pickles
    fig.savefig(path + str(tag) + '.png')


for x, y in zip(files, pickles):
    func(sims=x, tag=y)
Output: one comparison plot (.png) per transition.
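The density-estimation and sampling tasks described earlier are not shown in this post, so the following is only a sketch of how they could work. It assumes `scipy.stats.gaussian_kde` as the density estimator. That is an assumption: the post does not show which estimator was used, although the plotting code does call `.pdf` on the unpickled object, which `gaussian_kde` supports. The function names and the synthetic 0/1 data are made up for illustration.

```python
import pickle

import numpy as np
from scipy import stats


def fit_kde(values):
    """Fit a Gaussian kernel density estimate to 1-D data (assumed estimator)."""
    return stats.gaussian_kde(np.asarray(values, dtype=float))


def roundtrip(kde):
    """Pickle and unpickle the estimator, as a task writing a pickle file would."""
    return pickle.loads(pickle.dumps(kde))


def draw_sample(kde, n_samples, seed=0):
    """Draw n_samples values from the fitted density."""
    # resample returns an array of shape (dimensions, n_samples); the data is 1-D.
    return kde.resample(n_samples, seed=seed)[0]


# Synthetic stand-in for one transition column: 1 = transition occurred.
rng = np.random.default_rng(0)
transition_flags = (rng.random(500) < 0.3).astype(float)

kde = fit_kde(transition_flags)
restored = roundtrip(kde)
sample = draw_sample(restored, n_samples=50)
```

The restored object evaluates to the same density as the original, so a later task can load the pickle and either plot `restored.pdf(...)` or draw a smaller sample from it.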
... state to state based on different filters, for example whether the user was on a particular device, started signing up on a particular day or time of day, or clicked a particular marketing campaign (given the right data). Each simulation creates a new file, and once the value of a preceding transition is 0/False, the following states will also be 0/False. Finally, the pipeline ends with a final task that ties every task together: it assigns parameter values to the state_to_state_machine task, joins all the files created through the simulation into a single file, and runs the entire pipeline.
For a more detailed breakdown of the model, visit my .