The batch layer is computed by applying a function to the whole historical dataset, to answer high-level questions that cannot be answered by either the speed layer or the serving layer. The computations typically take hours or days to run, and the results are usually stored in a distributed file system (although this is not a requirement). For example, the queries that need to be answered might span from the beginning of the dataset to now: in our case, how many cabs have served how many passengers to date, or what is the total distance driven by all the cabs. In this article I will try to answer questions like these based on the dataset that I have. The code for the article can be found .
batch view = function(all data)
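As a minimal sketch of this idea, assuming trips are held in memory as a list of dictionaries keyed by the dataset's column names, a batch view is just a pure function of the entire master dataset. The `batch_view` function and the sample records are illustrative and are not taken from the article's code:

```python
from typing import Dict, Iterable


def batch_view(all_data: Iterable[dict]) -> Dict[str, float]:
    """Compute a batch view by scanning every record in the master dataset."""
    view = {"trips": 0, "distance_miles": 0.0, "time_secs": 0}
    for trip in all_data:
        view["trips"] += 1
        view["distance_miles"] += trip["trip_distance"]
        view["time_secs"] += trip["trip_time_in_secs"]
    return view


# Hypothetical sample records; a real master dataset holds far more rows.
master_dataset = [
    {"medallion": "A", "trip_distance": 1.2, "trip_time_in_secs": 536},
    {"medallion": "B", "trip_distance": 3.4, "trip_time_in_secs": 900},
]

print(batch_view(master_dataset))
```

Because the function reads nothing but its input, running it again on the same dataset always yields the same view.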
The batch layer, like the serving layer, satisfies several requirements of big data systems.
For example, to compute the total number of records in the master dataset, you can either use a recomputation algorithm to determine the updated count, or use an incremental algorithm to count the new rows and add that number to the old count.
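The record-count example can be sketched in a few lines of Python. The function names and the toy rows here are hypothetical and only illustrate the two strategies side by side:

```python
from typing import List


def recompute_count(master_dataset: List[dict]) -> int:
    """Recomputation: discard the old result and scan the whole dataset again."""
    return sum(1 for _ in master_dataset)


def incremental_count(old_count: int, new_rows: List[dict]) -> int:
    """Incremental: look only at the rows that arrived since the last run."""
    return old_count + len(new_rows)


# Hypothetical data: three existing rows, then two newly arrived rows.
existing_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
new_rows = [{"id": 4}, {"id": 5}]

old_count = recompute_count(existing_rows)
master_dataset = existing_rows + new_rows

# Both strategies must agree on the final answer.
assert recompute_count(master_dataset) == incremental_count(old_count, new_rows)
```

The incremental version touches only two rows here, but its result is only as correct as the `old_count` it starts from.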
The thing to keep in mind is that an incremental algorithm uses fewer resources and is faster, whereas a recomputation algorithm is more resource intensive and slower. In terms of fault tolerance, however, a recomputation algorithm might be the better option. If you make a mistake while running a recomputation algorithm, all you have to do is fix the mistake and run the algorithm again. It consumes more resources, but the fix is simple. With an incremental algorithm, if you make the same mistake, you have to find the records that were affected by it, and go back and fix them. It might consume fewer resources, but it is extremely time consuming. Incremental algorithms are usually tailor-made for specific use cases, and recomputation algorithms are mostly preferred.
With these things in mind, I will try to answer some questions that require computation on the batch dataset that I have, in the form of visualizations. I will not be storing the results in a database, just in flat files for the purpose of visualization.
medallion, pickup_datetime, dropoff_datetime, trip_time_in_secs, trip_distance, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude
740BD5BE61840BE4FE3905CC3EBE3E7E, 2013-10-01 12:44:29, 2013-10-01 12:53:26, 536, 1.2, -73.974319, 40.741859, -73.99115, 40.742424
Based on this data, the goal of the task at hand is to determine the comparisons and totals listed below.
The tech stack that I have used is Spark for aggregation, and Python for data manipulation and plotting. Since we are working with batch data, there was no need for a streaming engine, and since we are plotting data from flat files, there was no need for a NoSQL database, although we could have used one.
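The article performs the aggregation in Spark. To keep the example dependency-free, the following is a plain-Python sketch of the same group-by-month logic, not the Spark job itself. The field names follow the dataset columns, while `monthly_stats` and the sample rows are assumptions made for illustration:

```python
from collections import defaultdict
from datetime import datetime


def monthly_stats(trips):
    """Group trips by pickup month and sum distance, time, and trip count."""
    stats = defaultdict(lambda: {"distance": 0.0, "time_secs": 0, "trips": 0})
    for trip in trips:
        pickup = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        month = stats[pickup.strftime("%Y-%m")]
        month["distance"] += trip["trip_distance"]
        month["time_secs"] += trip["trip_time_in_secs"]
        month["trips"] += 1
    return dict(stats)


# Hypothetical rows; only the first mirrors the sample record shown earlier.
trips = [
    {"pickup_datetime": "2013-10-01 12:44:29", "trip_distance": 1.2, "trip_time_in_secs": 536},
    {"pickup_datetime": "2013-10-02 08:00:00", "trip_distance": 2.0, "trip_time_in_secs": 600},
    {"pickup_datetime": "2013-11-05 09:15:00", "trip_distance": 0.5, "trip_time_in_secs": 200},
]

for month, totals in sorted(monthly_stats(trips).items()):
    print(month, totals)
```

The per-month totals produced this way are the kind of small result that can be written to a flat file and plotted directly.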
a) Comparison of months based on distance driven
b) Comparison of months based on trip time
c) Comparison of months based on trips taken
d) Comparison of Months based on Distance: HeatMap
e) Comparison of Months based on time spent driving: HeatMap
f) For any month, top 5 rows based on distance driven for each driver
Medallion, Distance Driven, Rank
06EAD4C8D98202F1E2D7057F2899CFE5, 9.90, 13, 1
06EAD4C8D98202F1E2D7057F2899CFE5, 9.80, 11, 2
06EAD4C8D98202F1E2D7057F2899CFE5, 9.70, 16, 3
06EAD4C8D98202F1E2D7057F2899CFE5, 9.60, 11, 4
06EAD4C8D98202F1E2D7057F2899CFE5, 9.50, 17, 5
0F621E366CFE63044BFED29EA126CDB9, 9.99, 1, 1
0F621E366CFE63044BFED29EA126CDB9, 9.95, 1, 2
0F621E366CFE63044BFED29EA126CDB9, 9.94, 1, 3
0F621E366CFE63044BFED29EA126CDB9, 9.91, 2, 4
0F621E366CFE63044BFED29EA126CDB9, 9.90, 1, 5
g) Total distance driven by all drivers is 210 million miles
h) Total time spent driving is 39000 hours
i) Total trips made by all drivers is 98 million
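The per-driver ranking in the list above (top 5 rows by distance for each medallion) can be sketched in plain Python as follows. This is an illustration under assumptions, not the article's Spark code: `top_n_per_driver` and the made-up medallions are hypothetical. In Spark the same result would typically come from a window function partitioned by medallion and ordered by distance.

```python
from collections import defaultdict


def top_n_per_driver(trips, n=5):
    """Return (medallion, distance, rank) rows for each driver's n longest trips."""
    by_driver = defaultdict(list)
    for trip in trips:
        by_driver[trip["medallion"]].append(trip["trip_distance"])

    ranked = []
    for medallion in sorted(by_driver):
        # Sort each driver's trips from longest to shortest and keep the first n.
        longest = sorted(by_driver[medallion], reverse=True)[:n]
        for rank, distance in enumerate(longest, start=1):
            ranked.append((medallion, distance, rank))
    return ranked


# Hypothetical trips for two made-up medallions.
trips = [
    {"medallion": "CAB1", "trip_distance": d} for d in (2.0, 9.9, 4.1, 9.8, 0.3, 7.7, 5.5)
] + [
    {"medallion": "CAB2", "trip_distance": d} for d in (1.0, 3.0)
]

for row in top_n_per_driver(trips):
    print(row)
```

A driver with fewer than five trips simply contributes fewer rows, as `CAB2` does here.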
There are two main types of algorithms that the batch layer implements: recomputation and incremental algorithms. The output of the batch layer can either be flat files, or it can be saved in a NoSQL database.
Then we went through a real-world study of how the batch layer functions. We took a taxi dataset, and determined the total distance driven, time spent driving, and total trips made. We also made a comparison of driving statistics between months.
Read about the Speed layer here: //gzht888.com/lambda-architecture-speed-layer-real-time-visualization-for-taxi-part-1-a31931tr
And about the Serving layer here: //gzht888.com/lambda-architecture-speed-layer-real-time-visualization-for-taxi-part-2-8i1p31q7