visit
SELECT
date.fromtimestamp(purchase_ts) AS purchase_date,
sum_agg(price * quantity) AS total
FROM csv('my_purchases.csv')
WHERE department.upper() == 'IT' and purchase_ts is not Null
GROUP BY 1
ORDER BY 1
TO json
$ pip3 install -U spyql
and check its version:
$ spyql --version
spyql, version 0.8.1
$ pip3 install -U matplotcli
$ wget //raw.githubusercontent.com/dcmoura/blogposts/master/spyql_cell_towers/sample.csv
This CSV file contains data about cell towers that were added to the database on 2022 September 10 (OCID-diff-cell-export-2022-09-10-T000000.csv
file from the project redistributed without modifications under the ).
$ head -3 sample.csv
radio,mcc,net,area,cell,unit,lon,lat,range,samples,changeable,created,updated,averageSignal
GSM,262,2,852,2521,0,10.948628,50.170324,15762,200,1,1294561074,1662692508,0
GSM,262,2,852,2501,0,10.940241,50.174076,10591,200,1,1294561074,1662692508,0
You could do this same operation with SPyQL:
$ spyql "SELECT * FROM csv LIMIT 2" < sample.csv
or
$ spyql "SELECT * FROM csv('sample.csv') LIMIT 2"
Notice that we are telling you to get 2 rows of data and not 3 rows of the file (where the first is the header).
One advantage of SPyQL is that we can change the output format easily. Let’s change the output to JSON and look at the first record:
$ spyql "SELECT * FROM csv('sample.csv') LIMIT 1 TO json(indent=2)"
{
"radio": "GSM",
"mcc": 262,
"net": 2,
"area": 852,
"cell": 2521,
"unit": 0,
"lon": 10.948628,
"lat": 50.170324,
"range": 15762,
"samples": 200,
"changeable": 1,
"created": 1294561074,
"updated": 1662692508,
"averageSignal": 0
}
$ spyql "SELECT count_agg(*) AS n FROM csv('sample.csv')"
n
45745
Notice that aggregation functions have the suffix _agg
to avoid conflicts with Python’s functions like min
, max
and sum
. Now, let’s count how many cell towers we have by radio type:
$ spyql "SELECT radio, count_agg(*) AS n FROM csv('sample.csv') GROUP BY 1
ORDER BY 2 DESC TO pretty"
radio n
------- -----
GSM 31549
LTE 12996
UMTS 1182
CDMA 16
NR 2
Notice the pretty printing output. We can plot the above results by setting the output format to JSON and piping results into :
$ spyql "SELECT radio, count_agg(*) AS n FROM csv('sample.csv') GROUP BY 1
ORDER BY 2 DESC TO json" | plt "bar(radio, n)"
How easy was that? :-)
$ spyql "SELECT mcc, count_agg(*) AS n FROM csv('sample.csv') GROUP BY 1 ORDER BY 2 DESC LIMIT 5 TO pretty"
mcc n
----- -----
262 24979
440 5085
208 4573
310 2084
311 799
MCC stands for Mobile Country Code (262 is the code for Germany). The first digit of the MCC identifies the region. Here’s an exert from :
0: Test networks
2: Europe
3: North America and the Caribbean
4: Asia and the Middle East
5: Australia and Oceania
6: Africa
7: South and Central America
9: Worldwide
Let’s copy and paste the above list of regions and create a new file named mcc_geo.txt
. On the mac, this is as easy as $ pbpaste > mcc_geo.txt
, but you can also paste this into a text editor and save it.
$ spyql "SELECT * FROM csv('mcc_geo.txt') TO pretty"
col1 col2
------ -------------------------------
0 Test networks
2 Europe
3 North America and the Caribbean
4 Asia and the Middle East
5 Australia and Oceania
6 Africa
7 South and Central America
9 Worldwide
SPyQL detected that the separator is a colon and that the file has no header. We will use the colN syntax to address the Nth column.
Now, let’s create a single JSON object with as many key-value pairs as input rows. Let the 1st column of the input be the key and the 2nd column be the value and save the result to a new file:
$ spyql "SELECT dict_agg(col1,col2) AS json FROM csv('mcc_geo.txt') TO json('mcc_geo.json', indent=2)"
We can use cat
to inspect the output file:
$ cat mcc_geo.json
{
"0": "Test networks",
"2": "Europe",
"3": "North America and the Caribbean",
"4": "Asia and the Middle East",
"5": "Australia and Oceania",
"6": "Africa",
"7": "South and Central America",
"9": "Worldwide "
}
We aggregated all input lines into a Python dictionary and then saved it as a JSON file. Try removing the AS json
alias from the SELECT
to understand why we need it :-)
Now, let’s get the statistics by region instead of by MCC. For this, we will load the JSON file that we just created (with the -J
option) and do a dictionary lookup:
$ spyql -Jgeo=mcc_geo.json "SELECT geo[mcc//100] AS region, count_agg(*) AS n
FROM csv('sample.csv') GROUP BY 1 ORDER BY 2 DESC TO pretty"
region n
------------------------------- -----
Europe 35601
Asia and the Middle East 5621
North America and the Caribbean 3247
Australia and Oceania 894
Africa 381
South and Central America 1
We do an integer division by 100 to get the 1st digit of the MCC and then we look up this digit on the JSON we just created (which is loaded as a Python dictionary). This is how we do a JOIN in SPyQL, via a dictionary lookup :-)
$ pip3 install -U h3
Then, we can convert latitude-longitude pairs into H3 cells, count how many towers we have by H3 cell, and save the results into a CSV:
$ spyql "IMPORT h3 SELECT h3.geo_to_h3(lat, lon, 5) AS cell, count_agg(*) AS n
FROM csv('sample.csv') WHERE mcc//100==2 GROUP BY 1 TO csv('towers_by_h3_res5.csv')"
Visualizing these results is fairly simple with Kepler. Just go to and open the above file. You should see something like this:
Also Published