Data source
Every pipeline starts somewhere. Ours starts with a data source exposed as a set of HTTP APIs. There are several of them, and each represents a different piece of the puzzle. We have to download, shape and combine data from all of them to get the full picture. So let's analyse what this data source looks like.
Population
This one is probably the most self-explanatory. It provides population statistics broken down by NSW postal codes. Pay attention to "postal codes"; we will get back to them later. For now, let's see what this API has to offer:
curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/population.json
[
{
"POA_NAME16": 2006,
"Combined": "THE UNIVERSITY OF SYDNEY",
"Tot_p_p": 1259
},
{
"POA_NAME16": 2007,
"Combined": "BROADWAY,ULTIMO",
"Tot_p_p": 8845
},
{
"POA_NAME16": 2008,
"Combined": "CHIPPENDALE,DARLINGTON,GOLDEN GROVE",
"Tot_p_p": 11712
},
...
POA_NAME16 - looks like a postal code value
Combined - comma separated names of suburbs combined under the same postal code
Tot_p_p - total population
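Two details in this sample are worth handling early: POA_NAME16 arrives here as a number, while the other APIs described in this article return it as a string, and Combined packs several suburb names into one field. Here is a minimal Python sketch of normalising such records. It assumes four-digit, zero-padded postal codes are the desired key, and the function name normalise_population is made up for illustration.

```python
import json

# Two records copied from the sample response above.
SAMPLE = """
[
  {"POA_NAME16": 2006, "Combined": "THE UNIVERSITY OF SYDNEY", "Tot_p_p": 1259},
  {"POA_NAME16": 2007, "Combined": "BROADWAY,ULTIMO", "Tot_p_p": 8845}
]
"""


def normalise_population(record):
    """Return a record with a string postcode and a list of suburb names."""
    return {
        # Assumption: postcodes are four digits, zero-padded on the left.
        "postcode": str(record["POA_NAME16"]).zfill(4),
        "suburbs": [name.strip() for name in record["Combined"].split(",")],
        "population": int(record["Tot_p_p"]),
    }


# Index the normalised records by postcode for quick lookups.
population = {
    row["postcode"]: row
    for row in map(normalise_population, json.loads(SAMPLE))
}
print(population["2007"]["suburbs"])
```

Keeping the postcode as a string avoids losing a leading zero if one ever shows up.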
Cases
Next is the core statistical data across the state. It represents a combination of parameters related to COVID cases:
curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/data_cases2.json
{
"data": [
{
"Recovered": 5,
"POA_NAME16": "2536",
"Deaths": 0,
"Cases": 5,
"Date": "12-Jul"
},
...
Recovered - total recovered cases
POA_NAME16 - postal code
Deaths - total fatal cases
Cases - total cases
Date - measurement date
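Note that Date carries a day and a month but no year. If we want real date values later on, the year has to come from somewhere else. The sketch below assumes the caller supplies it (2020 by default) and that the month abbreviations are English; parse_measurement_date is a hypothetical helper, not part of the API.

```python
from datetime import date, datetime


def parse_measurement_date(text, year=2020):
    """Parse a value such as "12-Jul" into a date.

    Assumption: the payload has no year, so it is passed in separately.
    """
    return datetime.strptime(f"{text}-{year}", "%d-%b-%Y").date()


print(parse_measurement_date("12-Jul"))
```

A window that crosses from December into January would need extra care, which this sketch does not attempt.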
Tests
This one is exposed as a separate API, although it could have been combined with the previous one into a single call. It provides several parameters describing the number of tests per postal code.
curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/data_tests.json
{
"data": [
{
"Recent": 796,
"POA_NAME16": "2260",
"Number": 4178,
"Date": "12-Jul"
},
...
Recent - total number of tests for a short time interval in the past
POA_NAME16 - postal code
Number - total number of tests
Date - measurement date
Post codes
Finally, the API that ties them all together. This one is a bit different, though: it returns postal codes in the form of GeoJSON, a format for encoding geographic data structures. In our case, those structures are NSW suburbs. The API is formatted this way so that we can visualize the result in any web map viewer capable of reading the format.
curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/nswpostcodes_final.json
{
"type":"FeatureCollection",
"features": [
{
"type":"Feature",
"geometry":{
"type":"Polygon",
"coordinates":[[[130.85017131100005,-12.453012270999977],...]]
},
"properties": {"POA_CODE16":"0800","POA_NAME16":"0800","AREASQKM16":3.1734 }
}
...
geometry - describes the shape of every postal code area on the map
coordinates - the shape itself
properties - a bag of key-value pairs that can hold any information not specifically related to how the shape is rendered (metadata)
POA_NAME16 - postal code
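Since properties is a free-form bag, one way to use it is to copy per-postcode numbers into each feature before handing the collection to a map viewer. The Python sketch below illustrates the idea. The function attach_metrics, the shortened coordinate ring and the Cases value are all invented for illustration; only the field names come from the sample above.

```python
import copy

# One feature shaped like the sample above; the ring is a made-up placeholder.
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[130.85, -12.45], [130.86, -12.45],
                                 [130.86, -12.46], [130.85, -12.45]]],
            },
            "properties": {"POA_CODE16": "0800", "POA_NAME16": "0800",
                           "AREASQKM16": 3.1734},
        }
    ],
}


def attach_metrics(collection, metrics_by_postcode):
    """Return a copy of the collection with extra properties per postcode."""
    enriched = copy.deepcopy(collection)
    for feature in enriched["features"]:
        postcode = feature["properties"]["POA_NAME16"]
        # Postcodes without metrics are left untouched.
        feature["properties"].update(metrics_by_postcode.get(postcode, {}))
    return enriched


enriched = attach_metrics(collection, {"0800": {"Cases": 3}})  # invented value
print(enriched["features"][0]["properties"])
```

Working on a deep copy keeps the downloaded collection intact, so it can be enriched again with a different day's numbers.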
Those are all the parts of the data source. A few important things to note here. The postal code is the parameter that unites all the APIs, combining them into one data source. Date is another parameter on which most of the measurements intersect. The Population and Post codes APIs have no Date measurement, on the assumption that their results do not change within the time frames considered.
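To make the role of these two keys concrete, here is a rough Python sketch of joining the three tabular APIs on postal code and date. It is one possible approach under stated assumptions, not the way the pipeline necessarily does it: the helper names are hypothetical, and the case and test numbers are invented so that all three records share one postcode.

```python
def postcode_key(value):
    # Assumption: postcodes are four digits. The population sample shows them
    # as numbers, the other samples as strings, so both are normalised here.
    return str(value).zfill(4)


def join_measurements(cases, tests, population):
    """Join the record lists on postcode, and on date where a date exists."""
    people = {postcode_key(r["POA_NAME16"]): r["Tot_p_p"] for r in population}
    rows = {}
    for records in (cases, tests):
        for record in records:
            key = (postcode_key(record["POA_NAME16"]), record["Date"])
            row = rows.setdefault(key, {"postcode": key[0], "date": key[1]})
            for field, value in record.items():
                if field not in ("POA_NAME16", "Date"):
                    row[field] = value
    for row in rows.values():
        # Population has no date, so the same figure is reused for every day.
        row["population"] = people.get(row["postcode"])
    return list(rows.values())


# Field names follow the samples; the case and test numbers are invented.
cases = [{"Recovered": 1, "POA_NAME16": "2007", "Deaths": 0, "Cases": 2, "Date": "12-Jul"}]
tests = [{"Recent": 10, "POA_NAME16": "2007", "Number": 100, "Date": "12-Jul"}]
population = [{"POA_NAME16": 2007, "Combined": "BROADWAY,ULTIMO", "Tot_p_p": 8845}]

rows = join_measurements(cases, tests, population)
print(rows[0])
```

The join is an outer one: a postcode that appears only in the cases or only in the tests still produces a row, with the missing fields simply absent.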
Another interesting observation comes from inspecting the API results as a whole over a few days in a row. The data source represents a slice of the data in time: all the APIs return a sliding window that starts about half a month back and ends at the present day.
                time
----------------------------------->
        |                          |  01.XX.2020
      start                       end
        -------------|--------------
               Set of points

----------------------------------------->
            |                            |  02.XX.2020
          start                         end
            --------------|---------------
                    Set of points

----------------------------------------------->
                  |                            |  03.XX.2020
                start                         end
                  --------------|---------------
                          Set of points
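One consequence of the sliding window is that points eventually fall off its left edge, so keeping a longer history means accumulating the daily downloads ourselves. Below is a minimal Python sketch of that idea. The function merge_snapshot and the records are invented for illustration, and letting the later download win on overlapping points is a design choice here, since we do not know whether the API revises past values.

```python
def merge_snapshot(history, snapshot):
    """Fold one day's download into the accumulated history.

    Records are keyed by (postcode, date); a later snapshot overwrites
    an earlier one when the same key appears again.
    """
    for record in snapshot:
        history[(record["POA_NAME16"], record["Date"])] = record
    return history


# Two invented downloads whose windows overlap on 12-Jul.
day_one = [
    {"POA_NAME16": "2536", "Cases": 4, "Date": "11-Jul"},
    {"POA_NAME16": "2536", "Cases": 5, "Date": "12-Jul"},
]
day_two = [
    {"POA_NAME16": "2536", "Cases": 5, "Date": "12-Jul"},
    {"POA_NAME16": "2536", "Cases": 6, "Date": "13-Jul"},
]

history = {}
for snapshot in (day_one, day_two):
    merge_snapshot(history, snapshot)

print(len(history))
```

The overlapping 12-Jul point is stored once, so the history grows only by the points that are new in each download.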