
Data source

Every pipeline starts somewhere. Ours starts with a data source exposed as a set of HTTP APIs. Each API represents a different piece of the puzzle, so we have to download, shape and combine data from all of them to get the full picture. Let's analyse what this data source looks like.

Population

This one is probably the most self-explanatory. It provides population statistics broken down by NSW postal code. Pay attention to the postal codes; we will come back to them later. For now, let's see what this API has to offer:

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/population.json
[
{
"POA_NAME16": 2006,
"Combined": "THE UNIVERSITY OF SYDNEY",
"Tot_p_p": 1259
},
{
"POA_NAME16": 2007,
"Combined": "BROADWAY,ULTIMO",
"Tot_p_p": 8845
},
{
"POA_NAME16": 2008,
"Combined": "CHIPPENDALE,DARLINGTON,GOLDEN GROVE",
"Tot_p_p": 11712
},
...
  • POA_NAME16 - looks like a postal code value
  • Combined - comma separated names of suburbs combined under the same postal code
  • Tot_p_p - total population
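One detail is visible in the sample itself: POA_NAME16 arrives here as a JSON number, while the same field is a quoted string in the other responses in this article. A minimal Python sketch of how the records could be normalized, assuming hypothetical names `PopulationRecord` and `parse_population` and assuming postal codes should be compared as four-character strings:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PopulationRecord:
    postcode: str        # POA_NAME16, normalized to a four-character string
    suburbs: list[str]   # "Combined", split on commas
    population: int      # Tot_p_p


def parse_population(payload: str) -> list[PopulationRecord]:
    """Parse the body of population.json into typed records."""
    records = []
    for item in json.loads(payload):
        records.append(
            PopulationRecord(
                # The sample returns a number; pad it so it matches "2536"-style keys.
                postcode=str(item["POA_NAME16"]).zfill(4),
                suburbs=[name.strip() for name in item["Combined"].split(",")],
                population=int(item["Tot_p_p"]),
            )
        )
    return records


sample = '[{"POA_NAME16": 2007, "Combined": "BROADWAY,ULTIMO", "Tot_p_p": 8845}]'
print(parse_population(sample))
```

Normalizing at the edge keeps every later join on a single key type.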

Cases

Next is the core statistical data across the state. It represents a combination of parameters related to COVID cases:

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/data_cases2.json
{
"data": [
{
"Recovered": 5,
"POA_NAME16": "2536",
"Deaths": 0,
"Cases": 5,
"Date": "12-Jul"
},
...
  • Recovered - total recovered cases
  • POA_NAME16 - postal code
  • Deaths - total fatal cases
  • Cases - total cases
  • Date - measurement date
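The Date value in the sample ("12-Jul") carries no year, so a consumer has to supply one. The sketch below shows one way to do that in Python. The function names are hypothetical, the year is passed in by the caller, and it assumes months are always three-letter English abbreviations, which the single sample cannot confirm.

```python
from datetime import date

# Assumption: months are always three-letter English abbreviations, as in "12-Jul".
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_measurement_date(text: str, year: int) -> date:
    """Turn a "DD-Mon" string into a date; the year must come from the caller."""
    day, month = text.split("-")
    return date(year, MONTHS[month], int(day))


def parse_cases(payload: dict, year: int) -> list[dict]:
    """Flatten the {"data": [...]} envelope into rows with a real date."""
    return [
        {
            "postcode": row["POA_NAME16"],
            "date": parse_measurement_date(row["Date"], year),
            "cases": row["Cases"],
            "recovered": row["Recovered"],
            "deaths": row["Deaths"],
        }
        for row in payload["data"]
    ]


sample = {"data": [{"Recovered": 5, "POA_NAME16": "2536", "Deaths": 0, "Cases": 5, "Date": "12-Jul"}]}
print(parse_cases(sample, year=2020))
```

A window that spans a change of year would need extra handling, which this sketch leaves out.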

Tests

This one is a separate API, although it could have been combined with the previous one into a single call. It reports the number of tests per postal code.

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/data_tests.json
{
"data": [
{
"Recent": 796,
"POA_NAME16": "2260",
"Number": 4178,
"Date": "12-Jul"
},
...
  • Recent - number of tests over a short, recent time interval
  • POA_NAME16 - postal code
  • Number - total number of tests
  • Date - measurement date
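Since the cases and tests responses share POA_NAME16 and Date, they can be folded into one record per postal code and day. The following is only a sketch: `merge_cases_and_tests` and the output field names are hypothetical, and it assumes both responses use the same string format for the two key fields, as the samples suggest.

```python
def merge_cases_and_tests(cases: list[dict], tests: list[dict]) -> dict[tuple[str, str], dict]:
    """Combine both feeds into one dict keyed by (postal code, date string)."""
    merged: dict[tuple[str, str], dict] = {}
    for row in cases:
        key = (row["POA_NAME16"], row["Date"])
        merged.setdefault(key, {}).update(
            cases=row["Cases"], recovered=row["Recovered"], deaths=row["Deaths"]
        )
    for row in tests:
        key = (row["POA_NAME16"], row["Date"])
        # Rows that exist in only one feed are kept, just with fewer fields.
        merged.setdefault(key, {}).update(tests=row["Number"], recent_tests=row["Recent"])
    return merged


cases = [{"Recovered": 5, "POA_NAME16": "2536", "Deaths": 0, "Cases": 5, "Date": "12-Jul"}]
tests = [{"Recent": 796, "POA_NAME16": "2260", "Number": 4178, "Date": "12-Jul"}]
for key, value in merge_cases_and_tests(cases, tests).items():
    print(key, value)
```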

Post codes

Finally, the API that ties them all together. This one is a bit different: it returns the postal codes as GeoJSON, a format for encoding geographic data structures. In our case, those structures are NSW suburbs. The API is formatted this way so that we can visualize the data in any web map viewer capable of reading GeoJSON.

curl https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/nswpostcodes_final.json
{
"type":"FeatureCollection",
"features": [
{
"type":"Feature",
"geometry":{
"type":"Polygon",
"coordinates":[[[130.85017131100005,-12.453012270999977],...]]
},
"properties": {"POA_CODE16":"0800","POA_NAME16":"0800","AREASQKM16":3.1734 }
}
...
  • geometry - describes the shape of every postal code area on the map
  • coordinates - the shape itself
  • properties - a bag of key-value pairs that can hold any information not directly related to how the shape is rendered (metadata)
  • POA_NAME16 - postal code
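Because `properties` is a free-form bag, it is a natural place to attach per-postcode numbers before handing the collection to a map viewer. A rough Python sketch follows. The function names `index_by_postcode` and `with_metrics` are hypothetical, and the polygon in the sample is a placeholder ring rather than real data.

```python
def index_by_postcode(collection: dict) -> dict[str, dict]:
    """Map POA_NAME16 to its feature for quick lookups."""
    return {
        feature["properties"]["POA_NAME16"]: feature
        for feature in collection["features"]
    }


def with_metrics(collection: dict, metrics: dict[str, dict]) -> dict:
    """Return a new FeatureCollection whose properties include per-postcode metrics."""
    features = []
    for feature in collection["features"]:
        postcode = feature["properties"]["POA_NAME16"]
        # Build new dicts so the input collection is left untouched.
        properties = {**feature["properties"], **metrics.get(postcode, {})}
        features.append({**feature, "properties": properties})
    return {**collection, "features": features}


collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            # Placeholder ring for illustration only, not real coordinates.
            "geometry": {"type": "Polygon", "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]},
            "properties": {"POA_CODE16": "0800", "POA_NAME16": "0800", "AREASQKM16": 3.1734},
        }
    ],
}
enriched = with_metrics(collection, {"0800": {"Cases": 5}})
print(enriched["features"][0]["properties"])
```

Features with no matching metrics pass through unchanged, so the map still renders every area.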

That covers all the parts of the data source. A few important things to note here. Postal code is the parameter that unites all the APIs, combining them into one data source. Date is another parameter on which most of the measurements intersect.

The Population and Post codes APIs have no Date field. This rests on the assumption that their results do not change within the time frames considered.

Another interesting observation comes from inspecting the API results as a whole over a few days in a row. The data source represents a slice of the data in time: the APIs that carry a Date return a sliding window that starts about half a month ago and ends at the present day.


time
----------------------------------->
        |                          |  01.XX.2020
      start                       end
        -------------|--------------
               Set of points

----------------------------------------->
            |                            |  02.XX.2020
          start                         end
            --------------|---------------
                    Set of points

----------------------------------------------->
                  |                            |  03.XX.2020
                start                         end
                  --------------|---------------
                          Set of points
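One consequence of the sliding window is that points older than the window's start disappear from later responses. If a longer history is wanted, consecutive downloads would have to be accumulated. The sketch below illustrates the idea with an in-memory dict. `merge_snapshot` is a hypothetical name, the sample rows are made up, and it assumes that a later download should overwrite an earlier value for the same postal code and date.

```python
def merge_snapshot(history: dict[tuple[str, str], dict], snapshot: list[dict]) -> dict[tuple[str, str], dict]:
    """Upsert rows keyed by (postal code, date); newer downloads win on overlap."""
    for row in snapshot:
        history[(row["POA_NAME16"], row["Date"])] = row
    return history


# Made-up rows from two consecutive days whose windows overlap on "02-Jul".
day_one = [
    {"POA_NAME16": "2536", "Date": "01-Jul", "Cases": 4},
    {"POA_NAME16": "2536", "Date": "02-Jul", "Cases": 4},
]
day_two = [
    {"POA_NAME16": "2536", "Date": "02-Jul", "Cases": 5},
    {"POA_NAME16": "2536", "Date": "03-Jul", "Cases": 5},
]

history: dict[tuple[str, str], dict] = {}
merge_snapshot(history, day_one)
merge_snapshot(history, day_two)
print(sorted(history))
```

The overlap between windows makes the merge idempotent: downloading the same day twice leaves the history unchanged.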

References

Geo JSON