Data wrangling

Once we have the data source, the next step is to convert it into a form that fits our use case (data wrangling). This step is arguably the most important one. By design, our pipeline feeds a static web site, so a natural data representation is JSON. As we saw earlier, the source APIs already provide data in this format. What is left is to represent the data in a form that best fits both statistical analysis and visualizations, where "fits best" mostly means computational performance.

Data accumulation

As we discovered before, we never get the dataset as a whole sequence of observations, only a sliding window. So every time we receive a new set of points/measurements from the source, we have to combine them with all the previous sliding-window results and store them in one place. We are going to store the results in the file system next to the web application itself, so the browser can load them with a simple fetch request to a particular measurement JSON file (cases, tests, etc.).
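
On the front-end, reading an accumulated measurement then becomes a single request, for example (using the cases-total.json file that appears later in this section):

// in the browser: the accumulated history is just another static asset
const loadCases = () =>
    fetch('cases-total.json').then((response) => response.json());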

Download from source

Here is how we download a file from the source:

const https = require('https')

const download = (url) => new Promise((resolve, reject) => {
    let data = '';

    https.get(url, (resp) => {
        resp.on('data', (chunk) => {
            data += chunk;
        });

        resp.on('end', () => {
            resolve(data);
        });
    })
    .on('error', (err) => {
        // reject instead of killing the process, so the caller decides how to recover
        reject(err);
    });
})

const population = await download('https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/population.json')
const postCodes = await download('https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/nswpostcodes_final.json')

We could also use the same method to download previously stored information:

const cases = await download('https://raw.githubusercontent.com/maxgherman/nsw-corona-virus/gh-pages/cases-total.json')
const tests = await download('https://raw.githubusercontent.com/maxgherman/nsw-corona-virus/gh-pages/tests-total.json')

Writing a file to the file system is also straightforward:

const fs = require('fs')
const path = require('path')
const util = require('util')

const writeFile = util.promisify(fs.writeFile)

await writeFile(path.resolve(baseDir, 'build/population.json'), population)
await writeFile(path.resolve(baseDir, 'build/post-codes.json'), postCodes)
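
One small caveat: writeFile does not create intermediate directories, so if the build folder might not exist yet it is worth creating it first (same baseDir as above):

const mkdir = util.promisify(fs.mkdir)

// make sure the output directory exists before the first write
await mkdir(path.resolve(baseDir, 'build'), { recursive: true })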

Merge

Population and post codes results can be stored as is, but cases and tests have to be merged with the previously stored results before they can be saved. Since the data source is fragmented by time and post code, we merge the parts using those same two keys: the Date and POA_NAME16 measurements. To speed things up, we use a Hash table where the key represents a measurement day and the value is another Hash table, keyed by post code, holding the measurement itself.

Map { '01-10' => Map { '2053' => { Cases: 11, ... }, ... }, ... }

This way we traverse both data sets (the previously merged one and the newly downloaded one) in O(n) time. We also sort the final result by date, which saves us a headache later during data visualization.

const arrangeByDate = (data, store) => data.data.reduce((acc, curr) => {
    const currentYear = new Date().getFullYear();
    const date = curr.Date;

    // dates like '01-Oct' arrive without a year - append the current one
    const parsedDate = /^[0-9]{2}-[a-zA-Z]{3}$/.test(date) ? `${date}-${currentYear}` : curr.Date;
    curr.Date = parsedDate;

    // group measurements by date first, then by post code
    const entry = acc.has(parsedDate) ? acc.get(parsedDate) : new Map();
    entry.set(curr.POA_NAME16, curr);
    acc.set(parsedDate, entry);

    return acc;
}, store);

const mergeCases = async (casesUrl, baseCasesUrl, activeCasesUrl) => {
    const cases = await download(casesUrl).then(JSON.parse);
    const baseCases = await download(baseCasesUrl).then(JSON.parse);
    const activeCases = await download(activeCasesUrl).then(JSON.parse);

    // merge both case sources into one nested map keyed by date and post code
    const caseStore = arrangeByDate(cases, new Map());
    const mergedCases = arrangeByDate(baseCases, caseStore);

    // derive recovered numbers from the active cases data
    activeCases.data.forEach((activeCase) => {
        Object.keys(activeCase).forEach((activeCaseKey) => {
            if (activeCaseKey === 'POA_NAME16') {
                return;
            }

            if (daysMap.has(activeCaseKey)) {
                const dateKey = daysMap.get(activeCaseKey);

                if (mergedCases.has(dateKey)) {
                    const casesByDate = mergedCases.get(dateKey);

                    if (casesByDate.has(activeCase.POA_NAME16)) {
                        const caseItem = casesByDate.get(activeCase.POA_NAME16);
                        const active = activeCase[activeCaseKey];
                        caseItem.Recovered = caseItem.Cases - active;
                    }
                }
            }
        });
    });

    // flatten the nested maps back into a single chronologically sorted array
    const maps = [...mergedCases.values()]
        .map((item) => [...item.values()])
        .flat();

    maps.sort(sort);

    return {
        data: maps
    };
};
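
The snippet above refers to a daysMap lookup and a sort comparator defined elsewhere in the project. Purely as an illustration of their role (the column keys of the active cases file and the exact date format are assumptions):

// Assumption: the active cases file has one column per day, keyed differently
// from the date keys used in the merged store, hence the translation map.
const daysMap = new Map([
    ['2020-10-01', '01-10'],
    ['2020-10-02', '02-10']
]);

// Assumption: Date values look like '01-10-2020' (day-month-year) after arrangeByDate,
// so the comparator orders measurements chronologically.
const sort = (a, b) => {
    const toTime = ({ Date: value }) => {
        const [day, month, year] = value.split('-').map(Number);
        return new Date(year, month - 1, day).getTime();
    };

    return toTime(a) - toTime(b);
};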

We can run this procedure on a schedule (say, once a day) and get all the historical results stored along with the web site's static assets, ready to be served.
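
Tying the pieces together, the whole accumulation step might look roughly like this (a sketch only; casesUrl, baseCasesUrl and activeCasesUrl stand in for the actual source URLs, and download, writeFile and mergeCases are the helpers shown above):

const run = async () => {
    // reference data is stored as is
    const population = await download('https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/population.json')
    const postCodes = await download('https://nswdac-covid-19-postcode-heatmap.azurewebsites.net/datafiles/nswpostcodes_final.json')

    await writeFile(path.resolve(baseDir, 'build/population.json'), population)
    await writeFile(path.resolve(baseDir, 'build/post-codes.json'), postCodes)

    // cases are merged with the previously published history before saving
    const cases = await mergeCases(casesUrl, baseCasesUrl, activeCasesUrl)
    await writeFile(path.resolve(baseDir, 'build/cases-total.json'), JSON.stringify(cases))
}

// invoked by the schedule (e.g. a daily cron job on CI)
run().catch((err) => {
    console.error(err)
    process.exit(1)
})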

Data preparation

At this stage, our transformed data source is not much different from the original one, apart from the accumulated history. It is time to start thinking about how to use it within the web application itself (the browser front-end) so it can participate in visualizations and analysis. Optimization is important here. The data structures chosen to represent the data in memory directly affect data manipulation and traversal procedures and, as a result, can degrade performance. The key is to identify the most important points of the data set: in our case, the measurements from which the rest of the measurements can be inspected/traversed. Think of the data set as a graph and the points as vertices. Essentially, we are looking for strongly connected components within the graph. Strong connection guarantees that we can traverse the data structure faster by finding shorter paths, simply because the more connected the components are, the more alternative paths there are to follow.

Earlier we noticed that time and location form a connected component. Most of the measurements are based on Date and postal code points.

We could also look at the data set as two separate slices: time, as a summary of all post codes, and postal code, as a summary of chronological events.

So it makes sense to shape our data structures so that time and location are points reachable in one step. Sounds familiar? Yes, we used this idea before to transform the original data source. One Hash table represents points in time, another represents locations, and together they give access to any other point.

Map {
    '01-10' => Map {
        '2053' => { Cases: 11, Active: 3, ... },
        '2057' => { Cases: 3, Active: 0, ... },
        ...
    },
    '02-10' => Map {
        '2053' => { Cases: 11, Active: 2, ... },
        '2057' => { Cases: 5, Active: 0, ... },
        ...
    },
    ...
}

const formatMap = (data) =>
    data.data.reduce((acc, curr) => {
        const rootEntry = acc.has(curr.Date) ? acc.get(curr.Date) : new Map();

        acc.set(curr.Date, rootEntry);
        rootEntry.set(curr.POA_NAME16, curr);

        return acc;
    }, new Map());
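
With this shape any measurement is reachable in two constant-time lookups, one per hash table. For example (the keys are the ones from the sample above, and cases is assumed to be the parsed cases-total.json payload):

const casesByDate = formatMap(cases);

const day = casesByDate.get('01-10');        // all post codes for that day
const measurement = day && day.get('2053');  // { Cases: 11, Active: 3, ... }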

Population is slightly different because there is no time measurement, but for the sake of optimization, the same idea could be applied:

const formatPopulation = (data) =>
    data.reduce((acc, curr) => {
        const key = curr.POA_NAME16.toString();

        const population = acc.population;
        const entry = population.has(key) ? population.get(key) : curr;
        population.set(key, entry);

        acc.suburbs.push({ postCode: key, name: curr.Combined });

        return acc;
    }, {
        population: new Map(),
        suburbs: []
    });

Here we are forming a population Map and a list of suburbs to be used later.
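
For example (assuming populationData is the parsed population.json payload):

const { population, suburbs } = formatPopulation(populationData);

const entry = population.get('2053');      // { POA_NAME16: ..., Tot_p_p: ..., Combined: ... }
const names = suburbs.map((s) => s.name);  // suburb names for later use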

Geo Map

Since the data we have represents accumulated numbers spread across the state's suburbs, it's a good idea to present it visually on a map. Recall that the post codes file is in GeoJSON format, so we just need a library to render it as a NSW map in the browser. For the purpose of our application we're going to use Mapbox: it understands GeoJSON and has a free tier.
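
A minimal sketch of how the GeoJSON could be attached to a Mapbox GL JS map (the access token, container id, style URL and colours are placeholders, not values from the project):

import mapboxgl from 'mapbox-gl';

mapboxgl.accessToken = '<your-mapbox-access-token>';

const map = new mapboxgl.Map({
    container: 'map',                          // id of the host <div>
    style: 'mapbox://styles/mapbox/light-v10',
    center: [151.2, -33.9],                    // roughly Sydney
    zoom: 6
});

map.on('load', async () => {
    // the post codes GeoJSON stored by the pipeline above
    const postCodes = await fetch('post-codes.json').then((res) => res.json());

    map.addSource('post-codes', { type: 'geojson', data: postCodes });

    map.addLayer({
        id: 'post-codes-fill',
        type: 'fill',
        source: 'post-codes',
        paint: {
            'fill-color': '#627BC1',
            'fill-opacity': 0.5
        }
    });
});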

The post codes file could be rendered as is, but the rest of the data set (cases, active, recovered, ...) needs to be merged into it manually. We can use GeoJSON Feature properties to attach this additional metadata for extra visualizations. Features represent NSW suburbs, and every feature gets a range of additional properties holding the numerical data of various statistics, such as:

  • Total cases - number of total cases
  • Total cases range - level of severity of the number of total cases (0, 10, 20, 30, 40, 50, > 50)
  • Active cases - number of active cases
  • Active cases range - level of severity of the number of active cases (0, 1, 3, 8, 21, 51, > 51)
  • Recovered cases - number of recovered cases
  • Recovered cases range - level of progression of the number of recovered cases (0, 2, 3, 8, 21, 51, > 51)
  • Tests - number of tests taken
  • Tests range - level of progression of the number of tests (0, 250, 500, 1000, 2000, 3000)

As a data structure, feature properties are just a bag of key-value pairs, which means we are free to add any keys and values to them:

export const mergeData = ({
    postCodes: postCodesGeometry,
    cases: casesInitial,
    population: populationInitial,
    tests: testsInitial
}) => {
    if (!postCodesGeometry) {
        return {};
    }

    const caseLevelKey = getLevelKey(caseLevels);
    const testLevelKey = getLevelKey(testLevels);
    const activeLevelsKey = getLevelKey(activeLevels);
    const recoveredLevelsKey = getLevelKey(recoveredLevels);

    const cases = formatMap(casesInitial);
    const tests = formatMap(testsInitial);
    const { population: populationByCode, suburbs } = formatPopulation(populationInitial);

    const dates = Array.from(cases.keys());
    const selectedDate = dates[dates.length - 1];

    postCodesGeometry.features.forEach((feature) => {
        dates.forEach((date) => {
            const caseEntry = cases.get(date);
            const testsEntry = tests.get(date);

            const {
                testsKey, activeKey, totalKey, recoveredKey, deadKey,
                rangeTestsKey, rangeActiveKey, rangeRecoveredKey
            } = bagKeys(date);

            // initialize properties with zeroes
            feature.properties[date] = 0;
            feature.properties[testsKey] = 0;
            feature.properties[activeKey] = 0;
            feature.properties[totalKey] = 0;
            feature.properties[recoveredKey] = 0;
            feature.properties[deadKey] = 0;

            // ranges
            feature.properties[rangeTestsKey] = 0;
            feature.properties[rangeActiveKey] = 0;
            feature.properties[rangeRecoveredKey] = 0;

            const hasCasesPOA = caseEntry.has(feature.properties.POA_NAME16);

            if (hasCasesPOA) {
                const caseEntryValue = caseEntry.get(feature.properties.POA_NAME16);
                const total = parseInt(caseEntryValue.Cases, 10);
                const recovered = parseInt(caseEntryValue.Recovered, 10);
                const dead = parseInt(caseEntryValue.Deaths, 10);
                caseEntryValue.Active = recovered === 0 ? 0 : total - (recovered + dead);

                feature.properties[date] = caseLevelKey(total);
                feature.properties[rangeActiveKey] = activeLevelsKey(caseEntryValue.Active);
                feature.properties[rangeRecoveredKey] = recoveredLevelsKey(recovered);

                feature.properties[totalKey] = total;
                feature.properties[activeKey] = caseEntryValue.Active;
                feature.properties[recoveredKey] = recovered;
                feature.properties[deadKey] = dead;
            }

            if (testsEntry && testsEntry.has(feature.properties.POA_NAME16)) {
                const testEntryValue = testsEntry.get(feature.properties.POA_NAME16);
                const total = parseInt(testEntryValue.Number, 10);
                const testRange = testLevelKey(total);

                if (hasCasesPOA) {
                    const caseEntryValue = caseEntry.get(feature.properties.POA_NAME16);
                    caseEntryValue.Tests = testEntryValue.Number;
                    caseEntryValue.RecentTests = testEntryValue.Recent;
                }

                feature.properties[testsKey] = total;
                feature.properties[rangeTestsKey] = testRange;
            }
        });

        if (populationByCode.has(feature.properties.POA_NAME16)) {
            const populationEntry = populationByCode.get(feature.properties.POA_NAME16);
            feature.properties.population = populationEntry.Tot_p_p;
            feature.properties.suburbName = populationEntry.Combined;
        }
    });

    return {
        postCodesGeometry,
        suburbs,
        cases,
        populationByCode,
        population: populationInitial,
        tests,
        selectedDate,
        dates
    };
};
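
mergeData relies on getLevelKey, the level arrays (caseLevels, testLevels, activeLevels, recoveredLevels) and a bagKeys helper, none of which are shown here. As a rough idea of the bucketing part only, a hypothetical getLevelKey could map a raw count onto the ranges listed earlier (thresholds and return values are assumptions, not the project's actual implementation):

// Hypothetical sketch - thresholds taken from the "Total cases range" and
// "Tests range" lists above; the real helper may differ.
const caseLevels = [0, 10, 20, 30, 40, 50];
const testLevels = [0, 250, 500, 1000, 2000, 3000];

const getLevelKey = (levels) => (value) => {
    // index of the first threshold the value does not exceed,
    // or levels.length for the "greater than" bucket
    const index = levels.findIndex((level) => value <= level);
    return index === -1 ? levels.length : index;
};

const caseLevelKey = getLevelKey(caseLevels);
caseLevelKey(7);    // 1
caseLevelKey(120);  // 6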

The same bag of key-value properties will be used during map visualizations.
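
For instance, the range level written under the date key could drive a data-driven fill colour. Continuing the Mapbox sketch above (the date '01-10' and the colour ramp are illustrative):

map.addLayer({
    id: 'cases-heat',
    type: 'fill',
    source: 'post-codes',
    paint: {
        // step through the range level (0..6) written by mergeData for the selected date
        'fill-color': [
            'step', ['get', '01-10'],
            '#ffffcc',
            1, '#ffeda0',
            2, '#fed976',
            3, '#feb24c',
            4, '#fd8d3c',
            5, '#f03b20',
            6, '#bd0026'
        ],
        'fill-opacity': 0.6
    }
});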

References

Data wrangling

Hash table

Strongly connected components within a graph

Mapbox: Join local JSON data with vector tile geometries

GeoJSON