Getting individual datapoints from government-published data is simple enough, but you need bespoke tools to unlock the real power of this resource.

In collaboration with RMIT ABC Fact Check, ABC News and others.

Good journalism is defined by its access to information, and because most in the industry are limited to what they can gather manually, they miss out on the many, many insights hiding within the government's endless stream of data.

A spreadsheet here. A PDF there. It's great stuff, but when your analysis is confined to cutting and pasting, you don't get a handle on the bigger forces at play.

We have built a number of tools to retrieve and collate data stored in PDFs, across multiple complex spreadsheets, in dozens of HTML tables, and in the other analyst-unfriendly ways that governments have decided to publish their data.

Automating this data collection work allows research to be massively scaled up: datasets containing dozens, hundreds or many thousands of data points can be assembled, a task that would be well beyond the limits of human patience to undertake by hand.

This takes research from an atomised view to a bird's-eye view, revealing both the bigger picture and the granular detail. It also improves efficiency, giving journalists more time for analysis and reporting.

Fact checking a premier's spin via a PDF reader

During the recent Victorian state election, the premier repeatedly claimed that the state's ambulance response times had been at their record best just prior to the pandemic.

The data needed to assess this claim was stored across multiple quarterly and annual report PDFs published by Ambulance Victoria, and in a complex and inconsistent series of spreadsheets released yearly by the federal Productivity Commission.

Gathering this data by hand would have been an onerous task, so instead we built a script (sketched below) that would:

  • Convert each PDF and spreadsheet into analysable text strings
  • Locate the required information within each converted document
  • Extract relevant data points and collate findings into a dataframe
  • Visualise this information in order to easily assess the claim.
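As a rough illustration of the first three steps, the sketch below reads a folder of report PDFs, hunts for a single figure with a regular expression and collates the matches into a dataframe. The folder name, the pattern and the column labels are placeholders for this example, not the production script.

    import re
    from pathlib import Path

    import pandas as pd
    import pdfplumber  # one of several libraries that can turn PDF pages into text

    # Placeholder pattern for a line such as "... average response time: 12.5 minutes"
    PATTERN = re.compile(r"average response time\D*([\d.]+)\s*minutes", re.IGNORECASE)

    records = []
    for pdf_path in sorted(Path("reports").glob("*.pdf")):  # placeholder folder of reports
        with pdfplumber.open(pdf_path) as pdf:
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        match = PATTERN.search(text)
        if match:
            records.append({"document": pdf_path.name, "minutes": float(match.group(1))})

    df = pd.DataFrame(records)  # one row per document, ready to visualise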

As each document contained subsets of previous data that may have been updated since originally being reported, this collation had to be carried out in reverse chronological order - newest documents first - so that only the latest published figure for each period was captured.
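Here is a sketch of that "latest figure wins" collation, assuming each extracted record carries both the reporting period it covers and the publication date of its source document; the rows are dummy values invented purely to show the mechanics.

    import pandas as pd

    extracted = pd.DataFrame([  # dummy rows for illustration only
        {"period": "2019-Q4", "published": "2020-02-01", "minutes": 12.3},
        {"period": "2019-Q4", "published": "2021-02-01", "minutes": 12.6},  # later revision
        {"period": "2020-Q1", "published": "2020-05-01", "minutes": 13.1},
    ])

    latest = (
        extracted.sort_values("published", ascending=False)       # newest publications first
                 .drop_duplicates(subset="period", keep="first")  # keep only the latest figure per period
                 .sort_values("period")
    )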

Automating this data collection process was not only efficient but also meant that we could easily return to these documents to extract different datapoints at a later stage, with no time wasted.

The end result revealed nothing that could justify the premier's claim, and plenty that threw it into doubt.

Read the full story

Tabularising EPA prosecutions to see who got off the hook

Environmental crimes are of particular interest to us, and as part of research for ABC News, we looked into the performance of Victoria's Environment Protection Authority.

The EPA's website includes individual records of its prosecutions over a number of years - useful for looking up single cases, but no help in seeing which types of crimes the authority is and isn't prosecuting.

To get an overview of its activities, we built a tool to extract these court adjudication records from the Victorian EPA website, which then allowed us to explore them for trends.
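The core of such a tool can be quite small. The sketch below collates prosecution records listed in HTML tables and counts the offence categories; the URL, the page range and the column names are placeholders, not the EPA site's actual structure.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    LISTING_URL = "https://example.org/prosecutions?page={page}"  # placeholder URL

    rows = []
    for page in range(1, 6):  # placeholder page range
        html = requests.get(LISTING_URL.format(page=page), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for row in soup.select("table tr")[1:]:  # skip the header row
            cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
            if cells:
                rows.append(cells)

    prosecutions = pd.DataFrame(rows, columns=["date", "defendant", "offence", "outcome"])
    print(prosecutions["offence"].value_counts())  # which offences are, and aren't, being prosecuted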

We have built similar tools to explore data from the AusTender procurement site in order to analyse federal government spending.

This reporting contributed to the demands for change at the organisation - change that has since taken place.

Granular overviews through a modular table scraper

The Victorian Electoral Commission publishes an incredible amount of data for each election, but getting this data into an analysable format takes some effort - and time.

All of that data, however, also sits within the markup of its website, and to extract it we built a modular HTML table scraper able to reproduce a faithful version of any HTML table - on this site or any other - regardless of its quirks.
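The heart of such a scraper is the logic that puts every cell back in its true grid position, whatever merging a page's authors have applied. The function below is a condensed sketch of that idea, assuming the main quirks are rowspan and colspan merges; the real tool handles more than this.

    import pandas as pd
    from bs4 import BeautifulSoup

    def table_to_frame(table_html: str) -> pd.DataFrame:
        """Rebuild an HTML table cell by cell, expanding rowspan/colspan merges."""
        soup = BeautifulSoup(table_html, "html.parser")
        grid = {}  # (row, col) -> cell text, filled as each cell and its span are placed

        for r, tr in enumerate(soup.find_all("tr")):
            col = 0
            for cell in tr.find_all(["td", "th"]):
                while (r, col) in grid:  # skip positions already claimed by an earlier rowspan
                    col += 1
                text = cell.get_text(strip=True)
                rowspan = int(cell.get("rowspan", 1))
                colspan = int(cell.get("colspan", 1))
                for dr in range(rowspan):
                    for dc in range(colspan):
                        grid[(r + dr, col + dc)] = text
                col += colspan

        n_rows = max(r for r, _ in grid) + 1
        n_cols = max(c for _, c in grid) + 1
        rows = [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
        return pd.DataFrame(rows[1:], columns=rows[0])  # assumes the first row holds the headers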

The scraper then opened up huge possibilities for collecting and collating data found across multiple individual pages, with each page storing snippets of data that made up a much bigger picture.

For example, by incorporating this tool within a script that visited the results page of every electorate, we were able to collect the two-party preferred voting data from every polling booth across the state at the previous election.
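Driving that scraper across many pages is then a short loop. In this sketch the URL pattern and the list of districts are placeholders, and table_to_frame is the function sketched above.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    RESULTS_URL = "https://example.org/results/{district}.html"  # placeholder URL pattern
    districts = ["Albert Park", "Bendigo East", "Brunswick"]     # placeholder subset of electorates

    frames = []
    for district in districts:
        html = requests.get(RESULTS_URL.format(district=district), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find("table")                  # assume the booth-level table is the first on the page
        booth_results = table_to_frame(str(table))  # faithful copy of the table, merges and all
        booth_results["district"] = district
        frames.append(booth_results)

    statewide = pd.concat(frames, ignore_index=True)  # every booth in every electorate, in one dataframe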

The graph at the top of the page shows an example of this data in action, while the graph above shows the same tool grabbing live data as it was published during the 2022 Victorian election.

This allowed us to test a claim that the Greens had won a party-record four seats only because they had been elevated on the Liberals' how-to-vote cards.

Applying a basic split of preferences - roughly in line with the 2018 count - showed that this claim was likely incorrect, and that party leader Samantha Ratnam was likely right when she claimed the party's success had been achieved without Liberal support.
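As a toy illustration of that kind of test: with assumed preference flows, the two-candidate-preferred shares in a Greens-versus-Labor seat can be estimated from primary votes alone. The flow rates and the function below are placeholders for this example, not the split actually used in the analysis.

    def two_candidate_preferred(greens, labor, liberal, other,
                                liberal_to_greens=0.3, other_to_greens=0.5):
        """Estimate final shares by assigning excluded candidates' primaries via assumed flow rates."""
        greens_2cp = greens + liberal * liberal_to_greens + other * other_to_greens
        labor_2cp = labor + liberal * (1 - liberal_to_greens) + other * (1 - other_to_greens)
        total = greens_2cp + labor_2cp
        return greens_2cp / total, labor_2cp / total

    # Varying liberal_to_greens shows how sensitive the outcome is to the Liberal preference flow.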


Betting odds as alternative data

Bookmakers can give an important insight into public sentiment - especially about the performance of government.

During a 2022 state election in Australia, for example, we collated and assessed the betting odds for all parties in the 24 hours before votes were cast.

Through the 'wisdom of the crowd', these odds showed that the Labor Party was set to be returned with a large majority and that the Greens were set to expand their position on the crossbench.
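One common way to read a betting market is to invert each decimal price into an implied probability, then strip out the bookmaker's margin. The sketch below uses invented prices purely to show the arithmetic; they are not the 2022 market.

    # Placeholder decimal odds, invented for illustration only
    odds = {"Labor majority": 1.30, "Coalition majority": 6.00, "Minority government": 4.50}

    raw = {outcome: 1 / price for outcome, price in odds.items()}     # implied probabilities
    overround = sum(raw.values())                                     # exceeds 1 because of the bookmaker's margin
    implied = {outcome: p / overround for outcome, p in raw.items()}  # normalised to sum to 1

    for outcome, p in sorted(implied.items(), key=lambda kv: -kv[1]):
        print(f"{outcome}: {p:.0%}")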

While at odds with much of the commentary in the media, this turned out to be close to the mark - and certainly more accurate than a number of opinion polls leading into election day.