AWS re:Invent 2017 - The UK Government Web Archive Project
February 13, 2018 • 6 min read
In 2017 we were invited to speak at the AWS re:Invent event in Las Vegas about our web archiving project for the UK Government National Archives.
During the presentation, our CTO, Philip Clegg walked the audience through MirrorWeb's story, how we thought about building our search architecture and the lessons and insights we have gained from this for our clients in the public sector. Watch the video below to learn more:
What Does MirrorWeb Do?
MirrorWeb provides web and social media archiving to the public sector and regulated industries such as the financial sector. In the US there are regulated bodies such as FINRA and there are similar bodies in the UK such as the FCA.
In this presentation, we'll be discussing the web archive that we built for the UK National Archives, which is a publicly viewable archive available online. We also run the UK Parliament web archive, which uses similar technologies. However, we will be talking about the work that we did to ingest all of the data for the UK web archive.
What are Web Archives?
If you are familiar with the likes of the Internet Archive and its Wayback Machine, you already have a good picture of what web archives are.
Website data is stored in files using the ISO-standard WARC format, which must be indexed before the archive can be played back. We mention CDX indexing here to explain how we did the indexing later: a CDX index is a list of the details of every asset within the web archive, which would be things like HTML pages, PDFs and anything else that was on a particular website. From this we create a text index of every asset in the file.
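To make the idea concrete, here is a minimal sketch of what one CDX entry might look like. The record fields and the `cdx_line` helper are illustrative assumptions (real CDX formats use canonicalised SURT keys and a fixed field specification), but they show the shape of the index: one searchable line per captured asset, pointing back into the WARC file for playback.

```python
from dataclasses import dataclass

@dataclass
class ArchivedAsset:
    url: str
    timestamp: str   # 14-digit capture time, e.g. "20170315120000"
    mime: str
    status: int
    digest: str      # hash of the record payload
    offset: int      # byte offset of the record within the WARC file
    filename: str    # the WARC file holding the record

def cdx_line(asset: ArchivedAsset) -> str:
    # A CDX line is a space-separated summary of one capture: a
    # searchable key, the capture timestamp, the original URL, MIME
    # type, HTTP status, payload digest, and where to find the record.
    key = asset.url.split("://", 1)[-1].lower()
    return " ".join([key, asset.timestamp, asset.url, asset.mime,
                     str(asset.status), asset.digest,
                     str(asset.offset), asset.filename])

# One capture of a (hypothetical) PDF inside a WARC file:
asset = ArchivedAsset("https://www.gov.uk/report.pdf", "20170315120000",
                      "application/pdf", 200, "sha1:ABC123", 4096,
                      "archive-00001.warc.gz")
print(cdx_line(asset))
```

Given a CDX index like this, playback tools can look up a URL and timestamp, then seek directly to the right offset in the right WARC file.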
What is the UK Government Web Archive?
There are over 20 years of historical archives in the UK Government Web Archive, which amounts to 120 TB of data across more than 4,800 archived sites. A lot of these sites have been shut down and are no longer available on the public web - but they do exist in our archive. We also archive UK Government Twitter accounts and YouTube videos.
We won the UK web archive project tender last November and needed to move this archive from the previous supplier, where it was stored in data centres in Paris. We then needed to collect the data and move it into Amazon.
We were lucky that the data had already been moved to the National Archives. However, it was stored on 72 2 TB hard drives, so we went there with two Snowballs and two machines we had built that could connect eight drives at once, and ingested the data as fast as possible - but it still took two weeks.
In the next phase of the project we had to develop a public-facing website capable of serving over 75 million visitors a month, and we had to provide a full replay of all the archives, like the Wayback Machine. Then we had to provide full-text search across the entire archive.
We were quite familiar with the first three aspects, but at that time we didn't provide full-text search to any of our public or financial clients, nor did we index any video or images. Our client also required that we return no duplicate results. So we had a bit of learning to do.
How Did We Choose the Search Technology?
We looked at the two search technologies that were popular at the time: Solr and Elasticsearch.
Elasticsearch was the winner, as we've seen from the slides previously. It's very popular at the moment, but Amazon also offered an Elasticsearch service. As we were a small team without a lot of search experience, it made sense to have a look at running it on Amazon.
We chose the Elasticsearch service for a number of reasons. Being able to scale was incredibly important, as you'll see later: we spun up a very large cluster to do the initial ingest and then scaled it back down to an affordable level when we went live. Using the managed service meant we didn't have to employ lots of Elasticsearch experts. A managed Elasticsearch cluster also helped with access rights, which could be managed by IAM. It is integrated into the Amazon environment nicely, so we could monitor it with CloudWatch and provide alerting alongside all of our other alerts.
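High-throughput ingest into Elasticsearch typically goes through its `_bulk` API, which accepts newline-delimited JSON. The sketch below shows how such a payload is assembled; the `ukgwa` index name and document fields are hypothetical, and the endpoint details are illustrative only.

```python
import json

def bulk_payload(index: str, docs: list[dict]) -> str:
    # The _bulk API takes newline-delimited JSON: an action line
    # followed by the document source, one pair per document.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # the trailing newline is required

docs = [{"id": "1", "url": "https://www.gov.uk/", "title": "GOV.UK"}]
payload = bulk_payload("ukgwa", docs)
# This payload would be POSTed to https://<es-endpoint>/_bulk with
# Content-Type: application/x-ndjson; on the Amazon Elasticsearch
# Service the request is signed with IAM credentials.
```

Batching many documents per request like this is what makes ingest rates of millions of records per hour feasible.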
Why Traditional Tools Didn't Work
Traditional indexing tools include Hadoop and Spark, and there are a number of open-source projects on the internet for indexing web archives. However, none of them actually pushed into Elasticsearch, so we were forced to look at writing our own. The benchmark was set by the British Library, who run the UK Web Archive - a UK-domain archive rather than a government one. Just around the time we were writing our software, they set a record of ingesting 10 million records per hour, so we set that as our benchmark.
We didn't have experience with Hadoop, so we hired a Hadoop contractor, who quoted one to two weeks of work to move this data into Elasticsearch. Two weeks later, however, they told us it would take six to eight weeks. We hadn't realised that while Hadoop is great for batch processing of small numbers of large files, we had 1.2 million 100 MB files.
It was then that we decided to think outside the box and some smart members of our development team came up with what we called WarpPipe which was, in a way, a rewrite of Hadoop for the Cloud.
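WarpPipe's internals weren't shown in the talk, but the core idea - fanning many small files out to lightweight parallel workers instead of pushing them through a heavyweight batch framework - can be sketched as follows. The function names and the stub worker are hypothetical; in the real pipeline the worker would parse a WARC file and bulk-push its records into Elasticsearch.

```python
from concurrent.futures import ThreadPoolExecutor

def index_warc(path: str) -> int:
    # Stub worker: in a real pipeline this would parse one ~100 MB
    # WARC file and index its records, returning how many it pushed.
    return len(path)  # placeholder "record count" for illustration

def ingest(paths: list[str], workers: int = 8) -> int:
    # With 1.2 million small files, each file is an independent unit
    # of work, so a simple worker pool divides the job naturally
    # without Hadoop's per-job scheduling overhead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_warc, paths))
```

Because each file is self-contained, this pattern also scales out horizontally: the file list can be split across many machines, which is what makes spinning up a large cluster for the initial ingest effective.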
What Was the Result?
We managed to index 1.4 billion documents in 10 hours, averaging 146 million documents an hour. It was very fast - at times it went even quicker - and it meant we didn't have to spend too much on the EC2 instances and the Elasticsearch cluster. We also met the requirement to deduplicate the data.
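One common way to deduplicate an archive like this - and a plausible sketch of the requirement, though the talk doesn't detail MirrorWeb's actual method - is to hash each document's payload and keep only the first capture of each distinct payload, so identical snapshots of a page appear once in search results.

```python
import hashlib

def dedupe(docs: list[dict]) -> list[dict]:
    # Keep only the first document for each distinct payload, keyed
    # by a SHA-256 digest of the content. Identical captures taken
    # at different times then collapse to a single search result.
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc["content"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# Two captures of the same unchanged page collapse to one:
docs = [{"url": "https://example.gov.uk/", "content": "same page"},
        {"url": "https://example.gov.uk/", "content": "same page"}]
print(len(dedupe(docs)))  # → 1
```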