The UK Government Web Archive (UKGWA) archives each and every UK Government website on a predefined schedule.
Social media is a primary communications channel for UK government departments, so all of their social media accounts are captured in near real time to ensure accurate records are kept.
The current data set exceeds 150TB, amounting to billions of documents, and is growing steadily. MirrorWeb indexes huge volumes of data at speed to deliver faceted full-text search across the whole of the UKGWA.
The National Archives is the archive of the UK government and the sector lead for all archives across the UK.
Watch our video interview with Digital Director John Sheridan to find out how MirrorWeb enables the long-term project to archive all UK government web and social media communications.
The way the government uses the web is changing, and as the size and complexity of the UK Government Web Archive (UKGWA) have grown, so have the expectations of its users, who now demand a reliable, comprehensive and intuitive search service as well as access to social media records.
In 2016, The National Archives (TNA) realised they needed to update their web and social archiving provision. They were looking for someone to:
After a lengthy selection process, TNA awarded the contract to MirrorWeb based on their expertise in the cloud, partnership with AWS and their ability to archive social media as well as website data.
We needed to collect and move the archive from the previous supplier, who used a data centre in Paris, over to AWS. The data was stored on 72 hard drives of 2TB each, which meant we required two Snowballs and two custom-built machines that allowed us to connect eight drives simultaneously and ingest the data as quickly as possible. We completed the transfer within two weeks.
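As a rough back-of-the-envelope check on the migration described above, the sketch below works out the raw capacity involved and how many drive batches would be needed. It assumes each of the two custom-built machines took eight drives at once (the text could also be read as eight in total), and the function name `plan_migration` is purely illustrative.

```python
import math


def plan_migration(drive_count: int, drive_tb: int,
                   machines: int, drives_per_machine: int) -> dict:
    """Estimate raw capacity and the number of drive batches to ingest."""
    # Assumption: every machine can read all of its connected drives at once.
    concurrent = machines * drives_per_machine
    return {
        "raw_capacity_tb": drive_count * drive_tb,
        "concurrent_drives": concurrent,
        "batches": math.ceil(drive_count / concurrent),
    }


plan = plan_migration(drive_count=72, drive_tb=2, machines=2, drives_per_machine=8)
print(plan)
```

Under these assumptions the drives hold at most 144TB of raw data and would be read in five batches.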
The next phase of our project was to develop a public-facing website which enabled the full replay of all archives as well as full-text search, capable of serving over 75 million visitors per month. We decided the best way to deliver on TNA’s objectives was to build an indexing and search solution from scratch where we could provide both video and image replay functionality as part of the service.
We ended up choosing Elasticsearch as our primary search technology for the following reasons:
We also knew that traditional tools such as Hadoop wouldn’t work for the project, so we had to think outside the box and develop our own tool. Our team came up with WarpPipe - essentially a rewrite of Hadoop for the cloud.
“MirrorWeb have increased the range of what we were harvesting, particularly around social media.”
MirrorWeb indexed 1.4 billion documents in 10 hours - averaging around 140 million documents per hour - and met our client’s requirement to deduplicate their data. We introduced a lot of new capabilities for TNA, so that they could capture more of what the Government is doing on the web, as well as in other places where Government content is made available online.
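To illustrate what deduplication can look like in an indexing pipeline, here is a minimal sketch that drops documents whose content has already been seen, using a hash of the body as the identity. This is one common approach and an assumption on our part for illustration, not a description of how WarpPipe works; the function name `deduplicate` and the `(url, body)` record shape are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, bytes]  # hypothetical shape: (url, raw document body)


def deduplicate(records: Iterable[Record]) -> Iterator[Record]:
    """Yield only the first record seen for each distinct body."""
    seen_digests = set()
    for url, body in records:
        # Identical bodies produce identical digests, whatever the URL.
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        yield url, body


sample = [
    ("https://example.gov.uk/a", b"<html>same page</html>"),
    ("https://example.gov.uk/b", b"<html>same page</html>"),
    ("https://example.gov.uk/c", b"<html>different page</html>"),
]
unique = list(deduplicate(sample))
print(len(unique), "unique documents out of", len(sample))
```

Keeping only digests in memory, not the bodies themselves, is what makes this kind of check cheap enough to run over very large collections.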
“Improving search for users has been one of the biggest things that MirrorWeb have been able to achieve.”
MirrorWeb’s capabilities have given TNA, for the first time, the ability to index the whole of the web archive, which has helped them improve the search facility for users. A whole raft of content can now be indexed by the search facility, which also lets users narrow their search to a particular archived site.
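Narrowing a full-text search to one archived site maps naturally onto an Elasticsearch query with a filter and a facet. The sketch below only builds the request body as a Python dictionary. The field names `content` and `site` and the helper `build_site_search` are assumptions made for illustration, not the actual UKGWA index schema.

```python
import json
from typing import Optional


def build_site_search(text: str, site: Optional[str] = None,
                      facet_size: int = 10) -> dict:
    """Build an Elasticsearch request body for faceted full-text search."""
    body = {
        "query": {
            "bool": {
                # Full-text match on the (hypothetical) document text field.
                "must": [{"match": {"content": text}}],
            }
        },
        # Terms aggregation: counts of matching documents per site (the facet).
        "aggs": {"sites": {"terms": {"field": "site", "size": facet_size}}},
    }
    if site is not None:
        # Exact-match filter restricting results to one archived site.
        body["query"]["bool"]["filter"] = [{"term": {"site": site}}]
    return body


request_body = build_site_search("budget statement", site="www.gov.uk")
print(json.dumps(request_body, indent=2))
```

The filter clause does not affect relevance scoring, so it suits a yes/no restriction such as "only this site", while the aggregation supplies the per-site counts shown as facets.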
John Sheridan, Digital Director of The National Archives, said:
“What I’ve been most impressed with about MirrorWeb is their creative use of cloud computing technologies.
“For example, to index the entirety of our 120TB collection by spinning up a 1,000-plus-node cluster of computers to process that collection in just a couple of days, and to see that effective use of cloud computing, has been hugely impressive.
“MirrorWeb have brought some outstanding technical capabilities - in particular with data migration, cloud computing, search and new ways of harvesting and crawling content, as well as new ways of presenting that content and making it available.”
We are really excited about our upcoming work with TNA. We are currently looking at new ways of harvesting so that we can capture more content, in different ways, and improve the web archive service for end users, including overall search and user navigation.
Intuitive front-end website and intelligently designed management portal
We give a 99.999999999% availability guarantee on all data stored with us
MirrorWeb is ISO 9001 and ISO 27001 certified, so TNA know they are getting the best service
All data is replicated across multiple availability zones and regions, and multi-factor authentication is standard
We have 24/7/365 UK-based support to ensure that any problems are dealt with as soon as they are reported