Archiving Over 2000 Websites Per Year

The UK Government Web Archive (UKGWA) archives each and every UK Government website on a predefined schedule.

Archiving Hundreds of Social Media Accounts

Social media is the primary communications channel for UK government departments, so all departmental social media accounts are captured in near real time to ensure accurate records are kept.

Big Data & Full Text Search

The current data set exceeds 150TB, amounting to billions of documents, and is growing steadily. MirrorWeb indexes huge volumes of data at speed to deliver faceted full-text search across the whole of the UKGWA.

The National Archives

“The home to 1000 years of British history”

The National Archives is the archive of the UK government and the sector lead for all archives across the UK.

Watch our video interview with Digital Director John Sheridan to find out how MirrorWeb enables the long-term project to archive all UK government web and social media communications.


The National Archives’ Challenge

The way the government uses the web is changing. As the UK Government Web Archive (UKGWA) has grown in size and complexity, so have the expectations of its users, who now demand a reliable, comprehensive and intuitive search service as well as access to social media records.

In 2016, The National Archives (TNA) realised they needed to update their web and social archiving provision. They were looking for someone to:

  • Take their existing archives and modernise how they were managing the capturing and storing of content
  • Help them capture new content
  • Improve how they captured the government’s use of social media such as Facebook and Twitter

After a lengthy selection process, TNA awarded the contract to MirrorWeb based on their expertise in the cloud, partnership with AWS and their ability to archive social media as well as website data.

MirrorWeb’s Challenge

We needed to collect the archive from the previous supplier, who used a data centre in Paris, and move it over to Amazon. The data was stored on 72 hard drives of 2TB each (up to 144TB of raw capacity), which meant we required two Snowballs and two custom-built machines that allowed us to connect eight drives simultaneously and ingest the data as quickly as possible. We completed the transfer within two weeks.
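The sizing behind that migration can be checked with some quick arithmetic. The sketch below is illustrative only: the drive count, drive size and number of machines come from the text, while the 80TB Snowball capacity and the reading of "eight drives" as eight per machine are assumptions.

```python
import math

DRIVES = 72              # from the text
DRIVE_TB = 2             # from the text
MACHINES = 2             # from the text
DRIVES_PER_MACHINE = 8   # assumption: eight drives per machine, not eight in total
SNOWBALL_TB = 80         # assumption: nominal capacity of one Snowball device


def migration_plan(drives=DRIVES, drive_tb=DRIVE_TB, machines=MACHINES,
                   drives_per_machine=DRIVES_PER_MACHINE,
                   snowball_tb=SNOWBALL_TB):
    """Estimate raw capacity, Snowballs needed and drive-loading batches."""
    total_tb = drives * drive_tb
    snowballs = math.ceil(total_tb / snowball_tb)
    batches = math.ceil(drives / (machines * drives_per_machine))
    return {"total_tb": total_tb, "snowballs": snowballs, "batches": batches}


plan = migration_plan()
print(plan)
```

Under these assumptions the raw capacity fits on two Snowballs, and the 72 drives can be read in five batches of up to 16 drives.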

The next phase of our project was to develop a public-facing website that enabled full replay of all archives as well as full-text search, and was capable of serving over 75 million visitors per month. We decided the best way to deliver on TNA’s objectives was to build an indexing and search solution from scratch, which let us provide both video and image replay functionality as part of the service.

We ended up choosing Elasticsearch as our primary search technology for the following reasons:

  • Scalability was incredibly important: we spun up a very large cluster of more than 1,000 nodes to do the initial ingest of data, and were then able to scale it down to a more affordable level when we went live
  • We didn’t need to employ lots of Elasticsearch specialists
  • We could integrate it into the Amazon environment and monitor it with CloudWatch

We also knew that, as traditional tools such as Hadoop wouldn’t work for the project, we had to think outside of the box and develop our own tool instead - and we did. Our team came up with WarpPipe - essentially a rewrite of Hadoop for the cloud.

What we achieved

“MirrorWeb have increased the range of what we were harvesting, particularly around social media.”

MirrorWeb indexed 1.4 billion documents in 10 hours, an average of around 140 million documents per hour, and met our client’s requirement to deduplicate their data. We also introduced a lot of new capabilities for TNA, so that they can capture more of what the Government is doing on the web, as well as in other places where Government content is made available online.
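The case study does not describe how the deduplication was carried out. One common approach in web archiving is to compare a digest of each captured payload and keep only the first copy. The sketch below illustrates that general idea; the `deduplicate` function and the `(url, payload)` record shape are hypothetical and not MirrorWeb's implementation.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, bytes]  # hypothetical record shape: (url, payload bytes)


def deduplicate(records: Iterable[Record]) -> Iterator[Record]:
    """Yield each record whose payload has not been seen before."""
    seen = set()
    for url, payload in records:
        # Identical payloads produce identical SHA-256 digests.
        digest = hashlib.sha256(payload).digest()
        if digest in seen:
            continue  # duplicate content: skip it
        seen.add(digest)
        yield url, payload


captures = [
    ("https://example.gov.uk/a", b"<html>same page</html>"),
    ("https://example.gov.uk/b", b"<html>same page</html>"),
    ("https://example.gov.uk/c", b"<html>different page</html>"),
]
unique = list(deduplicate(captures))
print([url for url, _ in unique])
```

Storing only the digests keeps memory use proportional to the number of unique payloads rather than their total size.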

“Improving search for users has been one of the biggest things that MirrorWeb have been able to achieve.”

MirrorWeb’s capabilities have given TNA, for the first time, the ability to index the whole of the web archive, which has helped them improve the search facility for users. A whole raft of content can now be indexed by the search facility, and users can narrow their search to a particular archived site.
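As a rough illustration of narrowing a search to one archived site, the sketch below builds an Elasticsearch query body that combines a full-text match with an optional site filter and a facet count per site. The field names `content` and `site` and the helper `build_search_query` are assumptions made for this example; the real UKGWA index mapping is not described in the case study.

```python
from typing import Optional


def build_search_query(text: str, site: Optional[str] = None, size: int = 10) -> dict:
    """Build a query body: full-text match, optional site filter, site facet."""
    body = {
        "size": size,
        "query": {
            "bool": {
                # Full-text relevance match on the page content (assumed field).
                "must": [{"match": {"content": text}}],
                "filter": [],
            }
        },
        # Facet: document counts per site (assumes "site" is a keyword field).
        "aggs": {"sites": {"terms": {"field": "site", "size": 20}}},
    }
    if site is not None:
        # Narrow results to one archived site without affecting scoring.
        body["query"]["bool"]["filter"].append({"term": {"site": site}})
    return body


query = build_search_query("budget statement", site="www.gov.uk")
print(query)
```

The resulting dictionary could be passed as the request body of a search call; placing the site restriction under `filter` rather than `must` keeps it out of relevance scoring.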

John Sheridan, Digital Director of The National Archives, said:

“What I’ve been most impressed with MirrorWeb is their creative use of cloud computing technologies.

“For example, to index the entirety of our 120TB collection by spinning 1000 node plus cluster of computers to process that collection in just a couple of days, and to see that effective use of cloud computing has been hugely impressive.

“MirrorWeb have brought some outstanding technical capabilities - in particular with data migration, cloud computing, search and new ways of harvesting and crawling content, as well as new ways of presenting that content and making it available.”

The Future

We are really excited about our upcoming work with TNA. We are currently looking at new ways of harvesting content so that we can capture more of it and improve the web archive service for end users, including the overall search and user navigation.

The MirrorWeb UKGWA Service is:

Easy to Use

Intuitive front-end website and intelligently designed management portal

Reliable

We give a 99.999999999% durability guarantee on all data stored with us

Super Fast

By using Amazon Route 53 and Amazon CloudFront we deliver a lightning-fast service

Compliance Certified

MirrorWeb is ISO 9001 and ISO 27001 certified, so TNA know they are getting the best service

Secure

All data is replicated across multiple availability zones and regions, and multi-factor authentication is standard

Great Customer Support

We have 24/7/365 UK-based support to ensure that any problems are dealt with as soon as they are reported
