Case Study

The National Archives

Watch our interview with Digital Director John Sheridan to find out how MirrorWeb archives all UK government web and social media communications.

MirrorWeb have brought some outstanding technical capabilities - in particular with data migration, cloud computing, search and new ways of harvesting and crawling content, as well as new ways of presenting that content and making it available.

John Sheridan, Digital Director, The National Archives

Introduction

The home to 1,000 years of British history.

The National Archives' Challenge

The way the government uses the web is changing. As the size and complexity of the UK Government Web Archive (UKGWA) have grown, so have the expectations of its users, who now demand a reliable, comprehensive and intuitive search service as well as access to social media records. In 2016, The National Archives (TNA) realised they needed to update their web and social media archiving provision. They were looking for someone to:

  • Take their existing archives and modernise how they captured and stored content.
  • Help them capture new digital content including the government's social media channels.

MirrorWeb’s Challenge

First, we needed to collect and move the archive from the previous supplier. The data was stored on 72 hard drives of 2TB each, so we required two custom-built machines to connect the drives simultaneously and ingest the data. This was accomplished within two weeks. The next phase was to develop a public-facing web archive that was searchable and allowed archived sites to be replayed, whilst serving over 75 million visitors per month. We used Elasticsearch as our primary search technology for the following reasons:

  • Scalability - we spun up a very large cluster of more than 1,000 nodes for the initial ingest of data, then scaled it down to an affordable level when we deployed live.
  • We could integrate it into the Amazon environment and monitor it with CloudWatch.
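
As a rough illustration of the migration described above, the sketch below estimates the sustained transfer rate needed to read 72 drives of 2TB each within two weeks on two machines. The case study does not say how full the drives were or how the work was split, so the figures assume full drives, decimal terabytes, 14 days of continuous transfer and an even split between machines. Treat the result as an upper bound, not a measured value.

```python
# Assumptions (not stated in the case study): every drive is full,
# 1 TB = 10**12 bytes, and "two weeks" means 14 days of continuous transfer.
DRIVES = 72
TB_PER_DRIVE = 2
MACHINES = 2
DAYS = 14


def required_throughput_mb_s(drives, tb_per_drive, days, machines):
    """Return (overall MB/s, MB/s per machine) needed to finish in `days`."""
    total_bytes = drives * tb_per_drive * 10**12
    seconds = days * 24 * 60 * 60
    total_mb_s = total_bytes / seconds / 10**6
    return total_mb_s, total_mb_s / machines


total, per_machine = required_throughput_mb_s(DRIVES, TB_PER_DRIVE, DAYS, MACHINES)
print(f"{DRIVES * TB_PER_DRIVE} TB to migrate")
print(f"{total:.1f} MB/s overall, {per_machine:.1f} MB/s per machine")
```

Under these assumptions the archive is at most 144 TB, which works out to roughly 119 MB/s overall, or about 60 MB/s per ingest machine.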

What we achieved

MirrorWeb’s capabilities have given TNA, for the first time, the ability to index the whole of the web archive, which has significantly improved searchability for users. The search facility now indexes a wide range of digital content and lets users narrow their search to a particular archived site. John Sheridan, Digital Director of The National Archives, said: “MirrorWeb have brought some outstanding technical capabilities - in particular with data migration, cloud computing, search and new ways of harvesting and crawling content, as well as new ways of presenting that content and making it available. Improving search for users has been one of the biggest things that MirrorWeb have been able to achieve.”
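
To make the idea of narrowing a search to one archived site concrete, here is a minimal sketch of an Elasticsearch query body that combines a full-text match with a filter on the site. The field names `content` and `site` and the helper `build_site_search` are hypothetical; the case study does not describe the actual index mapping or queries used for the UKGWA.

```python
import json


def build_site_search(text, site=None, size=10):
    """Build an Elasticsearch query body for a full-text search.

    `content` and `site` are assumed field names, chosen for illustration.
    """
    query = {"bool": {"must": [{"match": {"content": text}}]}}
    if site is not None:
        # A term filter restricts results to one site without affecting scoring.
        query["bool"]["filter"] = [{"term": {"site": site}}]
    return {"size": size, "query": query}


body = build_site_search("flood defence funding", site="www.gov.uk")
print(json.dumps(body, indent=2))
```

A body like this could be sent to an index's `_search` endpoint. Leaving out `site` searches the whole archive.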

ISO-Certified & WORM Compliant Archives

Every archived file is time-stamped, immutable and stored in an ISO-compliant format to ensure authenticity and legal acceptance.
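
One common way to demonstrate that a time-stamped file has not changed is to record a cryptographic digest at capture time and recompute it later. The sketch below shows that general technique using SHA-256. It is an illustration only and is not taken from MirrorWeb's platform; the function names and record layout are assumptions. A check like this detects changes, whereas write-once (WORM) behaviour itself has to be enforced by the storage layer.

```python
import hashlib
from datetime import datetime, timezone


def make_fixity_record(payload: bytes) -> dict:
    """Record a digest, size and UTC capture time for an archived file."""
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size": len(payload),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(payload: bytes, record: dict) -> bool:
    """Return True if the payload still matches the recorded size and digest."""
    return (
        len(payload) == record["size"]
        and hashlib.sha256(payload).hexdigest() == record["sha256"]
    )


page = b"<html><body>Archived page</body></html>"
record = make_fixity_record(page)
print(verify(page, record))          # unchanged copy
print(verify(page + b" ", record))   # altered copy
```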

Automated Archiving

You define the frequency: daily, weekly or monthly crawls of your website and social media channels.

Replayable Web & Social Content

Fully indexed and searchable WARCs. Users can replay content and archived metadata at any time and curate collections in line with Dublin Core.
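
For readers unfamiliar with WARC, each record in a WARC file begins with a version line followed by named header fields such as `WARC-Type`, `WARC-Target-URI` and `WARC-Date`. The sketch below parses one such header block and derives a small index entry from it. The sample record, the `parse_warc_header` helper and the mapping onto the Dublin Core elements `dc:identifier` and `dc:date` are all illustrative assumptions, not a description of how MirrorWeb indexes its archives.

```python
# An abbreviated, made-up WARC record header used only for illustration.
SAMPLE_HEADER = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "WARC-Target-URI: http://example.gov.uk/\r\n"
    "WARC-Date: 2017-01-01T00:00:00Z\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_warc_header(block: str) -> dict:
    """Parse the version line and named fields of one WARC record header."""
    lines = block.split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record header")
    fields = {"version": lines[0]}
    for line in lines[1:]:
        if not line:  # a blank line ends the header block
            break
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


header = parse_warc_header(SAMPLE_HEADER)
# Hypothetical mapping of WARC fields onto Dublin Core elements.
index_entry = {
    "dc:identifier": header["WARC-Target-URI"],
    "dc:date": header["WARC-Date"],
}
print(index_entry)
```

In practice a dedicated WARC library would normally handle parsing. The point here is only that the capture URL and timestamp needed for search and replay are carried in the record header itself.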

Long-Term Preservation Guaranteed

All digital information is preserved and protected, ensuring digital content is never lost or made obsolete.

Compliance Requirements Met

Our digital archiving technology meets regulators' requirements: archived records are source-proof, tamperproof and immutable.

Scalable Cloud Technology

Our cloud-based platform can manage huge datasets and is light touch, requiring no infrastructure costs or extra resource burdens on customers.

A Single Searchable Archive

All digital assets are fully indexed and searchable in the platform, making it easier than ever to find online records and content.

Discovery Support

All archived website content can be made available to academic professionals, researchers, government bodies and other third parties for required purposes.

Data Sovereignty

The National Archives are in total control of when, where and how their data is archived, in compliance with ISO standards.