The National Archives
Watch our interview with Digital Director John Sheridan to find out how MirrorWeb archives all UK government web and social media communications.
“MirrorWeb have brought some outstanding technical capabilities - in particular with data migration, cloud computing, search and new ways of harvesting and crawling content, as well as new ways of presenting that content and making it available.”
The home to 1000 years of British history.
The National Archives' Challenge
The way the government uses the web is changing. As the size and complexity of the UK Government Web Archive (UKGWA) have grown, so have the expectations of users, who now demand a reliable, comprehensive and intuitive search service as well as access to social media records. In 2016, The National Archives (TNA) realised they needed to update their web archiving and social archiving provision. They were looking for someone to:
- Take their existing archives and modernise how they captured and stored content.
- Help them capture new digital content including the government's social media channels.
First, we needed to collect and move the archive from the previous supplier. The data was stored on 72 2 TB hard drives, so we required two custom-built machines to connect the drives simultaneously and ingest the data; this was accomplished within two weeks. The next phase was to develop a public-facing web archive that was searchable and allowed archives to be replayed while serving over 75 million visitors per month. We used Elasticsearch as our primary search technology for the following reasons:
- Scalability - we spun up a very large cluster of more than 1,000 nodes for the initial ingest of data, then scaled it down to an affordable level when we went live.
- We could integrate it into the Amazon environment and monitor it with CloudWatch.
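To give a sense of the scale of the migration described above, here is a rough back-of-the-envelope sketch. It takes the drive count, drive size and two-week window from the text, and assumes (hypothetically) that every drive was full, that terabytes are decimal, and that ingest ran around the clock. The function names are illustrative only, and the result is an upper bound rather than a measured figure.

```python
# Rough estimate of the migration's scale, using figures from the case study.
# Assumptions (not from the source): drives were full, 1 TB = 10**12 bytes,
# and ingest ran continuously for the whole two-week window.

DRIVES = 72          # hard drives received from the previous supplier
TB_PER_DRIVE = 2     # capacity of each drive, in terabytes
INGEST_DAYS = 14     # "accomplished within two weeks"


def raw_capacity_tb(drives: int = DRIVES, tb_per_drive: int = TB_PER_DRIVE) -> int:
    """Upper bound on the amount of data to move, in terabytes."""
    return drives * tb_per_drive


def required_throughput_mb_s(total_tb: float, days: float) -> float:
    """Average sustained throughput (MB/s) needed to move total_tb in the given days."""
    total_bytes = total_tb * 10**12
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 10**6


if __name__ == "__main__":
    capacity = raw_capacity_tb()
    rate = required_throughput_mb_s(capacity, INGEST_DAYS)
    print(f"Raw capacity: {capacity} TB")
    print(f"Average throughput needed: {rate:.0f} MB/s")
```

Under these assumptions the raw capacity is 144 TB, and moving it in 14 days needs an average of roughly 119 MB/s across both machines combined.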
What we achieved
MirrorWeb’s capabilities have given TNA, for the first time, the ability to index the whole of the web archive, which has significantly improved searchability for users. A wide range of digital content is now indexed by the search facility, and users can narrow their search to a particular archived site. John Sheridan, Digital Director of The National Archives, said: “MirrorWeb have brought some outstanding technical capabilities - in particular with data migration, cloud computing, search and new ways of harvesting and crawling content, as well as new ways of presenting that content and making it available. Improving search for users has been one of the biggest things that MirrorWeb have been able to achieve.”
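As an illustration of narrowing a search to a particular archived site, the sketch below builds a request body in Elasticsearch's Query DSL using a `bool` query with a full-text `match` clause and an optional `term` filter. The field names `content` and `site`, and the function name, are hypothetical; the source does not describe the actual index mapping or queries used for the UKGWA.

```python
# A minimal sketch of a site-narrowed full-text search body for Elasticsearch.
# The field names "content" and "site" are assumptions for illustration only;
# the real index mapping is not described in the case study.
import json
from typing import Optional


def build_search_body(text: str, site: Optional[str] = None, size: int = 10) -> dict:
    """Build a Query DSL body: full-text match, optionally filtered to one site."""
    bool_query: dict = {"must": [{"match": {"content": text}}]}
    if site is not None:
        # A term filter restricts results to an exact value without affecting scoring.
        bool_query["filter"] = [{"term": {"site": site}}]
    return {"size": size, "query": {"bool": bool_query}}


if __name__ == "__main__":
    body = build_search_body("flood defence guidance", site="www.example.gov.uk")
    print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the JSON body of a search request. Keeping the site restriction in `filter` rather than `must` means it narrows the result set without changing relevance scores.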
ISO-Certified & WORM Compliant Archives
Every archived file is time-stamped, immutable and stored in an ISO-compliant format to ensure authenticity and legal acceptance.
You define the frequency. Daily, weekly or monthly crawls for your website and social media channels.
Replayable Web & Social Content
Fully indexed and searchable WARCs. Users can replay content and archived metadata at any time and curate collections in line with Dublin Core.
Long-Term Preservation Guaranteed
All digital information is preserved and protected, ensuring digital content is never lost or made obsolete.
Compliance Requirements Met
Our digital archiving technology meets regulators' requirements, meaning archives are source-proof, tamperproof and immutable.
Scalable Cloud Technology
Our cloud-based platform can manage huge datasets and is light touch, requiring no infrastructure costs or extra resource burdens on customers.
A Single Searchable Archive
All digital assets are fully indexed and searchable in the platform, making it easier than ever to find online records and content.
All archived website content can be made available to academic professionals, researchers, government bodies and other third parties for required purposes.
The National Archives are in total control of when, where and how their data is archived, and the archive complies with ISO standards.