Why MirrorWeb Use AWS for Website & Social Media Archiving
August 30, 2018
We built MirrorWeb on the cloud, initially across three different cloud platforms to provide a high level of availability. However, that setup did not deliver the level of reliability we wanted for our customers.
That’s why we now run the entire MirrorWeb architecture on Amazon Web Services (AWS). This is the same virtual infrastructure used by brands like Netflix and Airbnb.
We’re going to take you through the benefits of AWS and how we use it to provide fast, secure, reliable, and searchable web and social media archiving for our customers.
Why do we use AWS?
One of the benefits of AWS over the traditional data centre model is that we don’t have to worry about running out of storage. In the past, using storage arrays (dedicated storage hardware) meant working within a finite storage capacity. That capacity was measured in gigabytes, whereas AWS allows us to think in petabytes.
At the moment, we store approximately 400 TB of data, and we know we can grow that into the petabytes thanks to the scalability of AWS. To put this into context, storing a single petabyte would take over 745 million floppy disks or about 1.5 million CD-ROMs.
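The floppy disk and CD-ROM figures can be reproduced with a short calculation. This is a sketch under assumptions the article does not state: binary units (1 PB taken as 2^50 bytes), a "1.44 MB" floppy treated as 1.44 MiB, and a 700 MiB CD-ROM.

```python
import math

# Assumed unit sizes (binary units; not stated in the article).
PETABYTE = 2**50          # bytes in one petabyte (strictly, one pebibyte)
MEBIBYTE = 2**20          # bytes in one mebibyte

# Assumed media capacities.
FLOPPY_BYTES = 1.44 * MEBIBYTE   # nominal "1.44 MB" floppy disk
CD_ROM_BYTES = 700 * MEBIBYTE    # 700 MB CD-ROM


def media_needed(total_bytes, media_bytes):
    """Return the number of media units needed to hold total_bytes."""
    return math.ceil(total_bytes / media_bytes)


floppies = media_needed(PETABYTE, FLOPPY_BYTES)
cds = media_needed(PETABYTE, CD_ROM_BYTES)

print(f"Floppy disks per petabyte: {floppies:,}")   # roughly 745.7 million
print(f"CD-ROMs per petabyte:      {cds:,}")        # roughly 1.53 million
```

With decimal units instead (10^15 bytes per petabyte), the counts come out somewhat lower, so the exact figures depend on which convention is used.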
Secures sensitive data
MirrorWeb archives some data that extends beyond public information. We have to ensure that it is held securely and encrypted.
We achieve this using AWS and the encryption built into Amazon S3 (object storage built to store and retrieve any amount of data from anywhere). This lets us provide the backup and assurance that a secure solution requires.
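As an illustration of what encryption within S3 can look like at the API level, the sketch below builds the arguments for an S3 PutObject request with server-side encryption enabled. The article does not say which encryption mode MirrorWeb uses; the bucket name, object key and helper function here are hypothetical.

```python
def encrypted_put_params(bucket, key, body, kms_key_id=None):
    """Build keyword arguments for an S3 PutObject call that requests
    server-side encryption (hypothetical helper, for illustration only)."""
    params = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        # Encrypt with a specific AWS KMS key.
        params["ServerSideEncryption"] = "aws:kms"
        params["SSEKMSKeyId"] = kms_key_id
    else:
        # Encrypt with S3-managed keys (AES-256).
        params["ServerSideEncryption"] = "AES256"
    return params


# Hypothetical bucket and key names.
params = encrypted_put_params(
    "example-archive-bucket", "crawls/example.warc.gz", b"archived bytes"
)
print(params["ServerSideEncryption"])
```

With boto3 installed and credentials configured, these arguments would be passed as `boto3.client("s3").put_object(**params)`.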
MirrorWeb CTO, Phil Clegg, discusses the benefits of digital archiving using AWS over the traditional data centre model.
The cloud market has matured and is no longer considered an inherent security risk. In fact, it is often deemed more secure than on-premise data storage.
That said, you want to consider the stability of the service provider. One of the reasons why MirrorWeb use AWS is that it is a global cloud service provider with a strong business model. So, it isn’t going anywhere any time soon.
This assures the availability of your data, whereas smaller providers may not offer the same long-term data retention.
Meets data sovereignty requirements
For many organisations, it’s important that they choose a website and social media archiving platform that has the capability to store their archives in local data centres.
AWS offers data centres across many regions. This gives us the capability to meet customer latency and data sovereignty needs.
Through Amazon Elasticsearch Service we are able to provide fast, scalable, flexible, and reliable search across an archive.
It also supports a plethora of search features integral to providing effective search for students, researchers and other users, including:
- Adjustable ranking
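One common way to adjust ranking in Elasticsearch is to boost some fields over others in the query. The sketch below builds a `multi_match` query body with per-field boosts; the field names `title` and `body` and the boost values are assumptions for illustration, not details of the MirrorWeb index.

```python
import json


def ranked_query(text, title_boost=3.0, body_boost=1.0):
    """Build an Elasticsearch query body that weights matches in the
    (hypothetical) title field more heavily than matches in the body."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # "field^N" multiplies that field's relevance score by N.
                "fields": [f"title^{title_boost}", f"body^{body_boost}"],
            }
        }
    }


print(json.dumps(ranked_query("budget statement"), indent=2))
```

Changing the boost values shifts which documents rank first without re-indexing any content.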
By using AWS, we can cut down on a lot of the overheads that come with looking after the infrastructure needed for a data archive. These include spending on space for local servers, power and hardware upgrade cycles.
This means our customers don’t need to work hard just to keep the lights on. We can help them focus on adding to and improving the user features of an archive.
These range from simple user interface improvements to more advanced capabilities, such as transferring large amounts of data for large-scale research projects.
AWS case study: The National Archives
One of the best ways of showing how AWS helps us provide a fast, secure and reliable digital archiving solution is by looking closer at one of our case studies.
In this case, we are going to look at how we moved, indexed and provided search functionality for the UK government archive of websites and social media communications, which included:
- Over 120 TB of data
- Over 4,800 websites dating back to 1996 (20 years’ worth of historic archives)
- Government social media accounts
- Thousands of archived YouTube videos
Moving the data with AWS
The UK Government’s digital archive was stored by the previous supplier in a data centre in Paris across 72 USB-3 hard drives. So, it wasn’t difficult getting our hands on the data itself.
We used devices called AWS Snowballs. They connect to your local network, copy and encrypt your data onto internal hard drives, and can then be shipped to an AWS data centre for transfer into the cloud.
With the ensemble of 72 hard drives, two custom-built PCs we brought along with us and two AWS Snowballs, we were able to move the entire 120 TB digital archive to the cloud in two weeks. Not bad for more than two decades of internet history.
Using AWS Snowballs helped us get around the problems typically associated with large-scale electronic data transfers. These include high network costs, long transfer times and security concerns.
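To give a feel for the "long transfer times" problem, the sketch below estimates how long a 120 TB archive would take to send over a network link. The link speeds are hypothetical, and the calculation assumes decimal terabytes and a link that is fully utilised around the clock, which is optimistic in practice.

```python
def transfer_days(size_tb, link_mbit_per_s):
    """Estimate days needed to send size_tb terabytes over a link of
    link_mbit_per_s megabits per second, assuming full utilisation."""
    size_bits = size_tb * 10**12 * 8              # decimal TB to bits
    seconds = size_bits / (link_mbit_per_s * 10**6)
    return seconds / 86_400                        # seconds per day


# Hypothetical link speeds, for comparison only.
for mbit in (100, 1_000):
    print(f"{mbit} Mbit/s: {transfer_days(120, mbit):.1f} days")
```

Under these assumptions a sustained 100 Mbit/s link would need about 111 days, and a sustained 1 Gbit/s link about 11 days, before accounting for network costs or interruptions.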
Indexing the data with AWS
To index the data, we used AWS Lambda. As each file arrived in Amazon S3, we ran a Lambda function against it.
In the end, we ran over 1.2 million Lambda executions as the data transferred into S3. Running 50,000 functions per hour, we indexed the data in 24 hours.
Previously, this would have required Hadoop (an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems).
But doing it using AWS Lambda enabled us to achieve our goal in rapid time and at an affordable price.
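The pattern described above, running a function each time a file lands in S3, can be sketched as a Lambda handler that reads the S3 event notification. This is a minimal illustration rather than MirrorWeb's actual code: `index_object` is a hypothetical placeholder for the real indexing work, and the bucket and key names are invented.

```python
from urllib.parse import unquote_plus


def index_object(bucket, key):
    """Hypothetical placeholder: real code would fetch the object from S3
    and send its contents to the search index."""
    return {"bucket": bucket, "key": key, "status": "indexed"}


def handler(event, context):
    """Lambda entry point invoked with an S3 event notification."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(index_object(bucket, key))
    return results


# Minimal sample event with invented names, for local testing.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-archive-bucket"},
                "object": {"key": "crawls/site+one.warc.gz"},
            }
        }
    ]
}
print(handler(sample_event, None))
```

Because each invocation handles only the objects named in its event, many copies can run in parallel as files arrive, without a cluster to manage.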
John Sheridan, Digital Director of the UK National Archives, discusses the AWS cloud-native digital archiving solution provided by MirrorWeb.
Making the data searchable with AWS
The next step was making the data accessible to the researchers, students and members of the public who need to use it. This meant our next job was to build a public-facing web archive where visitors could view archived websites and social media content in their original form, as well as search for content on specific topics.
The latter was a particular challenge for us. Search may seem like one of the most basic web technologies, but it can be complex to implement. This is because search engines don’t scan an entire set of documents one by one. They use indexes, which help them return useful results much faster.
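The idea of an index can be shown with a toy inverted index, which maps each word to the set of documents that contain it. This is a simplified sketch for illustration only; production search engines such as Elasticsearch add tokenisation rules, relevance scoring and distributed storage on top of the same principle. The sample documents are invented.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's documents and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample documents.
docs = {
    1: "budget statement 2010",
    2: "health budget review",
    3: "transport policy",
}
index = build_index(docs)
print(search(index, "budget"))
print(search(index, "health budget"))
```

A query only touches the entries for its own words, so lookup time depends on how many documents match rather than on the total number of documents in the archive.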
MirrorWeb was tasked with writing a complete replacement for the UK Government Web Archives’ previous search functionality. This meant we needed to index 1.4 billion documents from scratch, but we struggled to find an existing tool that would meet our specific need for indexing a large number of small files.
The search functionality itself is provided by Amazon Elasticsearch Service. This improves on The National Archives’ previous search engine in speed, scalability, flexibility, and reliability.
In terms of technical capability, Elasticsearch let us scale the cluster without downtime, reduced the load on our DevOps team, allowed us to manage access rights, integrated with Amazon CloudWatch for monitoring, and automatically replaced failed nodes.
For example, we now update the index once a month rather than once a quarter, so new archive content shows up in search results much sooner.
The MirrorWeb Platform
MirrorWeb delivers AWS cloud-native, state-of-the-art and ISO-compliant website and social media archiving. This allows organisations to create permanent, unalterable records of all online communications.
It also enables organisations to meet compliance obligations and ensures information of commercial, cultural or historical value is never lost.
By partnering with MirrorWeb, you’ll be using an AWS partner and a trusted and secure archiving service provider. We have extensive experience in understanding your requirements and deliver unparalleled service:
- State-of-the-art: offering support for web and social media data at large scale, as well as indexing for search and big data initiatives.
- Cloud-native: as an AWS partner, we offer near-unlimited capacity and scalability with complete control over data storage.
- ISO-compliant: we are ISO 9001 and ISO 27001 certified, and we archive our data in the secure, date- and time-stamped ISO 28500 standard WARC file (WORM) format.
- UK-based: we offer UK-based support 24/7/365. All archives are stored in local territories to meet data protection and compliance requirements.
- User-friendly: our best-in-class client portal puts you in control of your archives, allowing you to control archiving frequency, search and replay content, and view reports and notifications.
- Cost-competitive: we give you and your team full access to the MirrorWeb portal at all times, with no seat fee and no setup and maintenance fees.