
Insights from the AWS Public Sector Summit in Brussels

Marketing Team

MirrorWeb recently attended the AWS Public Sector Summit in Brussels, which covered a wide range of topics around Amazon Web Services (AWS) and cloud adoption in the public sector.

Phil Clegg, Chief Technical Officer of MirrorWeb, was in attendance and took part in the session "Accelerate Your Migration: How Customers are Approaching Large-Scale Migrations and Data Center Exits".

If you missed it, don’t worry. Below are highlights of what Phil discussed, organised under the headings that follow.

You can also watch the full video, accompanied by a transcript, at the bottom of the article.

MirrorWeb and The National Archives project  

 

  • MirrorWeb provides web and social media archiving to a number of organisations, including The National Archives, for whom it runs the UK Government Web Archive
  • Moving the UK Government Web Archive from a traditional data centre into the cloud

 

How MirrorWeb moved the UK Government Web Archive to AWS 

 

  • Experiences using AWS Lambda with The National Archives
  • The UK Government Web Archive encompasses 20 years of web archives and over 4,800 websites
  • Challenges of moving the UK Government’s 120 TB web archive, and how we did it in two weeks at low cost

 

The Value of Experimentation with AWS 

 

  • We offered full-depth, faceted search across the entire 20 years of archives
  • Processed 1.4 billion documents in 10 hours, averaging 144 million documents an hour

 

The Cost, Scalability and Storage Benefits of AWS 

 

  • The cost of creating an index of 120 TB of data and running 1,000 servers for 10 hours was remarkably low
  • Storage capacity is no longer something to worry about, thanks to the capabilities of AWS
  • Around 500 million users a month use the UK Government Web Archive

 

Managing Public Sector Web and Social Media Archiving Projects 

 

  • All of our public sector contracts are fixed price
  • One of the ways we’ve cut costs is by improving our understanding of the capacity needed to run the UK Government Web Archive website
  • Social media archiving is delivered at low cost

 

Trusting Data in the Cloud 

 

  • We transfer the data to The National Archives, as that is how they operate, but for other clients we do trust the cloud
  • We ensure copies and security protocols are in place to protect data

 

If you would like to see how state-of-the-art cloud-native archiving can benefit your organisation, download our free eBook 'A Guide to Website Archiving' today.


 

The Full Video and Transcript


Experience Using Lambda

To give some context, the work we did with The National Archives was the migration of the UK Government Web Archive from the previous supplier into AWS. The UK Government Web Archive is 20 years’ worth of web archives covering over 4,800 websites. These archives were stored in 100 MB web archive (WARC) files - roughly 1.2 million files in total - however, we didn’t have an index of what those files contained, just the raw files.

The first challenge we had was moving the data from the previous supplier to The National Archives, which happened over a period of about six months; the data was transferred to The National Archives on 2 TB USB hard drives - 72 of them in total. So that was our first challenge: “how do we get all that data into the cloud?” The first obvious thought was Snowball - Snowballs are pretty good, but getting the data from 2 TB hard drives into a Snowball was a challenge. We ended up loading two machines, each with eight USB channels that could run at full speed - meaning we had 16 drives going at once - and it actually worked, and it took two weeks.
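For a sense of what that drive-to-Snowball transfer might look like in practice, here is a minimal sketch, assuming the Snowball is addressed through its S3-compatible adapter and the drives are mounted locally. The endpoint address, mount points and bucket name are illustrative, not MirrorWeb’s actual setup.

```python
# Hypothetical sketch of the parallel drive-to-Snowball copy, assuming the
# Snowball's S3-compatible adapter is reachable on the local network and the
# 16 USB drives are mounted at /mnt/drive01 ... /mnt/drive16.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

# Example adapter address and bucket name - placeholders only.
s3 = boto3.client("s3", endpoint_url="http://192.168.1.100:8080")
BUCKET = "ukgwa-ingest"

def copy_drive(mount_point):
    """Walk one mounted USB drive and upload every file it holds."""
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, mount_point)
            s3.upload_file(path, BUCKET, key)

# Sixteen drives copied concurrently, one worker per drive.
drives = [f"/mnt/drive{i:02d}" for i in range(1, 17)]
with ThreadPoolExecutor(max_workers=len(drives)) as pool:
    list(pool.map(copy_drive, drives))
```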

We transferred the data, connecting all the drives up to the machines remotely. After two weeks we had two Snowballs that contained all the data. While that data sat there, Amazon launched the Snowball Edge, which we were quite excited about, but we couldn’t get one. So the question became how to index this data in a clever way - and we thought we could use Lambda functions: as the data hits Amazon S3, as each file gets submitted, we can run a Lambda function against it - which is what we did. We ended up running over 1.2 million Lambda executions as the data transferred into S3. It was a bit of a test, but it worked, and we indexed the data in 24 hours, running around 50,000 functions per hour.
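As an illustration of that pattern, the following is a minimal sketch of an S3-triggered Lambda handler, assuming each arriving WARC file fires an “ObjectCreated” event and the resulting index is kept in a DynamoDB table. The table and attribute names are hypothetical, since the talk doesn’t detail what the real index stored.

```python
# Hypothetical sketch: a Lambda handler fired by S3 "ObjectCreated" events
# as each WARC file lands, recording basic metadata in a DynamoDB table.
import boto3

dynamodb = boto3.resource("dynamodb")
index_table = dynamodb.Table("warc-file-index")  # hypothetical table name

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]

        # One item per WARC file: enough to know what we hold and where,
        # without downloading or parsing the 100 MB archive itself.
        index_table.put_item(
            Item={
                "warc_key": key,
                "bucket": bucket,
                "size_bytes": size,
                "event_time": record["eventTime"],
            }
        )
```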

Previously, this would have had to have been done with a Hadoop job, but this way we managed to do it rapidly. And the cost of using Lambda was astronomically low.

The Value of Experimentation

One of the things we love about Amazon is that you can experiment - in this case with the idea of running a Lambda function on the S3 ingest from the Snowballs. We didn’t know it would work, but we processed 1.1 million files, and there were only a few that weren’t processed, due to size limits, which we handled on an EC2 instance instead. One of the reasons we won the tender for the UK Government Web Archive is that we offered full-depth, faceted search across the entire 20 years’ worth of archives. Looking at the way people were doing it, full-text search indexing in other archives is generally done with Hadoop clusters, so, not having much experience with Hadoop ourselves, we contracted a Hadoop contractor, who initially quoted us two weeks to do the work - which was reasonable and budgeted for. But after those two weeks, the contractor came back and said he needed another eight weeks - and unfortunately we were launching in six weeks, so there was a problem.

With the experimentation we did with Lambda, we didn’t move any data out of S3; we processed it directly in S3. If you have a 100 MB WARC file, a web page is only a small part of it, so you don’t need to take the whole file and read it if you’re only indexing certain content. We had a list of file types to index - the most obvious being PDF, DOC, text and HTML content - while a lot of the content in the archive was material we weren’t interested in, such as CSS and JS, so quite a lot of the data was out of scope.
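A minimal sketch of that range-read idea, assuming a pre-built index supplies the byte offset and length of each record; the helper name, the list of indexable types and the assumption that records are individually gzip-compressed are illustrative rather than a description of MirrorWeb’s actual code.

```python
# Hypothetical sketch of reading a single record out of a WARC file in place,
# using an S3 byte-range GET rather than downloading the whole 100 MB file.
# The offset and length would come from a pre-built (e.g. CDX-style) index.
import gzip

import boto3

s3 = boto3.client("s3")

INDEXABLE_TYPES = {"text/html", "application/pdf", "application/msword", "text/plain"}

def fetch_record(bucket, key, offset, length, mime_type):
    """Return the raw bytes of one archived response, or None if out of scope."""
    if mime_type not in INDEXABLE_TYPES:
        return None  # CSS, JS and similar content never leaves S3
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    body = resp["Body"].read()
    # Records in .warc.gz archives are typically gzip-compressed individually.
    return gzip.decompress(body)
```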

We challenged our team with some out-of-the-box thinking - “can we leave the data in S3 and still process it?” As it turned out, some very good members of our team came up with a way of doing just that. We already had an index of everything that was in those files, but we didn’t yet know what content was inside each file, so we used the index to create a filter job, running on 350 EC2 instances which we spot-purchased.
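A hedged sketch of what spot-purchasing a fleet like that could look like with boto3; the AMI, key pair, instance type and bid price are placeholders, and the real job would also need the filter worker installed on the image or pulled at boot.

```python
# Hypothetical sketch of spot-purchasing the filter fleet.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.request_spot_instances(
    InstanceCount=350,              # one instance per filter worker
    SpotPrice="0.33",               # maximum hourly bid in USD (illustrative)
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",   # placeholder AMI with the worker baked in
        "InstanceType": "r4.xlarge",          # placeholder instance type
        "KeyName": "archive-workers",         # placeholder key pair
    },
)
request_ids = [r["SpotInstanceRequestId"] for r in response["SpotInstanceRequests"]]
```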

Once the filtering job had finished, it passed onto an indexing job, which ran on about 750 servers. The two jobs were running concurrently, which was another experiment. But within 10 hours we had built a full-text search index in Elasticsearch across 120 TB of data - amounting to 1.4 billion documents in those 10 hours.
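To illustrate the indexing side, here is a minimal sketch of workers pushing extracted documents into Elasticsearch in bulk batches; the endpoint, index name and document fields are assumptions, not the production pipeline.

```python
# Hypothetical sketch: each indexing worker sends batches of extracted
# documents to Elasticsearch via the bulk API.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://search-ukgwa-example.eu-west-1.es.amazonaws.com")  # placeholder endpoint

def index_batch(documents):
    """documents: iterable of dicts, e.g. {url, crawl_date, mime_type, text}."""
    actions = (
        {
            "_index": "ukgwa-fulltext",   # illustrative index name
            "_source": doc,
        }
        for doc in documents
    )
    helpers.bulk(es, actions)
```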

Only a few weeks before, our team had seen that the British Library had posted something they had done with an on-site Hadoop cluster, where they managed a record 10 million documents an hour. Well, we averaged 144 million documents an hour - and at some points this number was much higher because it was being indexed so quickly. This out-of-the-box thinking paid off, because Amazon gives us the ability to scale.

We’ve just requested 5,000 instances, which we can set up and talk about this time next year.

Facts & Figures

The astronomically low cost for creating the index of 120 TB of data was $25.00 - that was for 1.2 million Lambda executions. There is a free tier, which is part of why it was so low, but nevertheless it blew us away; we had to go back and look at the bill a few times to check that was really the cost. Then when we moved onto the second experiment, it was also very, very low. The cost of running 1,000 servers for 10 hours came in at about a 70% reduction, because they were spot-purchased (these were r4.xlarge servers), and that cost us $187.00 - which, again, we had to double-check on the bill. We were paying about $0.03 - $0.33 an hour. We ended up contacting support and requesting more instances to see how far we could push it.
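Spot prices move over time and between Availability Zones, which is why the hourly figures above are a range rather than a single number. As a rough illustration, recent prices for a given instance type can be checked like this; the region and instance type are examples only.

```python
# Hypothetical sketch: checking recent spot prices for an instance type,
# which is how you would size a bid before launching a fleet like this.
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

history = ec2.describe_spot_price_history(
    InstanceTypes=["r4.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
)
for price in history["SpotPriceHistory"][:10]:
    print(price["AvailabilityZone"], price["SpotPrice"], price["Timestamp"])
```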

We did experiment a few times, and we killed Elasticsearch many times - but that’s where the Amazon Elasticsearch Service came in really handy. We used 136 hours of r4.xlarge over the 10 hours; we just expanded the cluster until we found we could hit it that hard, and then we went for the full jump - but that only cost us $237.00 for the 10 hours at that cluster size, and afterwards we scaled it back down to the size that runs the archive.
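A minimal sketch of that scale-up-then-scale-down pattern against the Amazon Elasticsearch Service API; the domain name and node counts are illustrative, not the archive’s real configuration.

```python
# Hypothetical sketch: resize an Amazon Elasticsearch Service domain for the
# indexing run, then shrink it back down afterwards.
import boto3

es_admin = boto3.client("es", region_name="eu-west-1")

def resize_cluster(domain_name, node_count):
    es_admin.update_elasticsearch_domain_config(
        DomainName=domain_name,
        ElasticsearchClusterConfig={
            "InstanceType": "r4.xlarge.elasticsearch",
            "InstanceCount": node_count,
        },
    )

resize_cluster("ukgwa-search", 14)   # expand for the 10-hour indexing run (illustrative count)
# ... run the indexing jobs ...
resize_cluster("ukgwa-search", 3)    # scale back to the day-to-day archive size (illustrative count)
```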

When you’re doing data processing this big, there are lots of numbers we can pull out. Thinking back to past lives with data centres and nethubs and having limited storage - it’s not something we need to worry about anymore. The archive has grown by 30 TB in the last year, but it doesn’t matter, because the fidelity of the archives is better and we can go deeper. It’s this whole concept of throwing 120 TB of data into S3 in Ireland and having it automatically appear in London as a backup, and then we can make it go somewhere else, into Glacier and so on. It’s this capability that allows a small DevOps team to manage really large datasets.
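The storage behaviour described here - data landing in Ireland, appearing in London as a backup, and eventually moving to Glacier - maps onto S3 cross-region replication plus a lifecycle rule. A hedged sketch, with placeholder bucket names, role ARN and transition window:

```python
# Hypothetical sketch of the storage setup described above.
import boto3

s3 = boto3.client("s3")

# Cross-region replication: eu-west-1 (Ireland) -> eu-west-2 (London).
# Versioning must already be enabled on both buckets.
s3.put_bucket_replication(
    Bucket="ukgwa-archive-ireland",   # placeholder bucket names
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",  # placeholder role
        "Rules": [
            {
                "Status": "Enabled",
                "Prefix": "",
                "Destination": {"Bucket": "arn:aws:s3:::ukgwa-archive-london"},
            }
        ],
    },
)

# Lifecycle rule: tier replica objects down to Glacier after 30 days (illustrative window).
s3.put_bucket_lifecycle_configuration(
    Bucket="ukgwa-archive-london",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```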

The hits we get on the UK Government Web Archive, mainly from Google, are about 70 million a month, which averages out to around 500 million users a month using the service. A lot of UK Government sites have been taken down and now redirect through to the archive automatically. There are a lot of government departments which have all been merged over the years into GOV.UK - so all of those sites are in there, and people are still using them through our service.

We’ve recently been in talks with the Welsh Government (a contract we’ve recently won), and they said that since we launched full-text search across the UK Government Web Archive, requests to their archiving service have dropped, because people are able to find things through the platform. So there are many statistics, many successes and more exciting ideas that we want to play around with. And that’s the thing: you can play around with these ideas, and if they don’t work, it’s okay, because you haven’t actually spent that much money. The whole concept of running 1.2 million functions and it only costing $125.00 is a bit crazy.
