How to Archive your Website: The Definitive Guide

Learn how to archive your website, the different methods available, common misconceptions, and why it's important for compliance and digital preservation.

Welcome to How to Archive your Website: The Definitive Guide. This is a resource for those looking to understand exactly what website archiving is, how it's used to capture online information and how you can archive your website. This resource will be useful to a variety of business professionals across digital, compliance, marketing and information governance.
After reading this guide, you'll understand:

  • What website archiving is
  • How you can archive your websites
  • Why web archiving has become essential for businesses
  • How different industries benefit from web archiving
  • The challenges in capturing and archiving modern websites
  • Common misconceptions around archiving

What is website archiving?

Website archiving is the process of collecting websites and the digital assets they contain and preserving these in a digital archive. Web archiving can be compared to the traditional archiving of files and documents; the information is selected, stored, preserved and made available to people. Access is usually provided to the archived websites, for use by businesses, governments, universities, organisations, researchers, historians and the public.

The process involves ‘harvesting’ websites from their locations on the live web using crawl-based software. A crawler travels across the web and within websites, extracting and saving information as it goes. As you may have guessed, the web crawler plays a huge role in just how accurate a website capture is.
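
To make the harvesting process concrete, here is a minimal sketch in Python of the core crawl loop: a breadth-first crawler that stays within one site and stores every page it fetches. The start URL is a placeholder and the `requests` and `beautifulsoup4` libraries are assumed; a production archival crawler would also capture assets (images, CSS, scripts), respect robots.txt and write WARC files.

```python
# A minimal sketch of crawl-based harvesting: breadth-first traversal of
# one site, saving each fetched page in memory. Illustrative only.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                    # assumed third-party dependency
from bs4 import BeautifulSoup      # assumed third-party dependency

START_URL = "https://example.com/"  # placeholder site to archive
MAX_PAGES = 50                      # keep the sketch bounded

def crawl(start_url: str, max_pages: int) -> dict[str, str]:
    """Fetch pages reachable from start_url, staying on the same host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    saved: dict[str, str] = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # skip unreachable pages
        saved[url] = response.text  # a real archive stores every asset, not just HTML
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return saved

if __name__ == "__main__":
    pages = crawl(START_URL, MAX_PAGES)
    print(f"Captured {len(pages)} pages")
```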

Once a crawl of a website (or websites) is complete, the archived websites are made available in a single digital archive as a form of collection. These can be replayed and navigated just as they would be on the live web, but are preserved as immutable records of what was published at a particular point in time.

The most common types of website archiving

We'll cover some core technical details first (so bear with us). There are currently three main methods for archiving web content:

  • Client-side website archiving
  • Transaction-based archiving
  • Server-side archiving

Client-side archiving is the most popular method due to its simplicity and scalability: it can be carried out remotely and on a large scale. Transaction-based and server-side archiving are more historic approaches, as they require active collaboration with the server owners and need to be implemented on a case-by-case basis.

When an organisation or business needs to archive their websites, they usually require one of the following outputs:

1. A centralised private web archive
This is where a business captures its websites on a pre-defined, automated schedule and stores the records in a digital archive that's accessible only to them (or to whoever they grant access).

The archive is generally accessed through an online portal and provides technology that allows complete replay of archives from specific dates and times. There are also likely to be other tools built into the platform such as the ability to compare content, export archives as images and a variety of search tools to quickly and easily find information stored in the archive.

2. A public-facing web archive
This is used in two instances. In the first, a large government-based organisation may want to provide public access to historical online information of huge cultural significance; such an archive may also serve research purposes and be created with long-term preservation in mind.

The second instance is where an organisation or business wants to retire large areas of its website, usually by placing them on a subdomain that exists as a form of online archive. This can also be considered a web continuity solution, as these old or 'redundant' pages and websites are moved into an archive which visitors can still view. Businesses do this to remove website bloat and improve the speed of their primary website, while ensuring a continuous web experience when visitors access the archived portion of their site.

What are web crawlers?

A web crawler, sometimes called a 'spider' or 'spiderbot' (but often shortened to crawler), is an internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing.

Web search engines and other applications use web crawling to update their own web content or their indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so users can search more efficiently.

Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed.
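
Politeness is commonly implemented by honouring a site's robots.txt file and rate-limiting requests. Here is a small illustration using only Python's standard library; the host and paths are placeholders.

```python
# Illustrates crawl "politeness": consult robots.txt before fetching
# and pause between requests. Standard library only; URLs are placeholders.
import time
from urllib import robotparser

SITE = "https://example.com"   # placeholder host
DEFAULT_DELAY = 1.0            # assumed fallback delay in seconds

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()                  # fetch and parse the site's robots.txt

# Prefer the site's declared crawl delay, if it specifies one
delay = parser.crawl_delay("*") or DEFAULT_DELAY

for path in ["/", "/private/", "/news/"]:   # hypothetical paths
    url = SITE + path
    if parser.can_fetch("*", url):
        print(f"allowed: {url}")
        time.sleep(delay)      # rate-limit between requests
    else:
        print(f"disallowed by robots.txt: {url}")
```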

Because the number of pages on the internet is extremely large, even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000; today, relevant results are returned almost instantly. Crawlers can validate hyperlinks and HTML code, and they can also be used for web scraping.

Examples of existing web crawlers

The following is a non-exhaustive list of crawler architectures and open source crawlers currently available:

  • Googlebot: Googlebot is the web crawler software used by Google, which collects documents from the web to build a searchable index for the Google Search engine.
  • WebCrawler: WebCrawler is a web search engine, and is the oldest surviving search engine on the web today. For many years, it operated as a metasearch engine. WebCrawler was the first web search engine to provide full text search.
  • Heritrix: Heritrix is a web archival crawler designed for capturing periodic snapshots of the web. It was developed by the Internet Archive and the Nordic national libraries, with its first official release in January 2004. However, due to limited ongoing development and the rapid evolution of the web, Heritrix is only capable of capturing older and simpler forms of web content.

What is a website archive?

Simply put, a web archive can be defined as a digital repository where archived files are stored. These website records are typically archived in the industry-standard format known as 'WARC'.

This is a specified method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format that has traditionally been used to store web crawls as sequences of content blocks harvested from the web.
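
To make this concrete, here is a short sketch using warcio, an open-source Python library for reading and writing WARC files (pip install warcio). The URL and filename are placeholders; this is an illustration, not the method any particular archiving platform uses.

```python
# Writing and reading a WARC file with the open-source 'warcio' library.
from warcio.capture_http import capture_http
import requests  # per warcio's docs, import requests after capture_http
from warcio.archiveiterator import ArchiveIterator

# Write: record a live HTTP exchange into an aggregate WARC file
with capture_http("capture.warc.gz"):
    requests.get("https://example.com/")

# Read: iterate over the records stored in the archive file
with open("capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.rec_headers.get_header("WARC-Date"))
```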

WARC is recognised as the industry standard for web archival and supports the harvesting, access, discovery and exchange needs of organisations and businesses. Besides the visual (on-page) content that's recorded, secondary content is captured, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.

Another key feature is that WARCs are immutable, meaning they are unalterable; this key distinction allows them to be used as legal evidence. They can also be replayed through any replay technology that follows the industry standard, which means your archives can be transferred from one archiving provider to another if necessary.

Misconceptions around website archiving

Screenshots are not a form of archiving

Many of us use screen capture technology to reference visual content, take notes, or manipulate images for a required purpose. It's for this reason that we're often asked how website archiving and taking screenshots are different from one another. Here are some important distinctions to make in terms of what's captured, how they can be used and what the implications may be for you.

Capturing a website requires technology that can both automate and initiate a crawl, collecting all of the digital assets required to replay the website as it appeared on that date. Once completed, this record provides a fully functioning representation of your webpage or website that you can navigate as if the website were still live.

This is vastly different to a screenshot, which is a static image with no interactivity. With website archiving, the collection process means that every archive is captured with a timestamp and digital signature, so the records are immutable and can also be used as a form of legal evidence.

It's typical for a web archiving platform to also index all of your archives, making them available for search so you can find specific content from a specific date. This is essential for many organisations that have complex websites or need to archive third-party sites that sit within their organisation.
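
As a simplified illustration of what indexing involves, the sketch below builds a toy inverted index mapping each word to the archived captures that contain it; production platforms typically use a search engine such as Elasticsearch for this. The capture data is invented.

```python
# A toy inverted index over archived captures: each term maps to the
# captures containing it. The capture data below is invented.
from collections import defaultdict

captures = [
    {"url": "https://example.com/pricing", "date": "2023-01-02", "text": "Our fees start at 1%"},
    {"url": "https://example.com/pricing", "date": "2023-06-15", "text": "Our fees start at 2%"},
    {"url": "https://example.com/about",   "date": "2023-06-15", "text": "Founded in 1996"},
]

index: defaultdict = defaultdict(set)
for i, capture in enumerate(captures):
    for term in capture["text"].lower().split():
        index[term].add(i)

def search(term: str) -> list:
    """Return every archived capture whose text contains the term."""
    return [captures[i] for i in sorted(index.get(term.lower(), set()))]

for hit in search("fees"):
    print(hit["date"], hit["url"])   # both dated captures of the pricing page
```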

When thinking about archiving and screenshots it's important to recognise the context behind archiving.

One of the biggest drivers for website archiving is to meet compliance requirements across a range of industries. For example, regulated firms in financial services must capture records of web channels to comply with record-keeping and financial promotions regulations. Meanwhile, universities and higher education organisations archive their websites to adhere to the requirements set forth by the Competition and Markets Authority (CMA). Finally, governments are mandated to archive all central government websites under legislation such as the Freedom of Information Act.

With this in mind, organisations cannot meet compliance requirements by taking screen captures, as this method fails the standards for record retention. Simply put, screenshots are images that can be altered or changed, which means they can't be used as a valid form of evidence in a legal case or regulatory investigation.

The process of taking screen captures also drains significant resources from organisations and is prone to record-keeping errors. This is because it's dependent upon constant manual surveillance, meaning any team tasked with it would have to be aware of every change across their digital content and be in a position to capture it daily without missing anything.

The lack of in-house technology, the demands of record retention and the inefficiency of this method mean regulated businesses have turned to website archiving solutions as standard, ensuring they effectively manage their digital estate and remain compliant.

Why a Website Backup isn't an Archive

We've summarised the key differences between a backup and an archive here. However, we'll cover all of the details below.

In short, a backup only stores data that's needed for operational recovery. This data copy is used to recover data that's been corrupted or lost; for this reason, its use and purpose are very different from those of a web archive:

  • A backup can be changed or altered, which means it doesn't meet compliance record retention requirements for regulated businesses.
  • Backups are difficult to retrieve and take time to reinstate; this is a big problem for organisations that need to provide third-party access on demand, especially if there's a regulatory investigation or audit.
  • Website archives are held in a fully searchable platform, allowing you to easily find content and then use content comparison tools to quickly evidence when changes were made (sketched after this list).
  • Backups wouldn’t be classified as a record of what was publicly available at a specific point in time, but rather a dataset that helps ensure data recovery is possible.
  • Every website archive has a timestamp and digital signature, proving the authenticity of the record.
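
As a simple sketch of the content comparison idea mentioned above, two archived snapshots of the same page can be diffed to evidence exactly what changed between captures. This uses Python's standard difflib; the snapshot text is invented.

```python
# A toy content comparison between two archived snapshots of one page,
# using the standard library's difflib. Snapshot text is invented.
import difflib

snapshot_jan = ["Fees start at 1%", "Contact us on 0800 000 000"]
snapshot_jun = ["Fees start at 2%", "Contact us on 0800 000 000"]

# unified_diff labels each change with the capture it came from
for line in difflib.unified_diff(
        snapshot_jan, snapshot_jun,
        fromfile="archived 2023-01-02", tofile="archived 2023-06-15",
        lineterm=""):
    print(line)
```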

Time-stamping is a computer-readable date and time that the crawler applies to each file it harvests. This ensures that the archived website is an accurate representation of the website at the time it was archived. For websites which use only flat HTML, backup copies are acceptable where they include dates of creation and changes within the backup files.
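
As a simplified illustration of how a timestamp and content fingerprint support immutability, the sketch below hashes an archived payload at capture time; anyone can later recompute the digest to prove the record hasn't been altered. WARC records carry comparable values in headers such as WARC-Date and WARC-Payload-Digest. The payload here is invented.

```python
# Simplified illustration: a digest recorded at capture time lets anyone
# verify later that the archived bytes are unchanged. Payload is invented.
import hashlib
from datetime import datetime, timezone

archived_bytes = b"<html><body>Fees start at 1%</body></html>"

# Recorded alongside the capture
capture_time = datetime.now(timezone.utc).isoformat()
stored_digest = hashlib.sha256(archived_bytes).hexdigest()

def verify(payload: bytes, digest: str) -> bool:
    """True if the payload still matches the digest stored at capture time."""
    return hashlib.sha256(payload).hexdigest() == digest

print(capture_time, verify(archived_bytes, stored_digest))    # -> True
print(verify(archived_bytes + b" tampered", stored_digest))   # -> False
```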

As we've briefly touched upon, archiving websites can also give organisations the chance to provide access to legacy information that they may not necessarily want to keep on their ‘live’ website.

These can be retired to form a 'public archive', keeping the organisation's main website free from bloat. Two great examples of this are the UK Parliament Web Archive and The National Archives Web Archive, both created using MirrorWeb technology.

Evidence from the Web Continuity initiative at The National Archives shows a significant and ongoing user demand for access to older content that an organisation may consider out of date or unimportant.

The National Archives know all about this as holders of an archive that contains over 5,000 websites from 1996 to the present, including tweets and videos from government social media accounts. The data footprint of the archive is over 120TB – all of which needed to be moved to the cloud as a public facing archive.

Why is Website Archiving Important?

Businesses, governments and organisations create websites as part of their communication with the public because they are a powerful tool for marketing and sharing information. A website exhibits the brand, values and persona of a business; websites document the public character of organisations and their interactions with their audiences and customers. In addition, the web has become the primary place where we search for and obtain information. Because of this, a website is considered a crucial public record of a business, organisation or individual at a specific point in time.

Because websites provide free access to information, they're regularly updated and constantly changing. This is one of the web's great strengths: information is published and removed as required. However, it also means that published information can disappear as quickly as it appeared.

The 'cyber cemetery' grows every day, and on an ever larger scale. Cast your mind back to sites such as MySpace, Bebo, AOL, GeoCities, Photobucket and Napster: they're considered things of the past that nobody can revisit. This is evidence of just how much has already passed through the web and disappeared.

So, if you’re a global asset manager with multiple websites (along with intermediaries), it’s easy to see how obtaining accurate web records is a challenge when content is being published, changed and removed day-to-day.

The History of the Web

Much of the early web and the information it once held has now disappeared forever; from the early 1990s to around 1997, very little web content survived. This was before the ongoing value of legacy information published online was recognised, and before the first web archiving activities began in 1996.

Since the 1990s, as well as becoming culturally significant, the web has become a significant hub of information. As a result, it has become integrated into activities such as research, referencing and quotation. Activities that used to rely on physical records now increasingly use and link to pages and documents held on websites. Web archiving is therefore a vital process, ensuring that people and organisations can access and re-use knowledge in the long term and meet their information-retrieval needs.

Web archives should be harvested in their original form and be capable of being delivered as they were on the live web, providing a record of web content as it was available at a specific date and time. When a website is archived, the context of the information it provides is maintained, meaning that users can view the information in the context in which it was originally presented.

For Financial Services and Insurance

After the 2008 financial crisis, the financial services industry was shaken up to protect consumer interests and improve transparency. The result meant that regulated firms must operate under a set of strict regulatory rules (which continue to evolve and change), some of which, tie directly into web archiving. Regulators across the world such as FINRA (Financial Industry Regulatory Authority), SEC (U.S. Securities and Exchange Commission), ESMA (European Securities and Markets Authority) and the FCA (Financial Conduct Authority) all require firms to capture accurate web records as a result of record-keeping and financial promotions regulation.

To give an example, the FCA (Financial Conduct Authority) defines the internet as a vehicle for marketing financial promotions (which carry strict record-keeping rules) in PERG 8.22 ('The Internet') of the FCA Handbook:

The Internet is a unique medium for communicating financial promotions as it provides easy access to a very wide audience. At the same time, it provides very little control over who is able to access the financial promotion.

The test for whether the contents of a particular website may or may not involve a financial promotion is no different to any other medium. If a website or part of a website, operated or maintained in the course of business, invites or induces a person to engage in investment activity or to engage in claims management activity, it will be a financial promotion. The FCA takes the view that the person who caused the website to be created will be a communicator.

A record of every financial promotion must be retained and available in the event of a customer complaint or regulatory investigation. This is not the only requirement set forth by the FCA either: under CONC 3.3.1, firms are required to evidence what was published on a webpage at a specific point in time.

FCA COBS 4.2.1 states that a firm must ensure that a communication or a financial promotion is fair, clear and not misleading. Without a legally admissible website record or archive, these online promotions are at risk of non-compliance.

As mentioned, many of the same record-keeping and regulatory requirements are imposed by regulators across the world. In the US, the SEC and FINRA regulate the market, and across Europe you'll find the European Council and ESMA.

Many investment banks, asset managers and large financial services firms communicate through hundreds of websites (due to third-party brokers or intermediaries), all of them actively promoting services and products. With every firm publishing new articles and promotions daily, making real-time changes to fund information, and updating sensitive documents such as policies or terms, it's easy to see how this has quickly become a compliance nightmare. How do you capture accurate and legally admissible records of websites that are constantly changing and publishing a multitude of content every day?

Whilst there are no free or open-source tools that can solve this requirement, there are commercial web archiving solutions available to meet these compliance needs (the MirrorWeb Platform being one of them).

For Brands

The world's leading brands now create and publish a huge amount of online content in addition to traditional brand assets such as printed ads. This has driven brand archiving not only to preserve brand legacy but also to capture accurate records of what was communicated to customers at a specific point in time. Brands often utilise archives in other ways too; for example, a searchable archive containing these digital records can be used to inspire the next generation of marketers, allowing them to revisit a piece of their digital heritage.

Because a web archive (WARC) file is immutable, the records can also be used as evidence in legal proceedings to show when a product, patent or brand message was marketed across a company's web channels. This helps protect the brand from potential damage or risk in relation to its intellectual property.

Finally, for many brands there's untapped potential to draw insights from the data held across their web channels. This unstructured data can be sliced up and analysed to uncover historical trends or to perform keyword analysis; for example, which keywords or topics were we most vocal about in content published between 2016 and 2017?

Customer-led businesses and iconic brands also trust in commercial web archiving solutions. This ensures they can access legal records of what was communicated to customers on their website at a specific point in time; in essence, this becomes their very own 'digital truth and proof'. For many brands, the technology is used to ensure their digital legacy isn't lost as the brand evolves and changes. Pernod Ricard is a great example of a brand currently archiving for this very reason.

For Public Sector

Numerous national archives, libraries, governments and universities archive website data to preserve records of cultural and historical significance. This need is also driven by legislation such as the UK Public Records Act 1958 and, more recently, the Freedom of Information Act 2000.

As the public sector invests and utilises more digital channels, organisations are looking for ways to evolve their website archiving capabilities, taking advantage of:

  • Cloud-based archiving: To allow for more efficient and flexible storage of large data sets and to deploy future-proof technology.
  • Indexing and search capabilities: To make data useful to researchers, civil servants, students and members of the public (including public-facing portals such as The UK National Archives).
  • The ISO standard WARC file format: An official standard which helps organisations store born-digital or digitised materials.

How to Archive Your Website

Whilst there are a variety of archiving vendors in the commercial space, few truly specialise in web archiving. There are multiple reasons for this; for example, many vendors focus on capturing other forms of communication such as email or SMS, and as a result their business and archive capabilities are built around that technology.

Web archiving is much more complex due to the ephemeral nature of the web. The web is constantly evolving and changing, which means the technology developed around it needs to evolve too. Faced with this challenge, many vendors have created inefficient solutions, resulting in web records that are broken, inaccurate or unusable.

At MirrorWeb, we've built our own in-house crawler known as 'Electrolyte', which is capable of capturing complex JavaScript and emulating a real visitor's browsing experience.

The ability to archive websites accurately and at scale has been central to MirrorWeb's mission since day one; it's the reason we're trusted by the UK Government for web archiving. Our platform was built to automate the entire process for marketing and compliance teams.

The MirrorWeb Platform

The MirrorWeb Platform is built to capture and archive your website and social channels, no matter the size, no matter the complexity. Through a unified combination of bulletproof crawl technology and an in-house auto QA system, we create the most accurate web archives available.

For businesses using location-based instances of their website, captures can be made from the required local jurisdiction, ensuring accuracy of records, data sovereignty and the ability to retrieve exactly what was seen by the customer at a specific point in time. Once a web channel has been crawled, it's indexed and instantly available, ensuring you're audit-ready at all times.

You own your data, control data sovereignty and enjoy the benefits of a cloud-based SaaS platform, meaning there's no software to configure or install.

MirrorWeb has a proven track record in helping clients across a range of sectors, including financial services, the public sector and even brand archivists, to meet their requirements in capturing immutable web records of their organisation’s websites.
Our solution improves an organisation’s operational efficiency and compliance with features that include:

  • Complete archives - We archive all website content, including digital assets from internal and external sources, images, video and metadata.
  • Cloud-native solution - MirrorWeb are partnered with AWS to deliver a turnkey, scalable and future-proof solution in a fully secure AWS S3 environment.
  • Captured in original format - Every archive is captured in real-time, as it was on the day it was published.
  • Full text search - With Elasticsearch technology, all archives are indexed and searchable.
  • Sophisticated user portal - Through the platform, users can replay content from the day it was archived and manage archive crawl frequency and parameters. Crawl reports are also available, along with a breakdown of MIME types. Users can also create access privileges through groups and policies.
  • Public portal - Where there is a need for public access to archives, specifically within government and national archives, we’ve developed a proprietary portal that integrates with the user portal to provide access for sharing records of cultural and historic significance with the general public.
  • Meet compliance requirements - All archived records are stored in the ISO-standard WARC format, including date and timestamps. This means firms always have on-demand access to their ‘digital truth’.
  • Data sovereignty - Through cloud technology, archives are stored in local territories in environments that are ISO 9001 and ISO 27001 certified and GDPR compliant.

To take a look at the platform, simply request a demo and a member of the team will be in touch shortly!

THE SOLUTION

A platform to capture and archive web channels at scale.

Using MirrorWeb’s archiving platform, firms can capture fully compliant records of their websites. Every daily web archive is captured based on geo-location and device, and includes dynamic content such as personalisation. Once archived, the records can be replayed, searched and filtered at any time in the platform.

ISO-Certified & WORM Compliant Archives

Every archived file is time-stamped, immutable and stored in an ISO-compliant format to ensure authenticity and legal acceptance.

Automated Archiving

You define the frequency. Daily, weekly or monthly crawls for your website and social media channels.

Replay Your Websites

Our crawl tech beats the rest. Replay your websites and social media channels with full pixel-for-pixel accuracy.

Download Your Archives

All of your archives are available as a downloadable PNG or PDF to support your record-keeping processes.

A Single Searchable Archive

All digital assets are fully indexed and searchable in the platform, making it easier than ever to find online records and content.

Cloud RegTech Solution

Our cloud-based platform is light touch, requiring no infrastructure costs or extra resource burdens on customers.

Content Comparison

Identify specific content in your archive and review changes with our content comparison tool.

eDiscovery Support

All archived web and social content can be made available to eDiscovery professionals, litigators and other third parties for investigative purposes.

Data Sovereignty

Stay in total control of your data by choosing where it's archived, ensuring full compliance with ISO standards.