Website archiving: everything you need to know
Welcome to Website archiving: everything you need to know. This is a resource for those looking to understand exactly what website archiving is, how it's used to capture online information and why it's important. This resource will be useful to professionals and academics who are new to the concept of web archiving along with those who are more knowledgeable.
After reading this guide, you will understand:
- What website archiving is
- Why and how web archiving has become important
- How you can archive your website
- How different industries benefit from web archiving
- The challenges behind capturing and archiving modern websites
- Why a website screenshot or back-up is different to web archiving
This guide will be useful to a variety of professionals in roles across marketing, digital, compliance, research, archives, and records management - essentially any individual who's responsible for the organisation's digital web presence, record-keeping compliance or long-term preservation. Further on in this guide, we dive into how website archiving can benefit different industries and job functions.
What is website archiving?
Website archiving is the process of collecting websites and the information they contain from the World Wide Web and preserving these in an archive. Web archiving is a similar process to traditional archiving of paper or parchment documents; the information is selected, stored, preserved and made available to people. Access is usually provided to the archived websites, for use by businesses, governments, universities, organisations, researchers, historians and the public.
As the web contains a massive number of websites, digital assets and pieces of information, digital teams and web archivists typically use automated processes to collect websites. The process involves ‘harvesting’ websites from their locations on the live web using crawl-based software. A crawler travels across the web and within websites, extracting and saving the information as it goes. As you may have guessed, the web crawler plays a huge role in just how accurate a website capture is. Due to the complexity of modern websites, capturing a website with pixel-for-pixel accuracy has become a challenge for all archiving vendors.
Once a crawl of a website (or websites) is complete, the archived websites and the information they contain are made available as part of a web archive collection. These can be replayed and navigated just as they would be on the live web, but they are preserved as records of what was published at a particular point in time.
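To make the harvesting step more concrete, here is a minimal sketch of a crawler in Python. It is an illustration only, not how any particular archiving product works: the `crawl` function, its `fetch` argument (a callable you supply that returns a page's HTML, or None on failure) and the `max_pages` limit are all assumptions made for this example. It follows links within a single host and saves each page as it was served; real archival crawlers also capture images, stylesheets, scripts and HTTP headers.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl of one host. Returns {url: html}.

    `fetch` is a hypothetical callable: it takes a URL and returns the
    page's HTML as a string, or None if the page cannot be retrieved.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    captured = {}

    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html  # save the page as it was served

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

    return captured
```

Passing `fetch` in as an argument keeps the traversal logic separate from the network code, so the same loop can be exercised against a dictionary of test pages or wired up to a real HTTP client.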
The most common types of web archiving
There are 3 main technical methods for archiving web content:
- Client-side web archiving
- Transaction-based web archiving
- Server-side web archiving
Client-side archiving is the most popular method due to its simplicity and scalability: it can be carried out remotely and on a large scale. Transaction-based and server-side archiving are more of a historic approach, because they require active collaboration with the server owners and need to be implemented on a case-by-case basis.
When an organisation or business is looking to archive their website, they are usually looking for one of the two following outputs:
1. A centralised private web archive
This is where an individual, organisation or business archives their websites on a pre-defined schedule and stores these in a private archive only accessible to them. The archive would be accessed through a portal that provides the ability to replay archives from specific dates and times and offers a variety of other features, including search and filter tools to find information.
2. A public-facing web archive
This is used in two instances. In the first, a large government-based organisation may want to provide public access to historical online information which has huge cultural significance. Such an archive may also be used for research purposes and created with long-term preservation in mind.
In the second instance, an organisation or business wants to retire large areas of its website, usually by placing them on a subdomain that exists as a form of online archive. This can also be considered a web continuity solution, as these old or 'redundant' pages and websites are moved into an archive which visitors can still view. Businesses do this to remove website bloat and improve the speed of their primary website while ensuring a continuous web experience when visitors access the archive portion of their website.
What are web crawlers?
A Web Crawler, sometimes called a 'spider' or 'spiderbot' and often shortened to crawler, is an internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing.
Web search engines and other applications use web crawling to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine which indexes the downloaded pages so users can search more efficiently. Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed.
The number of internet pages is extremely large, so even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today, relevant results are given almost instantly. Crawlers can validate hyperlinks and HTML code and they can also be used for web scraping.
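One common way a crawler shows "politeness" is by honouring a site's robots.txt file before fetching anything. The sketch below uses Python's standard-library `urllib.robotparser` to do that. The `plan_polite_crawl` helper and the `example-archiver` user-agent name are hypothetical, and `Crawl-delay` is a non-standard directive that not every site publishes or every crawler respects.

```python
from urllib.robotparser import RobotFileParser


def plan_polite_crawl(robots_txt, user_agent, candidate_urls):
    """Return (urls the site allows us to fetch, seconds to wait between requests).

    `robots_txt` is the text of the site's robots.txt file, already downloaded.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    allowed = [url for url in candidate_urls if parser.can_fetch(user_agent, url)]
    # crawl_delay() returns None when the site does not specify one.
    delay = parser.crawl_delay(user_agent) or 0
    return allowed, delay


robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

allowed, delay = plan_polite_crawl(
    robots_txt,
    "example-archiver",  # hypothetical user-agent name
    [
        "https://example.org/news/",
        "https://example.org/private/report.html",
    ],
)
```

With this robots.txt, only the `/news/` URL is returned as allowed, and the crawler would be expected to pause two seconds between requests.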
Examples of general-purpose and open-source crawlers
The following is a non-exhaustive list of crawler architectures and open source crawlers currently available:
- Googlebot: Googlebot is the web crawler software used by Google, which collects documents from the web to build a searchable index for the Google Search engine.
- WebCrawler: WebCrawler is a web search engine, and is the oldest surviving search engine on the web today. For many years, it operated as a metasearch engine. WebCrawler was the first web search engine to provide full text search.
- Heritrix: Heritrix is an archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was developed by the Internet Archive and the Nordic national libraries, and its first official release was in January 2004.
- Apache Nutch: Apache Nutch is a highly extensible and scalable open source web crawler software project. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
Defining a web archive
Simply put, a web archive can be defined by an archiving format known as 'WARC'. This is a specified method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format that has traditionally been used to store web crawls as sequences of content blocks harvested from the World Wide Web.
WARC is recognised as the industry standard to follow for web archival and supports the harvesting, access, and exchange needs required by organisations and businesses. Besides the visual content that's recorded, secondary content is captured, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.
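To give a feel for what a WARC file contains, the sketch below assembles a single, simplified 'response' record by hand: a version line, a handful of named header fields, a blank line, the captured HTTP response, and a two-line terminator. The function name `build_warc_response_record` is made up for this example, and it includes only a few of the fields the standard defines. In practice you would normally rely on an established crawler or WARC library rather than writing records yourself.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri, http_response, captured_at):
    """Build one simplified WARC 'response' record as bytes.

    `http_response` is the raw HTTP response (status line, headers, body)
    exactly as received. `captured_at` must be a timezone-aware datetime.
    """
    warc_date = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", warc_date),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    header_block = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A blank line separates the header from the content block, and two
    # CRLF pairs terminate the record.
    return header_block.encode("utf-8") + b"\r\n" + http_response + b"\r\n\r\n"


payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>"
record = build_warc_response_record(
    "https://example.org/",
    payload,
    datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```

Because each record carries its own target URI and capture date alongside the untouched HTTP response, a replay tool can later serve the page back as it was delivered at that moment.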
Screen captures are usually recorded as images such as JPEGs or PNGs rather than in the WARC format. With a WARC, the timestamp and digital signature make it easy to establish the authenticity of a web record. WARCs are also immutable, meaning they are unalterable; this is a key distinction, and together these properties make them suitable for legal admissions. They can also be replayed through any technology which follows the industry standard, meaning your archives can be transferred from one archive provider to another if necessary.
We'll cover the different ways you can go about archiving your website later in this guide, including the pros and cons of each approach.
Misconceptions around Web Archiving
Why screenshots aren't classified as archives
Individuals and small businesses may use a variety of tools to capture records of their web content; for example, they might take screen captures periodically and store these somewhere. Whilst these records act as a visual representation of a web page, they are not classified as a 'web archive'.
An archive requires technology that can initiate a crawl and then collect all of the digital assets required to visually represent that website or page exactly as it appeared. Once completed, this web archive would provide a fully functioning representation of your webpage/website that you can navigate through, whereas a screenshot is a static image which cannot behave in the same way as a website.
A web archiving platform will also store and index all of your archives, making them available for search so users can find specific content from a specific date, giving them full confidence of what was published on the web at that time.
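As a rough illustration of how an index lets users find what was published on a specific date, the sketch below keeps a sorted list of captures per URL and looks up the most recent capture at or before a requested moment. The `CaptureIndex` class and its method names are hypothetical; a production platform would typically use a dedicated index or search engine rather than an in-memory dictionary.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime, timezone


class CaptureIndex:
    """Maps each URL to its captures, sorted by capture time."""

    def __init__(self):
        self._by_url = defaultdict(list)  # url -> [(captured_at, record_id), ...]

    def add(self, url, captured_at, record_id):
        entries = self._by_url[url]
        entries.append((captured_at, record_id))
        entries.sort()

    def latest_at_or_before(self, url, moment):
        """Return (captured_at, record_id) for the newest capture that is
        not later than `moment`, or None if there is no such capture."""
        entries = self._by_url.get(url, [])
        timestamps = [captured_at for captured_at, _ in entries]
        position = bisect_right(timestamps, moment)
        return entries[position - 1] if position else None


index = CaptureIndex()
index.add("https://example.org/terms", datetime(2021, 1, 1, tzinfo=timezone.utc), "rec-1")
index.add("https://example.org/terms", datetime(2021, 6, 1, tzinfo=timezone.utc), "rec-2")

# Which version of the terms page was live on 15 March 2021?
hit = index.latest_at_or_before(
    "https://example.org/terms", datetime(2021, 3, 15, tzinfo=timezone.utc)
)
```

Here `hit` refers to the January capture, because the June capture had not yet been taken on the requested date.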
The best-known public website archive is 'the Wayback Machine', which was created by The Internet Archive, an American digital library with the stated mission of 'universal access to all knowledge'.
Due to the increasing need to archive, there has been a large growth in commercial web archiving solutions. For example, regulated firms across the financial services space must capture accurate records of their web channels to comply with a set of record-keeping and financial promotions regulations. Universities and higher education organisations archive their websites to adhere to the requirements set forth by the Competition and Markets Authority (CMA), and governments are mandated to archive all central government websites.
Because the requirement to archive is tied to regulation, you'll often find businesses and organisations source a commercial web archiving solution to ensure they remain compliant and follow the standards.
With this in mind, it's easy to understand how taking manual screenshots of website pages can be fraught with problems. Websites are updated multiple times daily, meaning the process of capturing every page across the entire year (whilst changes are happening) is manual, time-consuming and impossible to maintain. The output itself is another challenge: screenshots are susceptible to image manipulation, which means they cannot be used for legal admissions.
Customer-led businesses and iconic brands also trust in commercial web archiving solutions. This is to ensure they can access legal records of what was communicated to customers on their website at a specific point in time. In essence, this becomes their very own 'digital truth and proof'. For many brands, this technology is used to ensure their digital legacy isn't lost as their brand evolves and changes. Pernod Ricard is a great example of a brand that is currently archiving for this very reason.
Why a website back-up isn't an archive
Back-up copies of websites do not always result in viable web archives, especially where websites use active scripts. In that case, the back-up copies would contain just the programming code, and they are neither harvested from the web nor time-stamped. A time-stamp is a computer-readable date and time that the crawler applies to each file it harvests. This ensures that the archived website is a viable representation of the website at the time the website was archived. For websites which use only flat HTML, back-up copies are acceptable where they include dates of creation and changes within the back-up files.
As we've briefly touched upon, archiving websites can also give organisations the chance to provide access to legacy information that they may not necessarily want to keep on their ‘live’ website. These can be retired to a form of 'public archive' so that the organisation's main website is free from bloat. Two great examples of this include the UK Parliament Web Archive and The National Archives Web Archive, which were created using MirrorWeb technology.
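The idea of time-stamping each harvested file can be sketched in a few lines of Python. In this illustrative example (the function names `stamp_capture` and `is_unaltered` are made up), each captured file is recorded with a UTC capture time and a SHA-256 digest of its bytes, so that any later change to the stored content can be detected. Note that a digest on its own is not a digital signature; signing the stamp is outside the scope of this sketch.

```python
import hashlib
from datetime import datetime, timezone


def stamp_capture(url, body, captured_at=None):
    """Record when `body` (bytes) was harvested from `url`, plus its digest."""
    captured_at = captured_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "captured_at": captured_at.isoformat(),
        "sha256": hashlib.sha256(body).hexdigest(),
    }


def is_unaltered(stamp, body):
    """True if `body` still matches the digest recorded at capture time."""
    return hashlib.sha256(body).hexdigest() == stamp["sha256"]


original = b"<html><body>Fund fees: 0.5%</body></html>"
stamp = stamp_capture(
    "https://example.org/fund",
    original,
    datetime(2022, 5, 1, 9, 30, tzinfo=timezone.utc),
)
```

Checking `is_unaltered(stamp, original)` returns True, whereas the same check against an edited copy of the page returns False.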
Evidence from the Web Continuity initiative at The National Archives shows a significant and ongoing user demand for access to older content that an organisation may consider out of date or unimportant. The National Archives know all about this as holders of an archive that contains over 5,000 websites from 1996 to the present, including tweets and videos from government social media accounts. The data footprint of the archive is over 120TB – all of which needed to be moved to the cloud as a public facing archive.
Why is website archiving important?
Businesses, governments and organisations create websites as part of their communication with the public because they are a powerful tool for marketing and sharing information. A website exhibits the brand, values and persona of a business. Websites document the public character of organisations and their interaction with their audiences and customers. In addition, the web has become the primary place where we search for and obtain information. Because of this, a website is considered a crucial public record of a business, organisation or individual at a specific point in time.
Because websites provide free access to information, they're regularly updated and constantly changing. This is one of the web’s great strengths: information is published and removed as required. However, this also means that the information published can disappear as quickly as it appeared.
The 'cyber cemetery' grows every day, and on an ever larger scale. Sites such as MySpace, Bebo, AOL, GeoCities, Photobucket, Napster and many more are now considered a thing of the past that nobody can revisit. This is evidence of just how much has already passed through the web and disappeared.
So, if you’re a global asset manager with multiple websites (along with intermediaries), it’s easy to see how obtaining accurate web records is a challenge when content is being published, changed and removed day-to-day.
The history of the web
Much of the early web and the information it once held has now disappeared forever; from early online content in the early 1990s to around 1997, very little web information survived. This was before the recognition of the ongoing value of legacy information published online, and before the first web archiving activities which began in 1996.
Since the 1990s, as well as becoming culturally significant, the web has become significant as a hub of information. As a result, the web has become integrated into other activities, such as research, referencing and quotation. These activities, which used to rely on physical records, now increasingly use and link to pages and documents held on websites. Therefore, web archiving is a vital process to ensure that people and organisations can access and re-use knowledge in the long term, and meet their needs for retrieving information.
Web archives should be harvested in their original form and be capable of being delivered as they were on the live web, providing a record of web content as it was available at a specific date and time. When a website is archived, the context of the information it provides is maintained, meaning that users can view the information in the context in which it was originally presented.
For Financial Services and Insurance
After the 2008 financial crisis, the financial services industry was shaken up to protect consumer interests and improve transparency. As a result, regulated firms must operate under a set of strict regulatory rules (which continue to evolve and change), some of which tie directly into web archiving. Regulators across the world such as FINRA (Financial Industry Regulatory Authority), SEC (U.S. Securities and Exchange Commission), ESMA (European Securities and Markets Authority) and the FCA (Financial Conduct Authority) all require firms to capture accurate web records as a result of record-keeping and financial promotions regulation.
To give an example, the FCA (Financial Conduct Authority) defines the internet as a vehicle for marketing financial promotions (which have strict record-keeping rules) found in PERG 8.22 The Internet of the FCA Handbook:
The Internet is a unique medium for communicating financial promotions as it provides easy access to a very wide audience. At the same time, it provides very little control over who is able to access the financial promotion.
The test for whether the contents of a particular website may or may not involve a financial promotion is no different to any other medium. If a website or part of a website, operated or maintained in the course of business, invites or induces a person to engage in investment activity or to engage in claims management activity, it will be a financial promotion. The FCA takes the view that the person who caused the website to be created will be a communicator.
A record of every financial promotion must be retained and available in the event of a customer complaint or regulatory investigation. This is not the only requirement set forth by the FCA either. Within CONC 3.3.1, firms are required to be able to evidence what was published on a webpage at a specific point in time.
FCA COBS 4.2.1 states that a firm must ensure that a communication or a financial promotion is fair, clear and not misleading. Without a legally admissible website record or archive, these online promotions are at risk of non-compliance.
As mentioned, many of the same record-keeping and regulatory requirements are also imposed by regulators across the world. In the US, the SEC and FINRA regulate the market, and across Europe you'll find the European Council and ESMA.
Many investment banks, asset managers and large financial services firms communicate through hundreds of websites (due to third-party brokers or intermediaries), with all of them actively promoting their services and products. With every firm publishing new articles and promotions daily, along with real-time changes to funds information and sensitive documents such as policies or terms, it's easy to see how this has quickly become a compliance nightmare for firms. How do you capture accurate and legally admissible records of websites that are constantly changing and publishing a multitude of content every day?
Whilst there are no free or open source tools that can solve this requirement, there are commercial web archiving solutions available to answer these compliance requirements (the MirrorWeb Platform being one of them).
For Brands
The world's leading brands are now creating and publishing a huge amount of online content in addition to traditional brand assets, such as printed ads. This has led the drive for brand archiving to not only preserve the brand legacy but also to capture accurate records of what was communicated to customers at a specific point in time. Brands often utilise archives in other ways too, for example, a searchable archive that contains these digital records can be used to inspire the next generation of marketers, allowing them to revisit a piece of their digital heritage.
Because a web archive (WARC) file is immutable, the records can also be used as a form of evidence (in the case of legal admissions) for when a product, patent, or brand message was marketed across their web channels. This helps protect the brand from any potential damage or risk in relation to their intellectual property.
Finally, for many brands there's untapped potential to draw insights from the data held across their web channels. This unstructured data can be diced up and analysed to uncover historical trends or to perform a keyword analysis. For example, which keywords or topics was the brand most vocal about in the content it published between 2016 and 2017?
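A keyword analysis of this kind can be sketched very simply, assuming the text of each archived page has already been extracted along with its capture date. The `top_keywords` function, the tiny stop-word list and the sample captures below are all illustrative assumptions; a real analysis would use a fuller stop-word list and probably a proper text-analysis library.

```python
import re
from collections import Counter
from datetime import date

# A deliberately tiny stop-word list, for illustration only.
STOPWORDS = {"the", "and", "our", "for", "with", "that", "this"}


def top_keywords(captures, start, end, n=5):
    """Count the most frequent words in captures dated between start and end.

    `captures` is an iterable of (capture_date, page_text) pairs.
    """
    counts = Counter()
    for captured_on, text in captures:
        if start <= captured_on <= end:
            words = re.findall(r"[a-z']+", text.lower())
            counts.update(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(n)


captures = [
    (date(2016, 3, 1), "Sustainable investing and sustainable funds"),
    (date(2017, 6, 1), "Our sustainable funds outlook"),
    (date(2019, 1, 1), "Crypto crypto crypto crypto"),
]

keywords = top_keywords(captures, date(2016, 1, 1), date(2017, 12, 31), n=2)
# keywords == [("sustainable", 3), ("funds", 2)]
```

The 2019 capture falls outside the requested date range, so its content does not influence the result.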
For Public Sector
Numerous national archives, libraries, governments and universities archive website data to preserve all records of cultural and historical significance. This need is also driven by legislation such as the UK Public Records Act 1958 and, more recently, the Freedom of Information Act 2000.
As the public sector invests in and utilises more digital channels, organisations are looking for ways to evolve their website archiving capabilities, taking advantage of:
- Cloud-based archiving: To allow for more efficient and flexible storage of the large data sets and deploy technology that's future-proof.
- Indexing and search capabilities: To make data useful to researchers, civil servants, students and members of the public (including public-facing portals such as The UK National Archives).
- The ISO standard WARC file format: An official standard which helps organisations store born-digital or digitised materials.
How to archive your website
Whilst there are a variety of archiving vendors in the commercial space, none truly specialise in web archiving. There are multiple reasons for this. For example, many vendors are focused on capturing other forms of communication such as email or SMS, and as a result their business and archive capabilities are built around that technology.
Web archiving is much more complex due to the ephemeral nature of the web.
It's constantly evolving and changing and this means the technology developed around it needs to also evolve. Due to this challenge, many vendors have created inefficient solutions, resulting in web records that are broken, inaccurate or unusable.
The ability to archive websites accurately and at scale has been central to MirrorWeb's mission since day one, and it's the reason we're trusted by the UK Government for web archiving. Our platform was built to automate the entire process for marketing and compliance teams.
The MirrorWeb Platform is built to capture and archive your website and social channels, no matter the size, no matter the complexity. Through a unified combination of bulletproof crawl technology and an in-house auto QA system, we create the most accurate web archives available.
For businesses utilising location-based instances of their website, each instance can also be captured from the required local jurisdiction, ensuring accuracy of records, data sovereignty and the ability to retrieve exactly what was seen by the customer at a specific point in time. Once a web channel has been crawled, it's indexed and instantly available, ensuring you're audit-ready at all times.
You own your data, control data sovereignty and enjoy the benefits of a cloud-based SaaS platform, meaning there's no software to configure or install.
MirrorWeb has a proven track record in helping clients across a range of sectors, including financial services, the public sector and even brand archivists, to meet their requirements in capturing immutable web records of their organisation’s websites.
Our solution is able to improve an organisation’s operational efficiency and improve compliance with features that include:
- Complete archives - We archive all website content, including digital assets from internal and external sources, images, video and metadata.
- Cloud-native solution - MirrorWeb are partnered with AWS to deliver a turnkey, scalable and future-proof solution in a fully secure AWS S3 environment.
- Captured in original format - Every archive is captured in real-time, as it was on the day it was published.
- Full text search - With Elasticsearch technology, all archives are indexed and searchable.
- Sophisticated user portal - Through the platform, users are able to replay content from the day it was archived and manage archive crawl frequency and parameters. Crawl reports are also available along with a breakdown of MIME types. Users also have the ability to create access privileges through groups and policies through our public portal.
Where there is a need for public access to archives, specifically within government and national archives, we’ve developed a proprietary portal that integrates with the user portal to provide access for sharing records of cultural and historic significance with the general public.
- Meet compliance requirements - All archived records are stored in the ISO-standard WARC format, including date and timestamps. This means firms always have on-demand access to their ‘digital truth’.
- Data sovereignty - Through utilising cloud technology, archives are stored in local territories, with ISO 9001 and ISO 27001 certification and GDPR compliance.
To take a look at the platform, simply get in touch below and one of the team will be in touch shortly!