The National Archives faces challenges converting the EU's enormous library of laws into a publicly accessible UK archive ahead of Brexit. The Archives’ digital director, John Sheridan, explains how...
Almost as soon as the EU referendum result was known, John Sheridan (Digital Director, The UK National Archives) could see that transforming the way the UK publishes legislation after Brexit was going to be a big job.
John bears the statutory responsibility for archiving and publishing UK law. That work has two prongs: publishing all current UK law on the government’s public site, legislation.gov.uk, and incorporating all EU law into its own historical archives for future generations to consult.
Both present technical challenges. The UK’s departure from Europe brings a sudden and considerable expansion of the corpus of both current and archived law. On top of that, John Sheridan has to work out how to import all the necessary data from Europe’s law archive, EUR-Lex, and convert it to the formats used by the UK.
The Archives has managed large conversion projects before, but John says: “Even for us, EUR-Lex has been a challenge because of the sheer size of the website and because it’s the first thing we’ve ever done that is inherently multilingual.”
Under the Withdrawal Act, at 11pm on 29 March 2019, all EU legislation will be repealed and will cease to have any legitimacy in the UK. At the same moment, the corpus of EU law will be imported into UK law, and the act formally requires the Archives to publish it in its role as the Queen's Printer.
Over time, the two bodies of law will diverge as the European parliament modifies and extends the EU’s laws and the UK parliament amends and supersedes the legislation it is inheriting. So simply linking from Legislation.gov.uk, web style, to texts on EUR-Lex is not an option.
Three difficult problems faced John and his team of three at the Archives, augmented by three or four more at the UK-based specialist archiving company MirrorWeb, which won the bid to operate the web archive starting in 2017.
Overcoming the challenges ahead
The first problem was uncertainty: when they began work within months of the referendum, the shape of the eventual withdrawal legislation was unknown, so whatever the Archives built had to be flexible enough for various scenarios.
Second was understanding somebody else’s data and working out how to convert complex data in a difficult domain into a format that can underpin good decisions.
The third problem was carrying out the actual conversion, because EUR-Lex codes its documents using one variant of legal XML (Formex, developed by the European Commission), and the Archives uses another (Crown Legislation Markup Language, or CLML).
John has previous experience in large-scale conversion projects: he has overseen the legislation database’s move from an early 1990s system built in the programming language LISP to web precursor SGML (Standard Generalised Markup Language), and then to HTML, XML, and finally CLML.
Still, it took nine months to establish that Formex and CLML are compatible enough for translation. Small differences in drafting styles matter when these documents are represented as data.
For example, European legislation uses more arbitrary sub-dividers, more complex annexes, and much more scattered annotation, footnotes and boxes. This print-oriented design poses a persistent challenge for markup languages, says John, “and that’s definitely true of European legislation” – even though the EU began designating the signed electronic version, rather than print, as the definitive version in 2014.
Even without these variations, legislation is “difficult content to work with”, says John, adding that legislation.gov.uk “combines both native XML technologies for documents with RDF (linked data) for rich metadata about the documents, including all the information about the amendments”.
The site’s technical architect was Jeni Tennison, now CEO of the Open Data Institute.
In total, the Archives will bring across about 150,000 pieces of legislation. The capture includes all the legislation content in the Official Journal of the European Union and case law from the European Court of Justice, in English, French and German, plus all the underlying data, metadata and index pages: tens of millions of documents in total.
To cover this effort, the Treasury allocated an extra £1.2m this year; last year’s work cost about £465,000.
The Withdrawal Act has made at least one aspect of the job easier – for the first time in British history, the law puts the internet at the centre. The requirement laid on the Archives is to “publish” the legislation, not to print it.
One significant change in the conversion is that where EUR-Lex assigns a unique identifier to each legal document – known as a Celex Number – CLML gives a unique identifier to each legislative article.
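To make the difference in granularity concrete, here is a minimal sketch that splits a Celex number into its parts and then derives an identifier for a single article. The pattern covers only the common layout for legislation (sector, four-digit year, document type, sequence number) and will reject other variants. The `article_id` format is purely hypothetical and is not the scheme CLML or Legislation.gov.uk uses.

```python
import re
from typing import NamedTuple

# Common Celex layout: sector character, 4-digit year, type letter(s), 4-digit number.
# Simplified on purpose: it does not cover every Celex variant.
CELEX_RE = re.compile(
    r"^(?P<sector>[0-9A-Z])(?P<year>\d{4})(?P<doctype>[A-Z]{1,2})(?P<number>\d{4})$"
)


class Celex(NamedTuple):
    sector: str
    year: int
    doctype: str
    number: int


def parse_celex(value: str) -> Celex:
    """Split a document-level Celex number into its components."""
    match = CELEX_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a recognised Celex number: {value!r}")
    return Celex(
        match["sector"], int(match["year"]), match["doctype"], int(match["number"])
    )


def article_id(celex: str, article: int) -> str:
    """Hypothetical article-level identifier built on top of a Celex number."""
    parse_celex(celex)  # validate the document-level part first
    return f"{celex.strip()}/article/{article}"
```

For example, `parse_celex("32016R0679")` yields sector `3`, year `2016`, type `R` and number `679`, while `article_id("32016R0679", 5)` points at one article inside that document.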
John’s group found that the usual method of gathering a website’s content – the Heritrix crawler the Internet Archive developed for its Wayback Machine – was not suitable for EUR-Lex, which is search-based, rather than browse-based.
It proved easier to compile and verify a giant list of the documents’ Celex numbers by identifying URL patterns to harvest, using tools provided by the EUR-Lex web services and the EU’s open data portal SPARQL access point.
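As a rough sketch of that list-driven approach, the snippet below cleans a list of Celex numbers and expands it into one URL per language. The URL template mirrors the style of public EUR-Lex document addresses, but treating it as the pattern the project actually harvested is an assumption, as are the function names and the choice of languages.

```python
from typing import Iterable, List, Sequence

# Languages mentioned in the article; the real harvest may have used others.
LANGUAGES = ("EN", "FR", "DE")

# Assumed URL pattern, modelled on public EUR-Lex document addresses.
URL_TEMPLATE = "https://eur-lex.europa.eu/legal-content/{lang}/TXT/?uri=CELEX:{celex}"


def dedupe(celex_numbers: Iterable[str]) -> List[str]:
    """Normalise Celex numbers and drop blanks and duplicates, keeping order."""
    seen = set()
    result = []
    for raw in celex_numbers:
        celex = raw.strip().upper()
        if celex and celex not in seen:
            seen.add(celex)
            result.append(celex)
    return result


def harvest_urls(
    celex_numbers: Iterable[str], languages: Sequence[str] = LANGUAGES
) -> List[str]:
    """Expand each unique Celex number into one URL per language."""
    return [
        URL_TEMPLATE.format(lang=lang, celex=celex)
        for celex in dedupe(celex_numbers)
        for lang in languages
    ]
```

Working from an explicit list like this also makes verification simpler: the number of documents captured can be checked against the number of URLs generated.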
The power of the cloud for scalability
Both the web archive and Legislation.gov.uk are hosted in the cloud, giving the Archives flexibility to handle the expansion without requiring new hardware. Similarly, John’s group has been able to adapt its existing software tools.
The group also worked with Washington, DC-based legal software specialist Juris Datum to write a sophisticated set of transformation routines to convert Formex into CLML.
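The real transformation routines are not described in detail, so the following is only a toy illustration of what an XML-to-XML conversion of this kind involves. The element names are simplified stand-ins loosely modelled on Formex and CLML, not the actual schemas, and the mapping rules are assumptions.

```python
import xml.etree.ElementTree as ET


def convert_article(source_xml: str) -> str:
    """Map a simplified Formex-like article onto a simplified CLML-like one."""
    src = ET.fromstring(source_xml)
    group = ET.Element("P1group")

    # Carry the article heading across as the group title.
    heading = next(src.iter("TI.ART"), None)
    title = ET.SubElement(group, "Title")
    title.text = "".join(heading.itertext()).strip() if heading is not None else ""

    # Each source paragraph becomes a numbered provision in the target.
    for number, parag in enumerate(src.iter("PARAG"), start=1):
        provision = ET.SubElement(group, "P1")
        ET.SubElement(provision, "Pnumber").text = str(number)
        # Flatten inline markup and collapse runs of whitespace.
        text = " ".join("".join(parag.itertext()).split())
        ET.SubElement(provision, "Text").text = text

    return ET.tostring(group, encoding="unicode")
```

Even this toy version shows where drafting style starts to matter: the converter has to decide how to number provisions and what to do with inline markup, and real documents add annexes, footnotes and boxes on top.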
By mid-August 2018, well over 99% of the 150,000 pieces of legislation had been successfully converted for publication on Legislation.gov.uk. John expects to complete the first full capture of the EUR-Lex content by the end of August, including performing both automatic and manual checks and patching any gaps or issues they find.
After that, the project will perform daily incremental captures until Exit day, to ensure that a complete snapshot is ready.
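The article does not say how the project decides what has changed from one day to the next. One common approach, sketched here purely as an assumption, is to keep a content fingerprint per Celex number and compare yesterday's snapshot with today's.

```python
import hashlib
from typing import Dict, List, NamedTuple


def fingerprint(content: bytes) -> str:
    """Content hash used to detect whether a document has changed."""
    return hashlib.sha256(content).hexdigest()


class Delta(NamedTuple):
    added: List[str]
    changed: List[str]
    removed: List[str]


def diff_snapshots(previous: Dict[str, str], current: Dict[str, str]) -> Delta:
    """Compare two {celex_number: fingerprint} snapshots."""
    added = sorted(k for k in current if k not in previous)
    changed = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    removed = sorted(k for k in previous if k not in current)
    return Delta(added, changed, removed)
```

Only the documents listed under `added` and `changed` would need to be fetched and converted again, which keeps a daily run small compared with the first full capture.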
One of John's most important goals is that it should always be possible to look at any law and trace the process that brought it into being and understand what underpins its legitimacy (see box above).
Therefore, he says, each piece of converted CLML data on Legislation.gov.uk will link back to the archived Formex data from the European law archive. Together with publicly available logs of the harvesting work, the link “will give a complete picture of how this data became part of the UK’s legislation database”, he says.
Provenance will be represented using a specification, PROV-O, developed by the World Wide Web Consortium, says John. “We see this project as an opportunity to show provenance being done really well on the web,” he adds. “The provenance really matters here, as this is law that we are publishing.”
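As an illustration of what such a record might look like, the sketch below writes a small PROV-O description in Turtle linking a converted CLML document back to its archived Formex source. The `prov:` terms are standard PROV-O vocabulary; every URI passed in, and the function itself, are hypothetical and do not reflect how the Archives actually models its data.

```python
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> ."
)


def provenance_turtle(
    clml_uri: str, formex_uri: str, activity_uri: str, ended_at: str
) -> str:
    """Describe a converted document, its source and the conversion run."""
    return "\n".join(
        [
            PREFIXES,
            "",
            # The published CLML document and where it came from.
            f"<{clml_uri}> a prov:Entity ;",
            f"    prov:wasDerivedFrom <{formex_uri}> ;",
            f"    prov:wasGeneratedBy <{activity_uri}> .",
            "",
            # The archived Formex source.
            f"<{formex_uri}> a prov:Entity .",
            "",
            # The conversion run that produced the CLML.
            f"<{activity_uri}> a prov:Activity ;",
            f"    prov:used <{formex_uri}> ;",
            f'    prov:endedAtTime "{ended_at}"^^xsd:dateTime .',
        ]
    )
```

Calling it with placeholder addresses such as `https://example.org/clml/1` and `https://example.org/formex/1` produces a record a reader could follow from the published law back to the harvested source and the run that converted it.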