Open Data

An account by Conrad Taylor of the March 2019 meeting of the Network for Information and Knowledge Exchange.

Speaker — David Penfold is a veteran of the world of electronic publishing, and participates in ISO committees on standards for graphics technology. He has been a lecturer at the University of the Arts, London (the part formerly known as the London College of Printing), and currently teaches Information Management in a publishing context.

Defining ‘data’ and ‘openness’

There are two aspects to Open Data requiring definition: the ‘data’ part and the ‘openness’ part. To start with data, David presented us with a quotation from T.S. Eliot’s Choruses from ‘The Rock’:

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

How might we interpret these words? Perhaps we no longer concern ourselves with wisdom because we are too worried about amassing knowledge; perhaps what gets in the way of acquiring knowledge is a worry about dealing with too much information. This might cause us to think of the so-called ‘DIKW’ model, which places Data as the lowest and most basic level of a pyramid, with Information as the next step up, Knowledge as deriving in turn from the processing of Information, and Wisdom at the pinnacle. David suggested that the most important thing for us to recognise is that Data is the foundation and validation of Information.

Waypoints in a history of data

David offered some ‘checkpoints’ in the history of data. One could be during the Siege of Plataea, in the opening years of the 27-year Peloponnesian War. In 429 BCE the Spartan army laid siege to Plataea. Unable to break through the Plataean defences, the Spartans built an encircling wall to bottle up the defenders instead. By 427 BCE, the Plataeans were desperate. About half of their force decided to try to break out through the Spartan wall. To achieve this quickly and with an element of surprise, they needed to prepare ladders with which to scale the imprisoning wall. But how tall should these ladders be?

Thucydides tells us that the Plataeans counted the layers of brick of which the containing wall was constructed. The counting was done in parallel by many soldiers, and the average (more strictly, the mode) of these estimates was multiplied by the height of one brick course. Thus they estimated the height of the walls, which decided the length of the scaling ladders. (Of the 220 who attempted the breakout, all but 15 made it. As for those who stayed behind, when Plataea surrendered, the Spartans summarily slaughtered them.)
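The Plataeans’ procedure is easy to restate as a few lines of code. Here is a minimal sketch in Python; the brick counts and the course height are invented for illustration, since Thucydides records neither:

```python
from statistics import mode

# Each soldier's independent count of the brick courses in the Spartan
# wall. These figures are invented for illustration.
counts = [42, 43, 42, 41, 42, 44, 42, 43, 42]

# Height of one brick course, in metres (again, an illustrative figure).
course_height_m = 0.09

# Take the most frequent count as the true number of courses, on the
# assumption that most soldiers counted correctly.
estimated_courses = mode(counts)

ladder_length_m = estimated_courses * course_height_m
print(f"Estimated wall height: {ladder_length_m:.2f} m")
```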

This incident, said David, was one of the first recorded uses of statistical method to solve a practical problem; numerical data as such had already been used for thousands of years in accountancy, land division, inventory management, taxation and the like.

Another waypoint: in the 13th century, the Franciscan friar Roger Bacon produced his Opus Majus, sent to Pope Clement IV in 1267. Bacon asserted that science needed to be based on data, on experimental results. In this he was following the lead of Aristotle, Latin translations of whose works were then percolating into Europe via Muslim scholars.

Some centuries later, Johannes Kepler developed his laws of planetary motion, deriving them from Tycho Brahe’s painstaking measurements of the positions of many stars and the planets. Closer to the present day, quantum theory, relativity and much besides all developed because the data people were measuring didn’t fit the predictions of earlier theoretical frameworks. A principle of experimental science is that if the data from your experiments don’t fit the predictions of your theories, it is the theories which must be revisited and reformulated.

Classification and relationships in data

In parallel with these developments, people had been looking at data classification and organisation. David illustrated this trend with the ‘Tree of Porphyry’, Carl Linnaeus’ classification of plant and animal species, and Roget’s Thesaurus. Leibniz, Kant, Whitehead and Wittgenstein all produced theories about how data and information should be handled. Peirce’s existential graphs (see https://en.wikipedia.org/wiki/Existential_graph) are surprisingly relevant now in an era of Linked Data and the Resource Description Framework.

David expanded on some of these classificatory approaches. To Charles Sanders Peirce we can attribute the origin of what we might call ‘the triple’, central to RDF. In a triple you have an entity, plus a concept or property, plus a value. This three-element method of defining things is essential to the implementation of Linked Data.

Unless you can establish relationships between data elements, they remain meaningless, just bare words or numbers. A number of methods have been used to associate data elements with each other and with meaning: the Relational Database model is one such. Spreadsheets are based on another model, and the Standard Generalized Markup Language (and subsequently XML) was an approach to giving structure to textual materials. Finally, the Semantic Web and the Resource Description Framework have developed over the last two decades.
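To make the contrast between these models concrete, here is a minimal sketch in Python showing the same fact held first as a relational row and then as a bare triple; the table name, columns and query are invented for illustration:

```python
import sqlite3

# Relational model: meaning comes from the table schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE band_member (person TEXT, band TEXT)")
db.execute("INSERT INTO band_member VALUES ('John Lennon', 'The Beatles')")
row = db.execute(
    "SELECT band FROM band_member WHERE person = 'John Lennon'"
).fetchone()
print(row[0])  # -> The Beatles

# Triple model: meaning travels with each individual statement.
triples = [("John Lennon", "was a member of", "The Beatles")]
for subject, prop, value in triples:
    print(subject, prop, value)
```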

Open-ness of data

Moving on to what it means for data to be ‘open’: David referred to the words ‘open sesame!’ from the tale of Ali Baba and the Forty Thieves. This magic phrase opened a cave full of resources to the protagonist. Maybe that is a good mnemonic for remembering what Open Data is about.

There are, however, various misconceptions about what Open Data means. It doesn’t mean Open Access, a term used within librarianship and publishing for free-of-charge access, mainly to academic journals and books. That issue has been rumbling on for thirty years or more, campaigned on by university librarians struggling, on limited subscription budgets, to provide academics at their institutions with access to reading materials. Recently, the University of California declared that it would no longer take journals from Elsevier, because Elsevier would not agree to its terms on open access.

We are also not talking about Open Archiving, which has a close relationship to the Open Access concept. Much of the effort in Open Archiving goes into developing standardised metadata so that archives can be shared.

In a way, all publications are collections of data, but they are not what we generally mean by Open Data, which is characterised as follows:

  • It is freely available.
  • It is often from government, but could be from other bodies and networks, and even private companies.

Nigel Shadbolt’s take

David then showed us a short excerpt from a video of a presentation made in 2012 by Sir Nigel Shadbolt, a founder of the Open Data Institute, who, with Tim Berners-Lee and at the request of Prime Minister Gordon Brown, set up the open data portals for the UK government.

In this video, Nigel explained how government publication of open data, in the interests of transparency, is now found in many countries, at national, regional and local level. Later in the video he talked about the benefits: improved accountability, better public services, improvement in [public] participation, improved efficiency, creation of social value, and innovation value to companies.

Examples of Open Data

Network Rail publishes open data, and benefits through improvements in customer satisfaction – for example, through alerts when there are problems on the line. Network Rail says its open data generates technology-related jobs around the rail sector, and saves costs in information provision when third parties invest in building information apps based on that data. It reckons that the boost to the overall economy from making its data public is worth £130m a year.

Who uses Network Rail open data feeds? Commercial users are a significant chunk, as are the rail industry and Network Rail itself. The data is accessed also by individuals, and by academia. Oddly, government use is minuscule.

Ordnance Survey open data is important within the economy and in governance. David demonstrated an application which he uses as Chair of the Parish Council in the East Sussex village of Pett. Through their District Council they have a licence to use the software he showed us, called Parish Online. Once logged in as an authenticated user, you first see the basic Ordnance Survey view of the area, to which optional layers of geospatial Open Data can be applied. He showed what happens if you add the Historic England data layer, which reveals the route of the Royal Military Canal, built in the Napoleonic Wars period. There is also an Environment Agency layer showing sites of special scientific interest (SSSI) and an area of outstanding natural beauty (AONB).

Parish Online has different levels of login, granting different degrees of access. The Admin login allows all manner of facilities, including being able to add to the data. The Parish Council at Pett is minded, once they have added some extra local data, to open the most basic browse-only level of access to anyone in the parish.

Data in the Semantic Web

After a tea-break, David resumed by showing us three clips from a video of a presentation by Tim Berners-Lee. One clip showed what the late Hans Rosling of the Karolinska Institutet in Sweden did with global data about quality of health and other demographics; in another, Berners-Lee explained the role of Linked Data in the Semantic Web.

(To see a dramatic presentation recorded with special effects by Hans Rosling, see his ‘200 countries, 200 years in 4 minutes’ at https://www.youtube.com/watch?v=jbkSRLYSojo.)

David then elaborated on how the Semantic Web works. It’s based on four concepts: (a) metadata; (b) structural relationships; (c) tagging; (d) the Resource Description Framework (RDF) method of coding, which in turn is based on XML. The syntax of the coding is rule-based and, like Peirce’s graphs, gives us three parts: a resource, a property, and a value of that property. Examples would be ‘John Lennon [was a member of] The Beatles’ and ‘The British Dental Association [has as postcode] W1G 8YS’.

The formal, standard way of coding these triples recommended by RDF is what makes this information processable by computers. RDF uses URIs (Uniform Resource Identifiers), which may be URLs (‘Web addresses’), but need not be when they identify some other kind of resource – ISBNs for publications, for example.
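As a concrete illustration of such coding, here is a minimal sketch using the Python rdflib toolkit (my choice of library, not one mentioned in the talk); the example.org namespace and identifiers are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# An illustrative namespace; any URI scheme under your control would do.
EX = Namespace("http://example.org/terms/")

g = Graph()

# 'John Lennon [was a member of] The Beatles'
g.add((URIRef("http://example.org/person/JohnLennon"),
       EX.memberOf,
       URIRef("http://example.org/band/TheBeatles")))

# 'The British Dental Association [has as postcode] W1G 8YS'
g.add((URIRef("http://example.org/org/BritishDentalAssociation"),
       EX.postcode,
       Literal("W1G 8YS")))

# Serialise the triples in Turtle, one standard RDF notation
# (rdflib 6+ returns a string here).
print(g.serialize(format="turtle"))
```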

Open Data sources

Data.gov.uk is the UK Government’s open data web resource, with over 48,000 datasets. The US version, data.gov, has over 264,000 datasets, and many other countries have published datasets as well. Many applications have been built using these as a resource. As an example, David showed a video about a website that draws on Ofsted data about schools, and on Zoopla data about house prices, to help you find a suitably priced house in the catchment area of a desirable school.
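For anyone who wants to explore such portals programmatically: data.gov.uk is built on the CKAN platform, whose standard dataset-search API is sketched below in Python. The search term is just an example, and the endpoint path is CKAN’s usual one, assumed here to be current:

```python
import json
import urllib.request

# CKAN's standard dataset-search endpoint on data.gov.uk
# (assumed current; 'schools' is just an example query).
url = "https://data.gov.uk/api/action/package_search?q=schools&rows=5"

with urllib.request.urlopen(url) as response:
    result = json.load(response)

print("Total matching datasets:", result["result"]["count"])
for dataset in result["result"]["results"]:
    print("-", dataset["title"])
```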

Two concerns about using Open Data are the ethics of how it is used, and the question of where the data comes from – how it was obtained, and how it was validated. When it comes to truly huge collections of data, what we call Big Data, it’s impossible for humans alone to analyse it: we need AI techniques to deal with it, and that raises its own ethical issues, which NetIKX had looked at in a previous seminar [link].

The Data Ethics Canvas

The Open Data Institute has developed a ‘Data Ethics Canvas’, and David wanted us to take a look at it and decide what we thought of it. It is essentially a list of fifteen issues that may be of ethical concern: (1) your data sources; (2) limitations of those data sources; (3) sharing the data with other organisations; (4) relevant legislation and policies; (5) rights over data sources; (6) existing ethical frameworks; (7) your reason for using the data; (8) communicating your purpose; (9) positive effects on people; (10) negative effects on people; (11) minimising negative impact; (12) engaging with people; (13) communicating risks and issues; (14) reviews and iterations; (15) your actions.

David asked us to discuss this Ethics Canvas in our table groups, and then called on people to report back.

Discussion

Conrad referred to the schools data website David had shown. He is involved with an initiative in Lambeth to see what can be done about knife crime, and one factor the group has identified is the competition between schools to shine in the Ofsted stats, such that they massage the figures by ‘off-rolling’ (permanently excluding) troublesome children and slow learners. That has a deleterious effect on those children’s lives, and makes it more likely they will be drawn into gangs. It could be said that data comparators of this sort can make such situations of inequality worse.

Stuart Ward had asked people around his table whether their organisations had any data they would be willing to put into the public realm. The answer was no, because the data is commercially valuable. And is there data those organisations could make use of? Again no, not if they have to pay for it. So how much does Open Data figure in the life of real organisations, and where might it have value? For commercial organisations, the answer is: where it can help them improve sales, or provide intelligence so they can adapt to their environment. Stuart thought there were no reasons for commercial organisations as compelling as the examples David had given from the public sector. Naomi Lee noted that some of the lead on Open Data is coming from academia, where the data being generated has been publicly funded.

A part of the discussion sought to clarify what standards there might be for Open Data. The answer – this is quite well known, and we have discussed it at other NetIKX seminars – is that all kinds of information releases would qualify as ‘open data’, including publishing an Annual Report as a PDF. However, there are gradations, with five levels identified by Tim Berners-Lee: publishing as an Excel file makes data more accessible and useful than publishing as PDF; publishing in a non-proprietary format such as .csv is better still; and best of all is transforming the data into the triples form so it can join the Linked Open Data cloud.
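As a toy illustration of that final step, here is a sketch in Python that turns rows of a CSV file into (subject, property, value) triples; the school names, columns and values are invented for illustration:

```python
import csv
import io

# Invented CSV data standing in for a published open dataset.
csv_text = """school,postcode
Example Primary,AB1 2CD
Sample High,EF3 4GH
"""

# Each row yields one triple per non-key column: the
# (subject, property, value) form needed for Linked Open Data.
triples = []
for row in csv.DictReader(io.StringIO(csv_text)):
    subject = row["school"]
    for prop, value in row.items():
        if prop != "school":
            triples.append((subject, prop, value))

for t in triples:
    print(t)
```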

Graham Robertson wondered about the relationship between data and IT. He remembered an event of the British Computer Society’s Data Management Specialist Group, addressed by Keith Gordon (September 2007; audio of that talk is available from http://www.conradiator.com/kidmm/kgordon-sept2007.html). Does that group still exist, Graham wondered? And does it deal with the IT behind data management?

It does, said Conrad (see https://www.bcs.org/category/17607). The DMSG is concerned primarily with the organisation of data, but there is a link between that and computing capability: Codd’s relational database model could not easily be implemented until random-access storage such as the hard disk made it possible for a machine to skip quickly around its data store in a way that tape storage could not.

We can’t say that RDF is equivalent to semantic technology. Semantic technologies such as generic markup (GenCode, GML, SGML etc) considerably pre-date RDF. We agreed that RDF is one layer of the Semantic Web model; there are other aspects we didn’t have time to go into, such as OWL (the Web Ontology Language).

A third table workgroup, which Melanie Harris described as ‘the government table’, had shared experiences of how government information is being made available. It seems, ironically, that while many government departments are good at publishing data that others can use, government isn’t so good at using its own data! And they thought that, with so much data being generated, it should be presented with ease of use and meaningfulness in mind, including clear and understandable classifications.

Leaking data – privacy vs utility?

Someone mused that, at the same time as there is some drift towards data openness and sharing, there is an even larger drift towards commercial monopolisation of data – often data about us – in private and potentially exploitative data environments (e.g. Facebook, Amazon). Whenever we agree to ‘log into [a third-party] site with Facebook’, we hand Facebook yet another chunk of our personal behaviour to analyse!

Various people noted that we ‘leak’ a lot of data. Using your Oyster card builds a rich dataset about your movements; so does carrying a mobile phone. TfL now proposes to build a picture of Underground service use by tracking the WiFi signals generated by people’s devices.

Transport for London is a leader in providing access to live data, which a variety of apps and devices can draw on. These days you can see where all the buses are, and when your bus will arrive at a local stop!
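For the curious, TfL publishes this live data through its unified API at api.tfl.gov.uk. The sketch below in Python assumes the publicly documented StopPoint arrivals endpoint; the stop ID shown is a made-up placeholder, not a real one:

```python
import json
import urllib.request

# Placeholder stop ID; real Naptan stop IDs can be looked up
# via the TfL API.
stop_id = "490000001A"

url = f"https://api.tfl.gov.uk/StopPoint/{stop_id}/Arrivals"

with urllib.request.urlopen(url) as response:
    arrivals = json.load(response)

# Each prediction carries the line name and seconds until arrival.
for a in sorted(arrivals, key=lambda a: a["timeToStation"]):
    print(a["lineName"], "in", a["timeToStation"] // 60, "min")
```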

This has a funny side: a friend of mine, Karen, who lectures in Computer Science at the University of Dundee, takes the bus to work each day from Coupar; its punctuality is rather erratic. But she always turns up at the bus stop one minute before the bus comes, because she tracks its progress on her mobile phone. Some weeks ago she was verbally abused by a lady who had noted this coincidence between Karen arriving at the bus stop and the bus arriving and, in a classic bit of post hoc ergo propter hoc reasoning, accused Karen of somehow making the bus late! Magical thinking – I suggested that if it happened again, she should threaten to turn the lady into a frog.

— Conrad Taylor, June 2019