Draft of 22 July 2016
Roger Clarke
© Xamax Consultancy Pty Ltd, 2016
Available under an AEShareNet licence or a Creative Commons licence.
This document is at http://www.rogerclarke.com/DV/DSEF.html
The unbounded enthusiasm for 'big data', 'big data analytics' and now 'data science' is giving rise to modest numbers of real-world applications, and larger numbers of experiments. Given the very considerable risks that such undertakings involve, the publication of a 'data science ethical framework' by the UK Cabinet Office is a timely contribution. Scrutiny of the document in light of available knowledge about the privacy and other risks inherent in big data activities unfortunately shows the framework to fall a very long way short of being effective or appropriate. Urgent revision and re-publication is essential if the document is to have any credibility with the public.
In May 2016, the UK Cabinet Office published a document entitled 'Data Science Ethical Framework' (UKCO 2016). Its declared intention is to "give civil servants guidance on conducting data science projects, and the confidence to innovate with data". It is unclear what process was adopted in developing the document, but the announcement included a request to "the public, experts, civil servants and other interested parties to help us perfect and iterate". (Remarkably, however, and in breach of the principles of open government, the document has been made available only in a form that precludes both extraction of quotations from it and text-search within it.)
The document has relevance not only for the UK civil service, but also in the UK private sector, and in other countries. It is therefore important to subject it to scrutiny. This paper evaluates the Framework, drawing on the considerable existing literature relevant to the matter.
The paper first outlines the document's origins and nature, and then considers the conceptions it embodies of 'data science' and of the negative impacts that may arise from it. The ethical component and the presentations of privacy impact assessments (PIAs) and of relevant laws are then examined. This lays the foundation for comments on specific substantive problems within the document's six segments.
No information is provided about the process whereby the document came into being. Any initiative that has significant potential for negative impacts needs to be the subject of an evaluation process. It is even more important that a document intended to stimulate and guide many such initiatives be itself the subject of an evaluation process.
A first step was to consider the document against a consolidated set of evaluation meta-principles (APF 2013). The initial process appears to have fallen well short of the need. The drafting does not seem to be the result of consultation with the affected public and their representative and advocacy organisations. As will be shown shortly, it neither reflects nor provides access to existing literature. It lacks any sense that initiatives must be justified and that negative impacts must be proportionate to the anticipated benefits. It overlooks the vital need for controls. And there is no mention of audit.
Some of these inadequacies are capable of being addressed through the iterative process that the Office has announced. On the other hand, some deficiencies may be difficult to address. This is because the framing has already been done, with the scope and the objectives declared. Furthermore, the initial version is already in the hands of civil servants ("It is designed to be iterated as it is used" - p.3), and they are being urged by the Cabinet Office and their Ministers to join in the fervour for big data.
The scope is declared to be 'data science'. The sense in which that term is intended is discussed in the following section. The expressed objective is one-sided: 'to give civil servants the confidence to innovate'. This is then qualified by mention of "respect for privacy", but the qualification is equivocal. In addition, the document specifically contemplates the use of data science to reach decisions and take actions that are intended to have negative consequences for people.
The stated objective should embody and juxtapose both positive and negative aspects. For example, a more reasonable formulation would be 'to balance the need to reap benefits through innovation in the use of data against other factors, including existing legal and policy restraints, and the potentially negative impacts of data expropriation and re-purposing followed by the application to it of analytical procedures, many of which are experimental in nature'.
The dominance of the project-facilitation objective plays through into the process. This statement is made: "Fundamentally, the public benefit of doing the project needs to be balanced against the risk of doing so" (p.3). But the idea is not carried through into the principles or the process. At no stage is there any act of weighing up the pros and cons and making a go/no-go decision. The assumption throughout the document is that there will be public benefit that will justify the intrusions. The purpose of the process appears to be to get the project through, avoiding harm where that can be done, and educating the public where it cannot.
The body of the document presents a list of "six key principles". However, these principles also double as a series of steps in what purports to be a PIA process. This is intellectually and structurally unsound, because 'principles' may have application at various stages of an assessment process, whereas 'process' needs to be driven by the practicalities of acquiring and analysing information, and by its nature is not simply sequential but also includes branching paths and iterations.
The announcement and the document itself both adopt the hyperbole projected by interested parties and over-excited onlookers (notably Mayer-Schönberger & Cukier 2013): "Data science offers huge opportunities for government. Harnessing new forms of data with increasingly powerful computer techniques increases operational efficiency, improves public services and provides insight for better policymaking". Although the tone of much of the document is appropriately sober rather than unduly excited, the expectation has been set that the new wonder-drug is to be adopted.
The document itself does not provide any background on what it means by 'data science'. It instead links to a brief introduction (OPMT 2016). This refers to data science as being 'the analysis and visualisation of big and complex data to provide useful insight'. Some confusion is evident, however, in that the document suggests that "data science can help us collect ... data" (p.9), despite that activity being outside the declared scope.
As is the case with many discussions of 'big data' initiatives, much of what is encompassed is merely a re-badging of longstanding techniques in management information systems, decision support systems and data mining. The introduction is supported by a Glossary, but there are no references to any of the copious external sources, and no assistance is provided to enable the reader to drill down to necessary details.
The document contains a variety of indications of the possible downsides of data analytics for people. However, the examples fall short of a useful checklist that can guide readers who are evaluating possible applications.
It is important that much more detailed guidance be provided, reflecting contemporary knowledge about the pitfalls and the risks. See Jacobs (2009), boyd & Crawford (2011), Croll (2012), Oboler et al. (2012), Wigan & Clarke (2013), Buhl & Heidemann (2013), Bennett Moses & Chan (2014), Clarke (2016a), Chan & Bennett Moses (2016), Clarke (2016b).
The document contains elements that are consistent with expectations of an effective evaluation process, such as its points about mitigation measures, e.g. 'minimum intrusion' and caution in using "data that is voluntarily in the public domain" (p.9); and "a means of recourse through which people can challenge incorrect [sic] decisions" and choice or options rather than a single, fixed mechanism (p.15). However, there are many areas in which the document falls a long way short of appropriate guidance.
In one area of great significance, there is mention of de-identifying data (pp.4, 9), anonymisation (p.17) and (re-)identifiability (p.9). But the mentions are only that, the document provides no guidance, and it fails to provide access to appropriate references. See Sweeney (2002), Ohm (2010), Slee (2011), and most relevantly, ICO (2012).
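To indicate the kind of concrete guidance that is missing, the following minimal sketch in Python applies the k-anonymity test described by Sweeney (2002) to a handful of invented records. The function name `k_anonymity`, the field names and the data are all hypothetical and do not come from the Framework. The test addresses only one narrow aspect of re-identification risk, and passing it does not establish that data is safely anonymised.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for all of the given quasi-identifiers."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Hypothetical, invented records from which names have already been removed.
records = [
    {"postcode": "AB1", "birth_year": 1970, "sex": "F", "condition": "asthma"},
    {"postcode": "AB1", "birth_year": 1970, "sex": "F", "condition": "diabetes"},
    {"postcode": "AB1", "birth_year": 1982, "sex": "M", "condition": "flu"},
    {"postcode": "CD2", "birth_year": 1982, "sex": "M", "condition": "asthma"},
]

# With postcode included, two records are unique on their quasi-identifiers.
k_full = k_anonymity(records, ["postcode", "birth_year", "sex"])

# Dropping postcode (a crude form of generalisation) enlarges the groups.
k_reduced = k_anonymity(records, ["birth_year", "sex"])

print(k_full, k_reduced)
```

In this invented example, removing names is not enough: the combination of postcode, birth year and sex still singles out two of the four records (k = 1). Dropping the postcode raises k to 2, at the cost of less detailed data.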
The document uses the notion of ethics to consider "issues which sit outside the law" (p.3). Unfortunately, the invocation of ethics is purely nominal, with the vague explanation that "Ethics are people's moral understanding of what is right". Moreover, ethics is confused with "public opinion ... how people would reasonably expect their personal data to be used, particularly if for a different purpose" (p.14). An alternative would have been to refer to 'public policy' questions, or to 'public acceptability'.
If the document is to justify its label as 'an ethical framework', a much more considered application of ethics is necessary. See UNSD (1985), Wright (2011), Brey (2012), SEP (2015), ASA (2016).
A particular aspect of the document that is clearly inconsistent with the nominal focus on ethics is the statement that "The exemptions within the Data Protection Act around crime, fraud and national security still apply" (p.3). Any ethical analysis would inevitably conclude that it is untenable to simply exempt such applications from the guidance in relation to data science. There are far greater uncertainties about conclusions drawn from loose analytical techniques than is the case with conventional data processing, and individuals can be subject to very serious consequences if they are wrongly suspected and even accused of transgression in relation to "crime, fraud and national security". While some kinds of qualifications may be justifiable, exemption is completely inappropriate.
The document fails to apply the depth of knowledge that exists about Privacy Impact Assessments, and fails to provide readers with the capacity to drill down into that literature. See, in particular, Warren et al. (2008), Clarke (2009), Clarke (2011), Wright & de Hert (2012).
The Information Commissioner's Office published two versions of a PIA Handbook in 2007 and 2009, and its current guidance document is a Code of Practice (ICO 2014). The Cabinet Office document claims that the ICO "has confirmed that the checklist can form the basis of" a PIA (p.3). This is, however, seriously misleading. Some aspects of the process described in the document are likely to generate information of relevance to the PIA process; but the process in the Framework document is not structured in an appropriate manner to satisfy the needs of a PIA, and in many specific aspects it falls far short of the requirements identified in the literature, and those expressed in the ICO's Code.
Even if BREXIT results in changes, existing British laws remain in force, and the referendum result in itself does not extinguish extant obligations under EU law. Of particular relevance are human rights and data protection statutes, the provisions of enabling legislation for individual agencies and programs, case law arising from those statutes, and interpretations and advice of the Information Commissioner.
The document makes a throwaway comment that "The law ... sets out some important principles about how you can use data ... Those working with data should be aware of these and always act within them" (p.3). However, the indication that the document 'brings together these laws and standards' is not fulfilled. Apart from occasional mentions, and a couple of links within the text to a segment of the Data Protection Act and definitions published by the Information Commissioner's Office, there is no legal analysis, there is no compendium of relevant generic laws, and there is no reference-list to enable civil servants to drill down to the specific laws that they are meant to be aware of and ensure compliance with.
An example of a specifically misleading statement is "Try and ensure that ... where possible people can view and extract their own personal data ..." (p.15). That is not a matter of choice as the text suggests, but rather a formal obligation under s.7 of the Data Protection Act.
Remarkably, the short table that is (mis)represented as 'acting as your Privacy Impact Assessment' (p.6) fails to even mention the necessity of an assessment of the initiative's compliance with applicable laws. A brief and shallow (although emphasised) passage in the Annex entry for Principle / Process-Item 2 states that "Data also has to be legally collected, stored, shared, processed and deleted. The Data Protection Act sets out guidance on this". No link is provided. In fact, the Act contains little guidance. (For example, none of the terms 'collect', 'store', 'share' and 'delete' appears in the Act itself.) Rather than pointing only to the ICO's definitions page, the document should highlight and link to the guidance pages (ICO 2016).
This section considers the individual segments of the Framework in isolation. Each item has been labelled as both a Principle and an element of the assessment Process, in order to keep in view the incompatibility of the two roles.
This lays a seriously inadequate foundation for the evaluation process, in the following ways:
Linked with this, consideration of both compatibility of purpose and public perceptions is deferred until Principle / Process-Item 4. It is a serious error to defer these critical aspects to a stage where the organisation has provisionally committed to the initiative, and many of the design decisions have been made.
This is a positive contribution, but as noted earlier a much clearer explanation of the nature, process and pitfalls of data analytics is necessary, together with assistance in drilling down to greater detail.
This refers to data quality, but falls far short of providing the guidance needed by civil servants evaluating initiatives. It also largely overlooks three major sources of problems:
Common problems in data analysis are the assumptions that the technique makes about the nature of the data, including the scale on which the data was recorded, its accuracy and precision, and the handling of missing data. The statement is made (but seemingly only in relation to a small sub-set of data analytic techniques, viz. "machine learning algorithms") that "These tools are dependent on the input data and do have limits" (p.11). However, the discussion section on this complex topic lacks structure and clarity, and no references are provided.
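The point about the handling of missing data can be illustrated by a minimal sketch in Python. The figures are invented and the variable names are hypothetical; none of this comes from the Framework. The sketch shows only that three common treatments of missing values lead to different summary statistics for the same data-set, not which treatment is appropriate in any particular case.

```python
from statistics import mean, pstdev

# Hypothetical weekly amounts; None marks a value that was never recorded.
amounts = [120, None, 80, None, 100, 300]

# Treatment 1: discard the missing values.
dropped = [a for a in amounts if a is not None]

# Treatment 2: assume that a missing value means zero.
zero_filled = [0 if a is None else a for a in amounts]

# Treatment 3: replace each missing value with the mean of the observed values.
observed_mean = mean(dropped)
mean_filled = [observed_mean if a is None else a for a in amounts]

# The three treatments give different pictures of the same data.
print("dropped:     mean", mean(dropped), "spread", round(pstdev(dropped), 1))
print("zero-filled: mean", mean(zero_filled), "spread", round(pstdev(zero_filled), 1))
print("mean-filled: mean", mean(mean_filled), "spread", round(pstdev(mean_filled), 1))
```

Treating missing values as zero lowers the mean from 150 to 100, while filling with the observed mean preserves the mean but understates the spread. An analysis that does not disclose which treatment was used cannot be properly evaluated.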
It is particularly concerning to see a glib example provided, of the kind beloved of the proponents of big data analytics: "searching on Google for flu symptoms is a good indicator of having the flu" (p.11). This both conflates correlation with causality, and mistakes moderate correlation for high correlation. These are precisely the kinds of blunders that a Framework should be designed to prevent; yet this is presented as an example of appropriate use of a data-set. This highlights a critical omission - the failure to warn readers that:
The document badly needs to be augmented, and to provide access to deeper treatments of data quality, data semantics, data compatibility and process quality issues. See Clarke (2015, 2016b). In addition, inadequate though the Data Science Association Code of Professional Conduct is in relation to the protection of people, DSA (2016) includes content of relevance to the quality challenges involved in the conduct of data analytics.
This segment of the document is also seriously inadequate, in the following ways:
The phrasing of the text is such that it invites civil servants to avoid transparency, and even to jettison accountability, if respecting those vital requirements "would jeopardise the aim" (p.4). The document goes so far as to suggest that there are circumstances in which an agency "cannot talk about project aim" (p.5). This is a highly contentious suggestion, in direct conflict with principles of democracy and open government. It also contemplates projects being conducted that have no oversight or accountability (p.5). Once again, this would appear to be a serious breach of governance principles.
At the very least, the blasé exemption on p.4 needs to be greatly reduced in scope, by replacing the highly permissive "unless" and limiting the statement's reach, e.g. "Be open except where, and only to the extent that, openness on matters of detail would jeopardise the aim". However, it may require more substantial re-consideration than editorial adjustments can deliver.
This segment overlooks the questions of shared data, and of open data. It is a common misconception that data that an organisation can legitimately gain access to can be put to whatever purpose the organisation determines. For example, data from electoral rolls escapes into many locations; but the uses to which the data can be put are circumscribed. This needs to be recognised not as an exception, but as a principle.
The UK Cabinet Office's 'Data Science Ethical Framework' was assessed on the basis of prior contributions in the refereed literature that provide relevant information about the nature and potential negative impacts of big data, how to evaluate proposals, how to conduct ethical analyses, and how to conduct PIAs.
The document is seriously deficient. It is so weak as to have the appearance of purely nominal guidance, designed not to filter out inappropriate applications of data analytics, but rather to provide a veneer of respectability, to pre-counter criticisms that government agencies are conducting big data activities on an ad hoc basis, and to thereby enable projects to proceed relatively unhindered.
In order to overcome the appearance of insincerity, it is essential that the Cabinet Office very promptly revise the document, considerably enhance it in order to address the long list of deficiencies noted in this paper, and ensure distribution of the revised version to all agencies that have downloaded the initial version.
APF (2013) 'Meta-Principles for Privacy Protection' Australian Privacy Foundation, March 2013, at http://www.privacy.org.au/Papers/PS-MetaP.html
ASA (2016) 'Ethical Guidelines for Statistical Practice' American Statistical Association, April 2016, at http://ww2.amstat.org/about/pdfs/EthicalGuidelines.pdf
Bennett Moses L. & Chan J. (2014) 'Using Big Data for Legal and Law Enforcement Decisions: Testing the New Tools' University of New South Wales Law Journal 37, 2 (2014) 643-678, at http://papers.ssrn.com/sol3/Papers.cfm?abstract_id=2513564
boyd D. & Crawford K. (2011) 'Six Provocations for Big Data' Proc. Symposium on the Dynamics of the Internet and Society, September 2011, at http://ssrn.com/abstract=1926431
Brey P.A.E. (2012) 'Anticipating ethical issues in emerging IT' Ethics and Information Technology 14, 4 (2012) 305-317
Buhl H.U. & Heidemann J. (2013) 'Big Data: A Fashionable Topic with(out) Sustainable Relevance for Research and Practice?' Editorial, Business & Information Systems Engineering 2 (2013) 65-69, at http://www.bise-journal-archive.org/pdf/01_editorial_36315.pdf
Chan J. & Bennett Moses L. (2016) 'Is Big Data challenging criminology?' Theoretical Criminology 20, 1 (February 2016) 21-39
Clarke R. (1997) 'Introduction to Dataveillance and Information Privacy, and Definitions of Terms' Xamax Consultancy Pty Ltd, August 1997, at http://www.rogerclarke.com/DV/Intro.html
Clarke R. (2009) 'Privacy Impact Assessment: Its Origins and Development' Computer Law & Security Review 25, 2 (April 2009) 123-135, PrePrint at http://www.rogerclarke.com/DV/PIAHist-08.html
Clarke R. (2011) 'An Evaluation of Privacy Impact Assessment Guidance Documents' International Data Privacy Law 1, 2 (March 2011), PrePrint at http://www.rogerclarke.com/DV/PIAG-Eval.html
Clarke R. (2015) 'Quasi-Empirical Scenario Analysis and Its Application to Big Data Quality' Proc. 28th Bled eConference, Slovenia, 7-10 June 2015, PrePrint at http://www.rogerclarke.com/EC/BDSA.html
Clarke R. (2016a) 'Big Data, Big Risks' Information Systems Journal 26, 1 (January 2016) 77-90, PrePrint at http://www.rogerclarke.com/EC/BDSA.html
Clarke R. (2016b) 'Quality Assurance for Security Applications of Big Data' Proc. European Intelligence and Security Informatics Conference (EISIC), Uppsala, 17-19 August 2016, PrePrint at http://www.rogerclarke.com/EC/BDQAS.html
Croll A. (2012) 'Big data is our generation's civil rights issue, and we don't know it: What the data is must be linked to how it can be used' O'Reilly Radar, 2012
DSA (2016) 'Data Science Code Of Professional Conduct' Data Science Association, undated but apparently of 2016, at http://www.datascienceassn.org/sites/default/files/datasciencecodeofprofessionalconduct.pdf
ICO (2012) 'Anonymisation: managing data protection risk: code of practice' UK Information Commissioner's Office, November 2012, at http://ico.org.uk/for_organisations/data_protection/topic_guides/~/media/documents/library/Data_Protection/Practical_application/anonymisation-codev2.pdf
ICO (2014) 'Conducting privacy impact assessments: code of practice' UK Information Commissioner's Office, February 2014, at https://ico.org.uk/media/for-organisations/documents/1595/pia-code-of-practice.pdf
ICO (2016) 'Guide to data protection' UK Information Commissioner's Office, June 2016, at https://ico.org.uk/for-organisations/guide-to-data-protection/
Ipsos Mori (2016) 'Public Dialogue on the ethics of data science in government' Ipsos MORI, May 2016, at https://www.ipsos-mori.com/Assets/Docs/Publications/data-science-ethics-in-government.pdf
Jacobs A. (2009) 'The Pathologies of Big Data' Communications of the ACM 52, 8 (August 2009) 36-44
Mayer-Schönberger V. & Cukier K. (2013) 'Big data: A revolution that will transform how we live, work, and think' Houghton Mifflin Harcourt, 2013
Oboler A., Welsh K. & Cruz L. (2012) 'The danger of big data: Social media as computational social science' First Monday 17, 7 (2 July 2012), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3993/3269
Ohm P. (2010) 'Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization' 57 UCLA Law Review 1701 (2010) 1701-1711, at http://www.patents.gov.il/NR/rdonlyres/E1685C34-19FF-47F0-B460-9D3DC9D89103/26389/UCLAOhmFailureofAnonymity5763.pdf
OPMT (2016) 'Data science an introduction' Open Policy Making Toolkit, undated and unversioned but presumably of 2016, at https://www.gov.uk/guidance/open-policy-making-toolkit/a-z#data-science-introduction
SEP (2015) 'Computer and Information Ethics' Stanford Encyclopedia of Philosophy, October 2015, at http://plato.stanford.edu/entries/ethics-computer/
Slee T. (2011) 'Data Anonymization and Re-identification: Some Basics Of Data Privacy: Why Personally Identifiable Information is irrelevant' Whimsley, September 2011, at http://tomslee.net/2011/09/data-anonymization-and-re-identification-some-basics-of-data-privacy.html
Sweeney L. (2002) 'k-anonymity: a model for protecting privacy' International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10, 5 (2002) 557-570, at http://arbor.ee.ntu.edu.tw/archive/ppdm/Anonymity/SweeneyKA02.pdf
UKCO (2016) 'Data Science Ethical Framework' UK Cabinet Office, v.1, 19 May 2016, at https://www.gov.uk/government/publications/data-science-ethical-framework
UNSD (1985) 'Declaration of Professional Ethics' United Nations Statistical Division, August 1985, at http://unstats.un.org/unsd/dnss/docViewer.aspx?docID=93#start
Warren A., Bayley R., Charlesworth A., Bennett C., Clarke R. & Oppenheimer C. (2008) 'Privacy Impact Assessments: international experience as a basis for UK Guidance' Computer Law & Security Review 24, 3 (May-June 2008) 233-242
Wigan M.R. & Clarke R. (2013) 'Big Data's Big Unintended Consequences' IEEE Computer 46, 6 (June 2013) 46-53, PrePrint at http://www.rogerclarke.com/DV/BigData-1303.html
Wright D. (2011) 'A framework for the ethical impact assessment of information technology' Ethics and Information Technology 13, 3 (September 2011) 199-226
Wright D. & De Hert P. (eds) (2012) 'Privacy Impact Assessments' Springer, 2012
Roger Clarke is Principal of Xamax Consultancy Pty Ltd, Canberra. He is also a Visiting Professor in Cyberspace Law & Policy at the University of N.S.W., and a Visiting Professor in Computer Science at the Australian National University. He has conducted privacy-related research, consultancy and advocacy since the 1970s. He was lead author of the ICO's initial PIA Handbook in 2007. He has been a Board member of the Australian Privacy Foundation since its formation in 1987, including 8 years as Chair. He was for a decade a member of the international Advisory Board of London-based Privacy International.
Created: 21 July 2016 - Last Amended: 22 July 2016 by Roger Clarke - Site Last Verified: 15 February 2009