
Big Data, Big Risks

Review Version of 20 September 2014

Roger Clarke **

© Xamax Consultancy Pty Ltd, 2014

Available under an AEShareNet Free
for Education licence or a Creative Commons 'Some
Rights Reserved' licence.

This document is at


The 'big data' literature, academic as well as professional, has a very strong focus on opportunities. Far too little attention has been paid to the threats that arise from re-purposing data, consolidating data from multiple sources, applying analytical tools to the resulting collections, drawing inferences, and acting on them. Through a blend of scenarios and analysis of quality factors in 'big data' and 'big data analytics', this paper draws attention to the moral and legal responsibility of computing researchers and professionals to temper their excitement, and apply reality checks to their promotional activities.


1. Introduction

As sensor technologies have matured, and as individuals have been encouraged to contribute data into organisations' databases, more transactions than ever before have been captured. Meanwhile, improvements in data-storage technologies have resulted in the cost of evaluating, selecting and destroying old data being now considerably higher than that of simply letting it accumulate. The glut of stored data has greatly increased the opportunities for data to be inter-related, and analysed. The moderate enthusiasm engendered by 'data warehousing' and 'data mining' in the 1990s has been replaced by unbridled euphoria about 'big data' and 'data analytics'. What could possibly go wrong?

This paper adopts a two-pronged approach to that question. The body of the paper presents a brisk analysis of data quality and decision quality factors. In support of that analysis, some quasi-empirical scenarios are provided in the Sidebar. These are not case studies of specific instances of big data at work. One reason is that they are generalised, in order to encompass a range of issues not all of which are likely to arise in each particular real-life application. Another is that the scenarios generally comprise elements that have been reported in the literature, but also elements that are hypothetical or speculative. The intention is to test the assumptions underlying the big data value proposition, not to pretend to be a substitute for deep case studies of actual experience.

Sidebar: Some Big Data Scenarios

(1) Precipitation Events

Historical rainfall data is acquired from many sources, across an extended period, and across a range of geographical locations. The collectors, some of them professional but mostly amateurs, used highly diverse collection methods, with little calibration and few controls. The data is consolidated into a single collection. A considerable amount of data manipulation is necessary, including the interpolation of data for empty cells, and the arbitrary disaggregation of long-period data into the desired shorter periods. An attempt to conduct a quality audit against such sources as contemporaneous newspaper reports proves to be too expensive, and is curtailed. Analytical techniques are applied to the data. Confident conclusions are reached about historical fluctuations and long-term trends. Climate-change sceptics point to the gross inadequacies in the database, and argue that climate-change proponents, in conducting their crusade, have played fast and loose with scientific principles.

(2) Creditworthiness

A financial services provider combines its transactions database, its records of chargebacks arising from fraudulent transactions, and government statistics regarding the geographical distribution of income and wealth. It draws inferences about the risks that its cardholders create for the company. It uses those inferences in its decision-making about individual customers, including credit-limits and the issue of replacement and upgraded cards. Although not publicised by the company, this gradually becomes widely known, and results in negative media comments and recriminations on social media. Questions are raised about whether it conflicts with 'redlining' provisions in various laws. Discrimination against individuals based on the behaviour of other customers of merchants that they use is argued to be at least immoral, and possibly illegal, but certainly illogical from an individual consumer's perspective. The lender reviews the harm done to its reputation.

(3) Ad Targeting

A social media services provider accumulates a vast amount of social transaction data, and some economic transaction data, through activity on its own sites and those of strategic partners. It applies complex data analytics techniques to this data to infer attributes of individual digital personae. It projects third-party ads and its own promotional materials based on the inferred attributes of online identities and the characteristics of the material being projected. The 'brute force' nature of the data consolidation and analysis means that no account is taken of the incidence of partial identities, conflated identities, and obfuscated and falsified profiles. This results in mis-placement of a significant proportion of ads, to the detriment mostly of advertisers, but to some extent also of individual consumers. It is challenging to conduct audits of ad-targeting effectiveness, and hence advertisers remain unaware of the low quality of the data and of the inferences. The first problems that arise to undermine this approach to business are related to inappropriate content appearing on children's screens.

(4) Foster Parenting

A government agency responsible for social welfare programs consolidates data from foster-care and unemployment benefits databases, and discovers a correlation between having multiple foster parents and later being chronically unemployed. On the basis of this correlation, it draws the inference that the longstanding practice of moving children along a chain of foster-parents should be discontinued. It accordingly issues new policy directives to its case managers. Because such processes lack transparency, and foster-children are young and largely without a voice, the new policy remains 'under the radar' for some time. Massive resistance then builds from social welfare NGOs, as it becomes apparent that children are being forced to stay with foster-parents with whom they are fundamentally incompatible, and that accusations of abuse are being downplayed because of the forcefulness of the policy directions based on mysterious 'big data analytics'.

(5) Cancer Treatment

Millions of electronic medical records reveal that cancer sufferers who take a certain combination of aspirin and orange juice see their disease go into remission. Research funding agencies are excited by this development, and transfer resources to 'big health data analytics' and away from traditional systemic research into causes, pathways and treatments of disease. Pharmaceutical companies follow the trend by purchasing homeopathic suppliers and patenting herb genes. The number of doctoral and post-doctoral positions available in medical science drops sharply. After 5 years, enough data has become available for the conclusion to be reached that the health treatments 'recommended' by these methods are ineffectual. A latter-day prophet emerges who decries 'the flight from reason', fashion shifts back to laboratory rather than digital research, and medical researchers slowly regain their previous high standing. The loss of momentum is estimated to have delayed progress by 10-15 years and generated a shortage of trained medical scientists.

(6) Fraud Detection

A company that has large sums of money flushing through its hands is under pressure from regulators, knows that stock exchanges run real-time fraud detection schemes, and accepts at face value the upbeat claims made by the proponents of big data analytics. It combines fraud-detection heuristics with inferences drawn from its large transaction database, and generates suspects. It assigns its own limited internal investigation resources to these suspects, and refers some of them to law enforcement agencies. The large majority of the cases investigated internally are found to be spurious. Little is heard back from law enforcement agencies. Some of the suspects discover that they are being investigated, and threaten to take their business elsewhere and to initiate defamation actions. The investigators return to their tried-and-true methods of locating and prioritising suspicious cases.

(7) Insider Detection

A government agency receives terse instructions from the government to get out ahead of the whistleblower menace, with Macbeth, Brutus, Iago, Judas Iscariot, Manning and Snowden invoked as examples of trusted insiders who turned. The agency increases the intrusiveness and frequency of employee vetting, and lowers the threshold at which positive vetting is undertaken. To increase the pool of available information, it exercises powers to gain access to border movements, credit history, court records, law enforcement agencies' persons-of-interest lists, and financial tracking alerts. It applies big data analytics to a consolidated database comprising all internal communications, and all postings to social media gathered by a specialist external services corporation.

The primary effect of these measures is to further reduce employee loyalty to the organisation. To the extent that productivity is measurable, it sags. The false positives arising from data analytics explode, because of the leap in negative sentiments expressed on internal networks and in social media, and in the vituperative language the postings contain. The false positives greatly increase the size of the haystack, making the presumed needles even harder to find. The poisonous atmosphere increases the opportunities for a vindictive insider to obfuscate their activities and even to find willing collaborators. Eventually cool heads prevail, by pointing out how few individuals ever actually leak information without authority. The wave of over-reaction slowly subsides, leaving a bruised and dissatisfied workforce with a bad taste in its mouth.
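The base-rate effect underlying this scenario can be made concrete with Bayes' rule. The following sketch uses entirely invented figures (prevalence, sensitivity and false-positive rate are assumptions for illustration, not data from any real vetting program):

```python
# Illustrative base-rate arithmetic for insider detection.
# Assumed figures: 1 genuine leaker per 10,000 staff; a detector that is
# 99% sensitive with a 5% false-positive rate.
prevalence = 1 / 10_000
sensitivity = 0.99
false_positive_rate = 0.05

# P(flagged) = true positives + false positives
p_flagged = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# P(actual leaker | flagged), by Bayes' rule
p_leaker_given_flag = sensitivity * prevalence / p_flagged

print(f"{p_leaker_given_flag:.4f}")  # ~0.002: roughly 99.8% of flags are false alarms
```

Even a detector far better than any plausible sentiment-analysis tool would, on these assumptions, bury each genuine insider under hundreds of false accusations.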

[The inspiration for each of these scenarios is outlined in the Addendum, which is not intended for publication.]

2. Big Data Quality

The computing and information systems literatures contain a body of material relating to data quality factors. See, for example, Wang & Strong (1996) and the still-emergent ISO 8000 series of Standards. The primary factors are listed in Table 1. To date, however, there is limited evidence of the body of knowledge being applied by the 'big data' movement, either by practitioners or researchers.

Table 1: Intrinsic and Contextual Data Quality Factors

Each item of data, when it is gathered, represents a measurement against some kind of scale. Some data arises from measurement against a ratio scale, and is capable of being subjected to analysis by powerful statistical tools. All too frequently, however, data collected on interval, and even on merely ordinal, scales (such as Likert-scale data) is blithely assumed to be on a ratio scale, in order to justify the application of statistical inferencing. Meanwhile, a great deal of data is collected against nominal scales (including text and images), which support only weaker analytical tools, such as fuzzy matching and fuzzy logic. A further challenge arises where data that has been measured against different kinds of scale is consolidated and analysed. The applicability of analytical tools to mixed-scale data is more of a murky art than a precise science. All of these challenges are present in the ad targeting, fraud detection and insider detection scenarios.
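The ordinal-as-ratio hazard can be shown in a few lines. This is a minimal sketch with invented survey responses, not drawn from any of the scenarios:

```python
# Illustrative only: Likert responses (1 = strongly disagree .. 5 = strongly agree)
# are ordinal, yet analysts routinely compute means as if the scale were ratio.
from statistics import mean, median

responses = [1, 1, 1, 5, 5, 5, 5]  # a polarised sample

# Treating the data as ratio-scale yields a 'mean' of about 3.29 --
# yet no respondent chose 3, and the distance 1->2 need not equal 4->5.
naive_mean = mean(responses)

# Ordinal-safe summaries: the median and the frequency distribution.
robust_median = median(responses)
counts = {v: responses.count(v) for v in sorted(set(responses))}

print(round(naive_mean, 2), robust_median, counts)
```

The mean manufactures a middle-of-the-road respondent who does not exist; the median and the counts preserve what the ordinal data actually says.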

The meaning of each individual item of data is capable of definition at the time it is gathered. Since the decline of the resource-intensive waterfall method of software development, however, it is much less common for a data dictionary to be even established, let alone maintained. As a result, data definitions may be unclear, ambiguous and even implicit. The lack of clarity about the original meaning increases the likelihood that the meaning will change over time, and that different and even mutually inconsistent interpretations of the same data-item will eventuate. In the fraud detection scenario, suspicious inconsistencies can arise from both attempts to deceive and semantic issues even within a single database let alone within a consolidation of multiple, inherently incompatible databases.

A further consideration when assessing the quality of data is that what is collected, against what scales, and with what trade-offs between data quality and collection costs, all reflect the purpose of collection and the value-judgements of the sponsor. Administrative actions in relation to fraud, and particularly prosecutions, are undermined by poor-quality evidence. In some contexts, commercial liability might arise (e.g. the creditworthiness scenario), while in others a duty of care may be breached (e.g. in the foster parenting scenario).

Particularly where data is collected frequently over time, data collection may also involve compression, through sampling, filtering and averaging. This problem arises in the precipitation events scenario. Further, where interesting outliers are being sought, compression is likely to ensure that the potentially most relevant data is absent from the collection.
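A minimal sketch of how averaging at collection time erases the very outliers a later analyst may be seeking. The hourly rainfall figures are hypothetical:

```python
# Hypothetical hourly rainfall (mm). One extreme event at index 5.
hourly = [0.0, 0.2, 0.1, 0.0, 0.3, 48.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.4]

# Compression at collection time: store only 12-hour averages.
block = 12
averaged = [sum(hourly[i:i + block]) / block for i in range(0, len(hourly), block)]

# The 48 mm deluge survives only as a modest-looking mean of ~4.1 mm ...
print(averaged)

# ... so a later search for extreme events over the stored data finds nothing.
threshold = 20.0
extremes_in_raw = [v for v in hourly if v >= threshold]
extremes_in_stored = [v for v in averaged if v >= threshold]
print(len(extremes_in_raw), len(extremes_in_stored))
```

The compressed series is perfectly 'clean', but the question of greatest interest can no longer be answered from it.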

To be useful, data needs to be associated with real-world entities, with the data currently held about any particular entity being that entity's digital persona (Clarke 1994, 2014). The reliability of the association depends on the attributes that are selected as identifiers, and the process of association has error-factors. In some circumstances the link between the digital persona and the underlying entity is challenging (pseudonymity), and in some cases no links can be achieved (anonymity). In some circumstances, in order to protect important interests and comply with relevant laws, the link may need to be broken (de-identification / anonymisation - UKICO 2012, DHHS 2012). On the other hand, rich data-sets are vulnerable to re-identification procedures. These problems afflict all big data collections that are intended to support longitudinal studies, and are particularly serious where the data is sensitive, as in the case of big health data and big social data.
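The error-factors in associating data with entities can be sketched with a naive linkage over weak identifiers. All records, fields and values below are invented for illustration:

```python
# Hypothetical records from two sources, linked naively on (surname, postcode).
source_a = [
    {"surname": "Nguyen", "postcode": "2600", "income": 40_000},
    {"surname": "Smith",  "postcode": "2611", "income": 90_000},
]
source_b = [
    {"surname": "Nguyen", "postcode": "2600", "defaulted": True},   # a DIFFERENT Nguyen
    {"surname": "Smith",  "postcode": "2611", "defaulted": False},
]

def naive_link(a_recs, b_recs):
    """Join on a weak identifier; common surnames conflate distinct people."""
    linked = []
    for a in a_recs:
        for b in b_recs:
            if (a["surname"], a["postcode"]) == (b["surname"], b["postcode"]):
                linked.append({**a, **b})
    return linked

personae = naive_link(source_a, source_b)
# The first persona now carries another person's default flag.
print(personae[0])
```

The resulting digital persona is internally consistent, and therefore looks 'clean' to any downstream quality check, yet it corresponds to no real-world entity at all.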

Over time, many threats arise to data integrity, including the loss of metadata such as the scale against which data was originally collected, the definition at the time of collection, the data's provenance, any supporting evidence for the data's quality, undocumented changes in meaning over time, and loss of contextual information that would have ensured its appropriate interpretation.

The big data movement commonly features the use of data for a purpose extraneous to its original purpose. The many data quality issues identified above are exacerbated by the loss of context, the lack of clarity about the trade-offs applied at the time of collection, and the greatly increased likelihood of misinterpretation. The big data movement commonly involves the further step of physically or virtually consolidating data from multiple sources. This depends on linkages among data-items whose semantics and syntactics are different, perhaps substantially, perhaps subtly. The scope for misunderstandings and misinterpretation multiplies. Retrospective studies across long periods, as in the precipitation events, fostering and fraud detection scenarios, face many or all of these problems.

Theorists and practitioners perceive deficiencies in the data, such as missing elements, and inconsistencies among seemingly similar data-items gathered from two or more sources. To address these perceived deficiencies, they devise data 'scrubbing', 'cleansing' or 'cleaning' processes. A few such processes use an external, authoritative reference-point, such as a database of recognised location-names and street-addresses. Most, however, lack any external referent, and are merely based on 'logical data quality', i.e. internal consistency within the consolidated data-sets (e.g. Jagadish et al. 2014). As a result, the notion of 'cleanliness' primarily relates to the ease with which analytical tools can be applied to the data, rather than to the quality of the data itself. This issue is particularly apparent in the precipitation events scenario.
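The distinction between internal consistency and real quality can be illustrated with the interpolation of empty cells mentioned in the precipitation events scenario. The monthly figures are hypothetical:

```python
# Hypothetical monthly rainfall series (mm) with missing cells (None).
series = [102.0, None, 95.0, None, None, 110.0]

def fill_by_interpolation(xs):
    """'Clean' the series by linear interpolation between known neighbours.
    This restores internal consistency, but invents values with no external
    evidence: the gap may in reality have held a drought or a deluge."""
    xs = list(xs)
    known = [i for i, v in enumerate(xs) if v is not None]
    for i, v in enumerate(xs):
        if v is None:
            lo = max(k for k in known if k < i)
            hi = min(k for k in known if k > i)
            frac = (i - lo) / (hi - lo)
            xs[i] = xs[lo] + frac * (xs[hi] - xs[lo])
    return xs

cleaned = fill_by_interpolation(series)
print(cleaned)
```

The 'cleaned' series passes every internal check and feeds smoothly into analytical tools; nothing in it records that half the values were manufactured.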

On the basis of data quality factors, a great deal of 'big data' can be reasonably depicted as being a melange. Data-sets which are of uncertain original quality, and lower current quality, and which have uncertain associations with real-world entities, are combined, by means of unclear validity, and modified by unaudited means, in order to achieve consolidated digital personae that have uncertain validity, and have uncertain relationships with any particular real-world entity. To this melange, powerful analytical tools are then applied.

3. The Quality of Big Data Analytics

There is a variety of ways in which analytical tools can be applied to 'big data'. The analyses may test hypotheses, which may be predictions from theory, existing heuristics, or hunches. Inferences may be drawn about the digital personae, which may be further inferred to apply to the population of entities that the data purports to represent, or to segments of that population. Profiles may be constructed for categories of entities that are of specific interest, such as heavy weather incidents, risk-prone card-holders, wellness recipes, or welfare cheats. Outliers of many different kinds can be identified. Inferences can be drawn about individual entities, directly from a particular digital persona, or from apparent inconsistencies within a consolidated digital persona, or through comparison of a digital persona against the population of similar entities, or against a previously-defined profile, or against a profile created through analysis of that particular big data collection.
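One of the simplest of these techniques, comparing a digital persona against the population of similar entities, can be sketched as a z-score test. The transaction counts are invented:

```python
# Hypothetical monthly transaction counts for a population of account-holders.
from statistics import mean, pstdev

population = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
suspect = 41  # one persona's count

mu, sigma = mean(population), pstdev(population)
z = (suspect - mu) / sigma

# A z-score this large flags the persona as an outlier -- but the flag says
# nothing about WHY: fraud, a shared account, or simply a data-entry error.
flagged = abs(z) > 3.0
print(round(z, 1), flagged)
```

The mechanism identifies statistical unusualness, nothing more; every subsequent step from 'outlier' to 'suspect' is an inference layered on top of it.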

It is feasible for big data analytics to be used as a decision system. This may be done formally, by, for example, automatically sending infringement or 'show cause' notices. However, it is also possible for big data analytics to become a decision system not through a conscious decision by an organisation, but by default. This can arise where a decision-maker becomes lazy, or is replaced by a less experienced person who is not in as good a position to apply human intelligence as a means of checking the reasonableness of the inferences drawn by software.

Where decisions are made by analytics, or inferences arising from analytics are highly influential in decision-making, perhaps to the point of being a default decision that a human has to actively overrule, a number of concerns arise about decision-quality. On what scale, with what accuracy and what precision, was the data collected that was instrumental in leading to the decision, and was the inferencing mechanism that was used really applicable to those categories of data? Did the data mean what the inferencing mechanism implicitly treated it as meaning? To the extent that data was consolidated from multiple sources, were those sources compatible with one another in respect of the data's scale, accuracy, precision, meaning and integrity? The precipitation events, fraud detection and insider detection scenarios are challenged by these concerns.

A common feature of these circumstances is that real-world decisions depend on complex models that feature confounding, intervening and missing variables. Correlations are commonly of a low grade, yet may nonetheless be treated, perhaps implicitly, as though the relationships were causal, and causal in one direction rather than the other. If decisions are being made that have real impacts, is the decision-process and are the decision-criteria transparent? And are they auditable? And are they subject to appeal processes and review? The cancer scenario is a particularly concerning example of naive correlation as a substitute for understanding. In the insider detection scenario, the negative consequences may affect far fewer people, but they are potentially quite vicious for an individual falsely accused of disloyalty.
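The confounding-variable problem can be demonstrated with a small simulation. The variable names allude to the foster parenting scenario, but the model and all numbers are invented: a hidden 'severity' factor drives both observed variables, which have no direct causal link to each other:

```python
import random

random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles
    so the sketch needs nothing beyond the standard library."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hidden confounder: severity of a child's circumstances.
severity = [random.gauss(0, 1) for _ in range(1000)]
placements = [s + random.gauss(0, 0.5) for s in severity]    # driven by severity
unemployment = [s + random.gauss(0, 0.5) for s in severity]  # also driven by severity

r = pearson(placements, unemployment)
print(round(r, 2))  # a strong correlation, with zero direct causation
```

Analytics applied to the two observed columns alone would report a strong relationship, and a naive policy-maker might infer that reducing placements reduces unemployment; by construction, it would not.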

An even more problematic situation can arise if the nominal decision-maker is not in a position to appreciate the rationale underlying the 'recommendation' made by the analytical procedure, and hence feels themselves to be incapable of second-guessing the system. With what were once called 'third-generation' development tools, the rationale was evident in the form of an algorithm or procedure, which may have been explicitly documented externally to the software, but was at least extractable by any person who had access to the source-code and who had the capacity to read it. The fourth generation of development tools merely expressed the decision-model in a more generally-applicable manner.

The advent of the fifth-generation adopted a different approach, however. Rather than a model of the decision, this involved a model of the problem-domain. It became much more difficult to understand how the model (commonly expressed in logic, rules or frames) applied to particular circumstances. Then, with the sixth generation, an even greater barrier to understanding arose. With a neural network, there is no formal model of a decision or even of a problem-domain. There is just an empirical pile, and a series of inscrutable weightings that have been derived through mathematical processes, and that are then applied to each new instance (Clarke 1991). There have been expressions of concern from many quarters about the delegation of decision-making to software whose behaviour is fundamentally unauditable (e.g. Roszak 1986, Dreyfus 1992, boyd & Crawford 2012).

As an alternative to automated decision-making, big data analytics are capable of being used as a form of decision support system, whereby a human decision-maker evaluates the inferences before applying them to any real-world purpose. However, the person may have great difficulty grasping the details of data provenance, data quality, data meaning, data relevance, and the rationale that have given rise to the recommendation.

4. Impacts

Given the uncertain quality of data and of decision processes, many inferences from 'big data' are currently being accorded greater credibility than they actually warrant. Inevitably resources will be misallocated. Within corporations, the impact will ultimately be felt in lower return on investment, whereas in public sector contexts, there will be negative impacts on public policy outcomes, such as unjustified discrimination against particular population segments.

When big data analytics are inappropriately applied to population inferencing and profile-construction, the harm that can arise includes resource misallocation and unjustified discrimination against population segments. When profiles generated by big data analytics are applied, in order to generate suspects, the result is an obscure and perhaps counter-intuitive "predetermined characterisation or model of infraction" (Marx & Reichman 1984, p.429), based on a 'probabilistic cause' rather than a 'probable cause' (Bollier 2010, pp.33-34). This not only results in unjustified impositions on individuals, but also denies them natural justice in the sense of a right to a fair hearing.

When big data analytics are inappropriately applied to inferencing about individuals, the costs that arise are borne by individual people, sometimes in the form of inconvenience but sometimes with financial or psychological dimensions. This is exacerbated by the lack of transparency relating to data and decision criteria, which result in mysterious and often undefendable accusations.

5. Conclusions

The already-substantial literature about big data is remarkably lacking in discussion of the issues and the impacts outlined in this paper. Moreover, the literature on appropriate business processes for acquiring and consolidating big data, and applying big data analytics, is highly scattered, and not greatly evident in the big data literature itself. Such process guidance as exists merely addresses internal consistency checks, overlooking intrinsic and contextual data quality, and omitting controls and audit (e.g. Guo 2013). Given that there are problems as well as opportunities, should authors be recommending measures that will enable the good to be achieved and the bad to be avoided?

Alternatively, can computing disciplines and professions wash their hands of these issues, secure in the belief that the responsibility lies elsewhere? Management disciplines study such matters in the abstract, and managers and executives take responsibility for decisions in the real world. So it is arguable that they are the ones who are obligated to concern themselves with quality assurance, risk assessment, and risk management.

The answer is to be found in the fact that computing associations impose responsibilities on academics and professionals, through Codes of Conduct. Table 2 identifies multiple formal obligations in the ACM Code of Ethics and Professional Conduct (ACM 1992), which make abundantly clear that these are not other people's problems. They are ours.

Table 2: Relevant Extracts from the ACM Code

How realistic are the scenarios in the Sidebar? How reliable are the arguments advanced in the text about data quality and decision quality, and their impact on inferences, on actions based on them, and on outcomes? To the extent that they are realistic and reliable, the buoyant atmosphere surrounding big data needs to be tempered. Nuclear physicists, still more so nuclear engineers, cannot avoid responsibility in relation to the impacts of their work. The same applies to computer scientists, still more so computing professionals.


References

ACM (1992) 'ACM Code of Ethics and Professional Conduct' Communications of the Association for Computing Machinery, October 1992

Bollier D. (2010) 'The Promise and Peril of Big Data' The Aspen Institute, 2010

boyd D. & Crawford K. (2012) 'Critical Questions for Big Data' Information, Communication & Society 15, 5 (June 2012) 662-679, DOI: 10.1080/1369118X.2012.678878

Clarke R. (1991) 'A Contingency Approach to the Software Generations' Database 22, 3 (Summer 1991) 23-34

Clarke R. (1994) 'The Digital Persona and its Application to Data Surveillance' The Information Society 10, 2 (June 1994) 77-92

Clarke R. (2014) 'Promise Unfulfilled: The Digital Persona Concept, Two Decades Later' Information Technology & People 27, 2 (June 2014) 182-207

DHHS (2012) 'Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule' Department of Health & Human Services, November 2012

Dreyfus H.L. (1992) 'What Computers Still Can't Do: A Critique of Artificial Reason' MIT Press, 1992

Guo P. (2013) 'Data Science Workflow: Overview and Challenges' ACM Blog, 30 October 2013

Jagadish H.V., Gehrke J., Labrinidis A., Papakonstantinou Y., Patel J.M., Ramakrishnan R. & Shahabi C. (2014) 'Big data and its technical challenges' Communications of the ACM 57, 7 (July 2014) 86-94

Marx G.T. & Reichman N. (1984) 'Routinising the Discovery of Secrets' Am. Behav. Scientist 27, 4 (Mar/Apr 1984) 423-452

Roszak T. (1986) 'The Cult of Information' Pantheon, 1986

UKICO (2012) 'Anonymisation: managing data protection risk: code of practice' Information Commissioner's Office, November 2012

Wang R.Y. & Strong D.M. (1996) 'Beyond Accuracy: What Data Quality Means to Data Consumers' Journal of Management Information Systems 12, 4 (Spring 1996) 5-33


Addendum: Scenario Inspirations

Elements of the scenarios are readily found throughout the formal literatures and media reports. However, the immediate inspiration for each of the scenarios was as follows:

(1) Precipitation Events

This is based on personal experience analysing rainfall data relating to the author's conservation property, combined with the climate science debates of the last decade.

See, for example, the boldness of this paper:

Reek T., Doty S.R. & Owen T.W. (1992) 'A Deterministic Approach to the Validation of Historical Daily Temperature and Precipitation Data from the Cooperative Network' Bull. Amer. Meteor. Soc. 73, 6 (June 1992) 753-762

(2) Creditworthiness

Occasional media reports, including this one:

Cuomo C. et al. (2009) ''GMA' Gets Answers: Some Credit Card Companies Financially Profiling Customers' ABC News, 28 January 2009

(3) Ad Targeting

This is a straightforward extrapolation from the large volume of material published in recent years about 'behavioural targeting', 'web analytics' and 'interest-based targeting'.

See, for example:

O'Reilly (2012) 'Big Data Now' O'Reilly Media, 2012, pp. 56-59

Park S.-H., Huh S.-Y., Oh W. & Han A.P. (2012) 'A Social Network-Based Inference Model for Validating Customer Profile Data' MIS Quarterly 36, 4 (December 2012) 1217-1237

Chen J. & Stallaert J. (2014) 'An Economic Analysis of Online Advertising Using Behavioral Targeting' MIS Quarterly 38, 2 (June 2014) 429-449

(4) Foster Parenting

Correspondence from a colleague about findings from data analytics reported by a government agency.

(5) Cancer Treatment

The first sentence is taken directly from a passage on p. 14 of a well-known book, which is much loved by reviewers, e.g.

Mayer-Schonberger V. & Cukier K. (2013) 'Big Data: A Revolution That Will Transform How We Live, Work and Think' John Murray, 2013

The notion of research funding agencies shifting significant amounts across to digital research was reinforced by the announcement of Australia's second and third Big Data Research Centres during the same month that the scenario was drafted.

(6) Fraud Detection

This is an extrapolation from existing approaches in stock exchange contexts, overlaid with 'big data' concepts.

(7) Insider Detection

In September 2014, the Australian Attorney-General announced upgraded security assessments for all Australian government staff.

AGD (2014) 'Australian Government Personnel Security Core Policy' Attorney-General's Dept, 2 September 2014

Coyne A. (2014) 'Brandis boosts vetting of APS staff to prevent insider threats' itNews, 2 September 2014

Author Affiliations

Roger Clarke is Principal of Xamax Consultancy Pty Ltd, Canberra. He is also a Visiting Professor in the Cyberspace Law & Policy Centre at the University of N.S.W., and a Visiting Professor in the Research School of Computer Science at the Australian National University.

Created: 28 August 2014 - Last Amended: 20 September 2014 by Roger Clarke