Roger Clarke's 'Big Data, Big Risks'

Roger Clarke's Web-Site

© Xamax Consultancy Pty Ltd, 1995-2024

Big Data, Big Risks

Published in Information Systems Journal 26, 1 (January 2016) 77-90

PrePrint of 30 July 2015

Roger Clarke **

Available under an AEShareNet licence or a Creative Commons licence.

This document is at http://www.rogerclarke.com/EC/BDBR.html

The prior version is at http://www.rogerclarke.com/EC/BDBR-140920.html

The 'big data' literature, academic as well as professional, has a very strong focus on opportunities. Far less attention has been paid to the threats that arise from re-purposing data, consolidating data from multiple sources, applying analytical tools to the resulting collections, drawing inferences, and acting on them. On the basis of a review of quality factors in 'big data' and 'big data analytics', illustrated by means of scenario analysis, this paper draws attention to the moral and legal responsibility of computing researchers and professionals to temper their excitement, and apply reality checks to their promotional activities.

1. Introduction
2. Big Data Scenarios
3. Big Data Quality
4. The Quality of Big Data Analytics
5. Impacts and Their Management
6. Conclusions
References
Supporting Materials

1. Introduction

As sensor technologies mature, and as individuals are encouraged to contribute data into organisations' databases, more transactions are being captured than ever before. Meanwhile, improvements in data-storage technologies have resulted in the cost of evaluating, selecting and destroying old data being now considerably higher than that of simply letting it accumulate. The glut of stored data has greatly increased the opportunities for data to be inter-related, and analysed. The moderate enthusiasm engendered by 'data warehousing' and 'data mining' in the 1990s has been replaced by unbridled euphoria about 'big data' and 'data analytics'. What could possibly go wrong?

The characteristics of big data are commonly depicted as 'volume, velocity and variety' (Laney 2001), leading to the widespread working definition of "data that's too big, too fast or too hard for existing tools to process" (which appears to have originated as a garbling of Jacobs 2009, p. 44). Subsequently, a few commentators have added 'value'. Occasional mention is made of a fifth characteristic, 'veracity', first mentioned in Schroeck et al. (2012). The discussion here brings focus to bear on that aspect. It also draws attention to the need to distinguish the notion of 'a very large data-set' from a consolidation of two or more data-sets from different sources into a single collection, whether of a physical or virtual nature.

A strong form of the big data claim is that "massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. ... [F]aced with massive data, [the old] approach to science -- hypothesize, model, test -- is becoming obsolete. ... Petabytes allow us to say: 'Correlation is enough'" (Anderson 2008). The level of enthusiasm within business schools is scarcely less effusive (LaValle et al. 2012, McAfee & Brynjolfsson 2012, Mayer-Schonberger & Cukier 2013). Whereas observers of the information technology industry have expressed concern, little disquiet is evident within the computer science and information systems literatures. See, however, Jacobs (2009), Oboler et al. (2012), Buhl & Heidemann (2013) and Wigan & Clarke (2013).

This paper addresses the question of what risks arise from inadequate attention to quality factors in big data and big data analytics. It does so by using quasi-empirical scenarios to illustrate a review of data quality and decision quality factors in the big data context.

2. Big Data Scenarios

Scenarios are plausible story-lines. They commence with real-life situations, which are then extended with elements that are hypothetical or speculative, but that are rooted in real-world processes. Background to their use in this context is provided in Clarke (2015). The scenarios used here are not case studies of specific instances of big data at work. Instead, each is generalised, in order to encompass a range of issues not all of which are likely to arise in any particular real-life application. The intention is not to pretend to be a substitute for deep case studies of actual experience, but rather to test the assumptions underlying the big data value-proposition.

Several scenarios are presented, in order to capture some of the contextual diversity within which big data ideas are being applied. Each scenario is introduced at the point in the analysis that it illustrates. Reflecting the patterns evident in the field, most of the scenarios involve the consolidation of data from multiple sources, with only one of them dealing explicitly with analysis of a single large transaction database. The inspiration for each of the scenarios is explained in a supplementary document.

3. Big Data Quality

This section briefly outlines key aspects of the accumulated knowledge about data quality. A series of scenarios is interspersed through the theoretical material. The quality issues are of course evident with all data, but the intention here is to show how the risks are exacerbated in the context of big data. Key sources used in compiling the review include OECD (1980), Huh et al. (1990), van der Pijl (1994), Wand & Wang (1996), Wang & Strong (1996), Shanks & Darke (1998), Shanks & Corbitt (1999), Rusbridge et al. (2005), English (2006) and Piprani & Ernst (2008). An ISO 8000 series of international data quality standards has been slowly emergent for some time (Benson 2009), but its value is limited due to its naive model of data and information.

Table 1 identifies the primary quality factors, separated into two categories. The 'data quality' factors are capable of being assessed at the time the data is collected. The 'information quality' factors, on the other hand, are not assessable until the data is used. The quality factors provide a framework within which the analysis can be conducted.

Table 1: Quality Factors

Data Quality Factors
- D1 Syntactical Validity
  Conformance of the data with the domain on which the data-item is defined
- D2 Appropriate (Id)entity Association
  A high level of confidence that the data is associated with the particular real-world identity or entity whose attribute(s) it is intended to represent
- D3 Appropriate Attribute Association
  The absence of ambiguity about which real-world attribute(s) the data is intended to represent
- D4 Appropriate Attribute Signification
  The absence of ambiguity about the state of the particular real-world attribute(s) that the data is intended to represent
- D5 Accuracy
  A high degree of correspondence of the data with the real-world phenomenon that it is intended to represent, typically measured by a confidence interval, such as '+/-1 degree Celsius'
- D6 Precision
  The level of detail at which the data is captured, reflecting the domain on which valid contents for that data-item are defined, such as 'whole numbers of degrees Celsius'
- D7 Temporal Applicability
  The absence of ambiguity about the date and time when, or the period of time during which, the data represents or represented a real-world phenomenon. This is important in the case of volatile data-items such as total rainfall for the last 12 months, marital status, fitness for work, age, and the period during which an income-figure was earned or a licence was applicable
Information Quality Factors
- I1 Theoretical Relevance
  A demonstrable capability of the data-item to make a difference to the decision-making process in which the data is to be used
- I2 Practical Relevance
  A demonstrable capability of the data-item's content to make a difference to the decision-making process in which the data is to be used
- I3 Currency
  The absence of a material lag between a real-world occurrence and the recording of the corresponding data
- I4 Completeness
  The availability of sufficient contextual information that the data is not liable to be misinterpreted
- I5 Controls
  The application of business processes that ensure that the data quality and information quality factors have been considered prior to the data's use
- I6 Auditability
  The availability of metadata that evidences the data quality and information quality factors

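By way of illustration, the following minimal Python sketch shows how a few of the data quality factors in Table 1 might be made explicit alongside a stored value. The record layout, the function name and the permitted temperature range are assumptions made for this example only; they are not drawn from the sources reviewed above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemperatureReading:
    """Hypothetical record that keeps quality metadata next to the value."""
    value_celsius: int        # D6 Precision: whole numbers of degrees Celsius
    accuracy_celsius: float   # D5 Accuracy: 1.0 means +/-1 degree Celsius
    observed_on: date         # D7 Temporal Applicability


def is_syntactically_valid(raw: str, low: int = -90, high: int = 60) -> bool:
    """D1 Syntactical Validity: is the raw text a whole number within the domain?

    The bounds are illustrative assumptions, not a standard.
    """
    text = raw.strip()
    digits = text[1:] if text[:1] in "+-" else text
    if not (digits.isascii() and digits.isdigit()):
        return False
    return low <= int(text) <= high


reading = TemperatureReading(value_celsius=21, accuracy_celsius=1.0,
                             observed_on=date(2015, 7, 30))
print(is_syntactically_valid("21"))    # True
print(is_syntactically_valid("21.5"))  # False: finer than the defined precision
print(is_syntactically_valid("999"))   # False: outside the domain
```

The check addresses D1 only. A value of '21' passes even if the actual temperature was 35 degrees, which is why validity cannot stand in for accuracy.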
Data Quality items D2, D3 and D4 are key contributors to the meaning of the data. Meaning is capable of definition at the time that the data is gathered. Since the decline of the resource-intensive waterfall method of software development, however, it is much less common for a data dictionary to be even established, let alone maintained. As a result, data definitions may be unclear, ambiguous and even implicit. The lack of clarity about the original meaning increases the likelihood that the meaning will change over time, and that different and even mutually inconsistent usages of the same data-item will eventuate. At the time of use, the more or less clear definition is overlaid by the perspectives and interpretations of the data's user(s).

In Scenario (1) Fraud Detection, introduced below, suspicious inconsistencies can arise not only from attempts to deceive but also from semantic issues even within a single database, let alone within a consolidation of multiple, inherently incompatible databases.

Scenario (1) - Fraud Detection

A company that has large sums of money flushing through its hands is under pressure from regulators, knows that stock exchanges run real-time fraud detection schemes, and accepts at face value the upbeat claims made by the proponents of big data analytics. It combines fraud-detection heuristics with inferences drawn from its large transaction database, and generates suspects. It assigns its own limited internal investigation resources to these suspect cases, and refers some of them to law enforcement agencies.

The large majority of the cases investigated internally are found to be spurious. Little is heard back from law enforcement agencies. Some of the suspects discover that they are being investigated, and threaten to take their business elsewhere and to initiate defamation actions. The investigators return to their tried-and-true methods of locating and prioritising suspicious cases.

The purpose of collection and the value-judgements of the sponsor determine the choice of what data is collected, and against what scales, and what trade-offs are made between data quality and collection costs. Purpose also underlies decisions about what metadata is collected at the same time, and what data is not collected at all. These constraints result in inevitable compromise to all of the data quality factors, yet this is frequently overlooked in the case of large data-sets, and is all-but submerged when multiple data-sets are consolidated. Administrative actions in relation to fraud, and particularly prosecutions, are undermined by poor-quality evidence. In some contexts, commercial liability might arise. For an example, see Scenario (2), below. In others, a duty of care may be breached. See Scenario (3).

Scenario (2) - Creditworthiness

A financial services provider combines its transactions database, its records of chargebacks arising from fraudulent transactions, and government statistics regarding the geographical distribution of income and wealth. It draws inferences about the risks that its cardholders create for the company. It uses those inferences in its decision-making about individual customers, including credit-limits and the issue of replacement and upgraded cards.

Although not publicised by the company, this gradually becomes widely known, and results in negative media comments and recriminations on social media. Questions are raised about whether it conflicts with 'redlining' provisions in various laws. Discrimination against individuals based on the behaviour of other customers of merchants that they use is argued to be at least immoral, and possibly illegal, but certainly illogical from an individual consumer's perspective. The lender reviews the harm done to its reputation.

Scenario (3) - Foster Parenting

A government agency responsible for social welfare programs consolidates data from foster-care and unemployment benefits databases, and discovers a correlation between having multiple foster parents and later being chronically unemployed. On the basis of this correlation, it draws the inference that the longstanding practice of moving children along a chain of foster-parents should be discontinued. It accordingly issues new policy directives to its case managers.

Because such processes lack transparency, and foster-children are young and largely without a voice, the new policy remains 'under the radar' for some time. Massive resistance then builds from social welfare NGOs, as it becomes apparent that children are being forced to stay with foster-parents with whom they are fundamentally incompatible, and that accusations of abuse are being downplayed because of the forcefulness of the policy directions based on mysterious 'big data analytics'.

Particularly where data is collected frequently over time, the act of collection may involve data compression, through sampling, filtering and averaging. These actions affect Data Quality factor 5 (Accuracy) and Information Quality factor 4 (Completeness). Where interesting outliers are being sought, compression is likely to ensure that the potentially most relevant data is absent from the collection. Problems of these kinds arise in Scenario (4) Precipitation Events, described below.
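The effect of averaging on outliers can be sketched in a few lines of Python. The rainfall figures and the `compress` helper are invented for illustration, and averaging over fixed windows is only one of the compression methods mentioned above.

```python
from statistics import mean

# Hypothetical hourly rainfall (mm) for one day; one hour holds a cloudburst.
hourly_mm = [0.0] * 24
hourly_mm[15] = 48.0


def compress(readings, window):
    """Average consecutive readings in fixed-size windows."""
    return [mean(readings[i:i + window]) for i in range(0, len(readings), window)]


six_hourly = compress(hourly_mm, 6)
daily = compress(hourly_mm, 24)

print(max(hourly_mm))   # 48.0
print(max(six_hourly))  # 8.0
print(max(daily))       # 2.0
```

The daily total is unchanged, but the extreme event that an analyst seeking outliers would want is no longer recoverable from the compressed series.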

Scenario (4) - Precipitation Events

Historical rainfall data is acquired from many sources, across an extended period, and across a range of geographical locations. The collectors, some of them professionals but most of them amateurs, used highly diverse collection methods and frequencies, with little calibration and few controls. The data is consolidated into a single collection. A considerable amount of data manipulation is necessary, including the interpolation of data for empty cells, and the arbitrary disaggregation of long-period data into the desirable shorter periods. An attempt to conduct a quality audit against such sources as contemporaneous newspaper reports proves to be too expensive, and is curtailed.

Analytical techniques are applied to the data. Confident conclusions are reached about historical fluctuations and long-term trends. Climate-change sceptics point to the gross inadequacies in the database, and argue that climate-change proponents, in conducting their crusade, have played fast and loose with scientific principles.

Data Quality factor 2 requires that data have a reliable association with a particular real-world identity or entity. The data held about any particular (id)entity represents its digital persona (Clarke 1994, 2014a). The reliability of the association between a real-world phenomenon and a data shadow depends on the attributes that are selected as identifiers, and the process of association has error-factors. In some circumstances the link between the digital persona and the underlying entity is challenging (pseudonymity), and in some cases no link can be achieved (anonymity). Indeed, in order to protect important interests and comply with relevant laws, it may be necessary for any link to be broken (de-identification / anonymisation - UKICO 2012, DHHS 2012). On the other hand, rich data-sets are vulnerable to re-identification procedures (Sweeney 2002, Acquisti & Gross 2009, Ohm 2010). These problems afflict all big data collections that are intended to assist in the management of long-term relationships. The problems are compounded by the expropriation of data to support purposes extraneous to the original context of use, such as longitudinal research studies.

Reliable identification processes are difficult enough in individual systems. The challenges multiply when data from multiple sources are combined, particularly where the identifiers used by the underlying systems are not the same. The risks are particularly serious where the data is sensitive. Big social data is one such context, as illustrated by Scenario (5), introduced below. Serious consequences also arise with big health data.
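The difficulty of linking records that share no common identifier can be sketched as follows. The two sources, their field names and the matching rule are all hypothetical; real linkage procedures are more elaborate, but face the same trade-off between missed matches and false matches.

```python
from collections import defaultdict

# Two hypothetical sources with no identifier in common.
bank = [
    {"customer_no": "B1", "name": "J. Smith", "born": 1980},
    {"customer_no": "B2", "name": "John Smith", "born": 1980},
]
welfare = [
    {"claimant_id": "W7", "name": "John  SMITH", "born": 1980},
    {"claimant_id": "W9", "name": "Jon Smith", "born": 1980},
]


def match_key(record):
    """Crude linkage key: first initial, surname and year of birth."""
    parts = record["name"].replace(".", "").lower().split()
    return (parts[0][0], parts[-1], record["born"])


def link(left, right):
    """Pair every left record with every right record sharing its key."""
    index = defaultdict(list)
    for rec in right:
        index[match_key(rec)].append(rec)
    return [(a, b) for a in left for b in index[match_key(a)]]


pairs = link(bank, welfare)
print(len(pairs))  # 4
```

Each of the two bank records is linked to both welfare records, so at least two of the four links must be wrong if each record refers to one person. A stricter key would reduce the false links, at the cost of missing genuine ones such as 'Jon' versus 'John'.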

Scenario (5) - Ad Targeting

A social media service-provider accumulates a vast amount of social transaction data, and some economic transaction data, through activity on its own sites and those of strategic partners. It applies complex data analytics techniques to this data to infer attributes of individual digital personae. It projects third-party ads and its own promotional materials based on the inferred attributes of online identities and the characteristics of the material being projected.

The 'brute force' nature of the data consolidation and analysis means that no account is taken of the incidence of partial identities, conflated identities, obfuscated identities, and imaginary, fanciful, falsified and fraudulent profiles. This results in mis-placement of a significant proportion of ads, to the detriment mostly of advertisers, but to some extent also of individual consumers. It is challenging to conduct audits of ad-targeting effectiveness, and hence advertisers remain unaware of the low quality of the data and of the inferences. This approach to business is undermined by inappropriate content appearing on children's screens, and gambling and alcohol ads seen by partners in the browser-windows of nominally reformed gamblers and drinkers.

The big data movement commonly features the use of data for purposes extraneous to its original purpose. The many data quality issues identified above are exacerbated by the loss of context (Information Quality factor 4, Completeness), including lack of clarity about the trade-offs applied at the time of collection. The absence of this information greatly increases the likelihood of misinterpretation. The big data movement commonly also involves the further step of physically or virtually consolidating data from multiple sources. This depends on linkages among data-items whose semantics and syntactics are different, perhaps substantially, perhaps subtly. The scope for misunderstandings and misinterpretation multiplies. Retrospective studies across long periods, as in Scenarios (1) Fraud Detection, (3) Foster Parenting and (4) Precipitation Events, face many or all of these problems.

Deficiencies exist in most data-sets. Over time, however, many further data integrity problems accumulate. A common problem is the loss of metadata - such as the scale against which data was originally collected, the definition at the time of collection, the data's provenance, and any supporting evidence for the data's quality, as well as loss of contextual information including undocumented changes in meaning over time. These undermine Information Quality factors 4 (Completeness), 5 (Controls) and 6 (Auditability), and greatly increase the likelihood of inappropriate interpretation.

To address perceived data integrity shortfalls, analysts devise and deploy data scrubbing, cleansing or cleaning processes (Rahm & Do 2000, Müller & Freytag 2003). A few such processes use an external, authoritative reference-point, such as a database of recognised location-names and street-addresses. Most, however, lack any external referent, and are merely based on 'logical data quality', i.e. internal consistency within the consolidated data-sets (e.g. Jagadish et al. 2014). As a result, the notion of 'cleanliness' primarily relates to the ease with which analytical tools can be applied to the data, rather than to the data itself.
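The difference between cleansing for internal consistency and cleansing against an external referent can be sketched as follows. The place-names, the gazetteer and both functions are invented for this illustration, and the prefix rule is a deliberately crude stand-in for the consistency heuristics used in practice.

```python
from collections import Counter

# Hypothetical place-name column from a consolidated data-set.
names = ["Springfield", "Springfield", "Springfield", "Springvale"]

# Hypothetical authoritative gazetteer, i.e. an external referent.
GAZETTEER = {"Springfield", "Springvale"}


def clean_internally(values, prefix_len=6):
    """Replace each value with the most frequent value sharing its prefix.

    This enforces internal consistency only; nothing external is consulted.
    """
    by_prefix = {}
    for v in values:
        by_prefix.setdefault(v[:prefix_len], Counter())[v] += 1
    return [by_prefix[v[:prefix_len]].most_common(1)[0][0] for v in values]


def clean_with_referent(values, reference):
    """Keep values that the reference recognises; mark the rest None for review."""
    return [v if v in reference else None for v in values]


print(clean_internally(names))
print(clean_with_referent(names, GAZETTEER))
```

The internally 'cleaned' column is more uniform and hence easier to analyse, but a correct value has been overwritten. The check against the referent leaves all four values intact.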

A special category of challenge is what to do about missing data. It is common to interpolate a value, such as the population mean or mode. This may be a reasonable approach, particularly where the variability of data-item values is low and the analysis is focussed on population characteristics. In other circumstances, however, it may be seriously inappropriate, e.g. where the data-item has highly variable values, or the analysis is focussed on individual (id)entities or on outliers. Scenario (4) Precipitation Events illustrates this problem.
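A minimal sketch of mean interpolation, using invented readings, shows both sides of the argument. The variable names are assumptions made for this example.

```python
from statistics import mean, pstdev

# Hypothetical readings; None marks a missing value, 40.0 is a genuine outlier.
observed = [2.0, 3.0, None, 2.5, None, 40.0, 2.5]

known = [v for v in observed if v is not None]
fill = mean(known)
imputed = [fill if v is None else v for v in observed]

print(fill)                             # 10.0
print(mean(imputed) == mean(known))     # True: the population mean is preserved
print(pstdev(imputed) < pstdev(known))  # True: variability is understated
```

The population mean survives, which suits analysis of population characteristics. Each filled-in value of 10.0, however, resembles neither the typical readings nor the outlier, so any inference about the individual entities concerned rests on data that was never observed.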

Many individual data-sets lack the necessary quality to be reliably analysed for new insights, and a great many consolidated collections lack integrity, and can be reasonably depicted as being a mere melange. Data-sets that are of uncertain original quality, and of lower current quality, and that have uncertain associations with real-world entities, are combined, and may be modified, using means that are unclear, and that are unaudited and possibly unauditable, in order to achieve consolidated digital personae that have uncertain validity, and that have uncertain relationships with any particular real-world entity. To this melange, powerful analytical tools are then applied.

4. The Quality of Big Data Analytics

There is a variety of ways in which analytical tools can be applied to 'big data'. The analyses may test hypotheses, which may be predictions from theory, correlations arising in other data-sets, existing heuristics, or hunches. Inferences may be drawn about the digital personae, which may be further inferred to apply to the population of entities that the data purports to represent, or to segments of that population. Profiles may be constructed for categories of entities that are of specific interest, such as heavy weather incidents, wellness recipes, risk-prone card-holders, or welfare cheats. Outliers of many different kinds can be identified. Inferences can be drawn about individual entities, directly from a particular digital persona, or from apparent inconsistencies within a consolidated digital persona, or through comparison of a digital persona against the population of similar entities, or against a previously-defined profile, or against a profile created through analysis of that particular big data collection.

It is feasible for big data analytics to be put to use as a decision system. This may be done formally, by, for example, automatically sending infringement or 'show cause' notices. However, it is also possible for big data analytics to become a decision system not through a conscious decision by an organisation, but by default. This can arise where a decision-maker becomes lazy, or is replaced by a less experienced person who is not in as good a position to check the reasonableness of the inferences drawn by software.

Where decisions are made by analytics, or inferences arising from analytics are highly influential in decision-making perhaps to the point of being a default decision that a human has to actively overrule, a number of concerns arise about decision-quality. Has the decision reflected the scale, the accuracy and the precision of the data that was instrumental in leading to the decision? Was the inferencing mechanism that was used really applicable to those categories of data? Did the data mean what the inferencing mechanism implicitly treated it as meaning? To the extent that data was consolidated from multiple sources, were those sources compatible with one another in respect of the data's scale, accuracy, precision, meaning and integrity?

Alternatively, rather than being used as a form of automated decision-making, big data analytics techniques can instead be put to use as a decision support system, with a human decision-maker evaluating the inferences before applying them to a real-world purpose. However, the person may have great difficulty grasping the details of the data's provenance, quality, meaning and relevance, of the analytical technique's nature, pre-requisites, characteristics and constraints, and of the rationale, which have together given rise to the recommendation.

Each item of data, when it is gathered, represents a measurement against a scale. Some data arises from measurement against a ratio scale, and is capable of being subjected to analysis by powerful statistical tools. All too frequently, however, data collected on cardinal, and even on mere ordinal scales (such as Likert-scale data) is blithely assumed to be on a ratio scale, in order to justify the application of statistical inferencing. Meanwhile, a great deal of data is collected against nominal scales (including text and images), which support only weak analytical tools, fuzzy matching and fuzzy logic. A further challenge arises where data that has been measured against different kinds of scale is consolidated and analysed. The applicability of the available analytical tools requires careful consideration. A particularly serious challenge exists in the case of mixed-scale data, whose conjoint analysis is more of a murky art than a precise science. These issues all arise in the context of individual data-sets, but are exacerbated where multiple data-sets are consolidated. Examples of these issues are present in Scenario (6) Insider Detection, presented below.
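The risk in treating ordinal data as though it were on a ratio scale can be sketched with invented Likert responses. Both codings below preserve the order of the response levels, so an ordinal scale offers no basis for preferring one over the other; the groups, codings and function name are assumptions made for this example.

```python
from statistics import mean, median

LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Two order-preserving numeric codings of the same ordinal scale.
EQUAL = dict(zip(LEVELS, [1, 2, 3, 4, 5]))
STRETCHED = dict(zip(LEVELS, [1, 2, 3, 4, 10]))

group_a = ["neutral"] * 5
group_b = ["strongly disagree"] * 3 + ["strongly agree"] * 2


def summarise(group, coding):
    """Return (mean, median) of the coded responses."""
    scores = [coding[response] for response in group]
    return mean(scores), median(scores)


for label, coding in (("equal", EQUAL), ("stretched", STRETCHED)):
    print(label, summarise(group_a, coding), summarise(group_b, coding))
```

Under the first coding group A has the higher mean, and under the second group B does, whereas the median ranks A above B in both cases. A conclusion that depends on the arbitrary choice of coding is not supported by the data.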

Scenario (6) - Insider Detection

A government agency receives terse instructions from the government to get out ahead of the whistleblower menace, with Brutus, Judas Iscariot, Macbeth, Manning and Snowden invoked as examples of trusted insiders who turned. The agency increases the intrusiveness and frequency of employee vetting, and lowers the threshold at which positive vetting is undertaken. To increase the pool of available information, the agency exercises its powers to gain access to border movements, credit history, court records, law enforcement agencies' persons-of-interest lists, and financial tracking alerts. It applies big data analytics to a consolidated database comprising not only those sources, but also all internal communications, and all postings to social media gathered by a specialist external services corporation.

The primary effect of these measures is to further reduce employee loyalty to the organisation. To the extent that productivity is measurable, it sags. The false positives arising from data analytics explode, because of the leap in negative sentiments expressed on internal networks and in social media, and in the vituperative language that the postings contain. The false positives greatly increase the size of the haystack, making the presumed needles even harder to find. The poisonous atmosphere increases the opportunities for a vindictive insider to obfuscate their activities and even to find willing collaborators. Eventually cool heads prevail, by pointing out how few individuals ever actually leak information without authority. The wave of over-reaction slowly subsides, leaving a bruised and dissatisfied workforce with a bad taste in its mouth.

A common feature of these circumstances is that decisions are being made about complex real-world phenomena, which therefore need to be represented by models that embody requisite variety, including explicit recognition of confounding, intervening and missing variables. In practice, however, the models that are applied in big data analytical work may be unduly simple, and even merely implicit. A related concern is that correlations may be, and commonly are, of low grade, yet may nonetheless be treated, perhaps implicitly, as though the relationships were causal, and causal in one direction rather than the other. These issues arise in several of the Scenarios, but particularly (3) Foster Parenting, above, and (7) Cancer Treatment, presented below.

Scenario (7) - Cancer Treatment

Millions of electronic medical records reveal that cancer sufferers who take a certain combination of aspirin and orange juice see their disease go into remission. Research funding agencies are excited by this development, and transfer resources to 'big health data analytics' and away from traditional systemic research into causes, pathways and treatments of disease. Pharmaceutical companies follow the trend by purchasing homeopathic suppliers and patenting herb genes. The number of doctoral and post-doctoral positions available in medical science drops sharply.

After 5 years, enough data has become available for the conclusion to be reached that the health treatments 'recommended' by these methods are ineffectual. A latter-day prophet emerges who decries 'the flight from reason', fashion shifts back to laboratory rather than digital research, and medical researchers slowly regain their previous high standing. The loss of momentum is estimated to have delayed progress by 10-15 years and generated a shortage of trained medical scientists.

Where decisions have real-world impacts, it is vital that there be transparency and auditability of the decision-process and the decision-criteria. And of course the principle of natural justice requires that the decisions be subject to appeal processes and review. In Scenario (6) Insider Detection, the consequences are potentially very serious for an individual falsely accused of disloyalty. Scenario (7) Cancer Treatment, meanwhile, is a particularly concerning example of naive correlation as a substitute for understanding.
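How a confounding variable can manufacture a correlation of the kind imagined in Scenario (7) can be shown with a small worked example. All counts are invented, and 'stage of disease' is merely one hypothetical confounder.

```python
# (stage, takes_combination) -> (patients, remissions); all counts are invented.
counts = {
    ("early", True): (80, 64),
    ("early", False): (20, 16),
    ("late", True): (20, 4),
    ("late", False): (80, 16),
}


def remission_rate(takes, stage=None):
    """Remission rate for takers or non-takers, optionally within one stage."""
    rows = [v for (s, t), v in counts.items() if t == takes and stage in (None, s)]
    patients = sum(p for p, _ in rows)
    remissions = sum(r for _, r in rows)
    return remissions / patients


print(remission_rate(True), remission_rate(False))                    # 0.68 0.32
print(remission_rate(True, "early"), remission_rate(False, "early"))  # 0.8 0.8
print(remission_rate(True, "late"), remission_rate(False, "late"))    # 0.2 0.2
```

Across the whole collection, those taking the combination appear to do far better. Within each stage the rates are identical: the apparent effect arises solely because, in this invented data, early-stage patients are more likely to take the combination. A model that omits the confounder reports a relationship that does not exist.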

Transparency and auditability are dependent on the decision-maker and others being able to appreciate the rationale underlying the 'recommendation' made by the analytical procedure. With what were once called 'third-generation' development tools, the rationale was evident in the form of an algorithm or procedure, which may have been explicitly documented externally to the software, but was at least extractable by any person who had access to the source-code and who had the capacity to read it. The fourth generation of development tools changed little in this regard, because it merely expressed the decision-model in a more generally-applicable manner.

The fifth generation adopted a different approach, however (Clarke 1991). Rather than a model of the decision, it involves a model of the problem-domain, commonly expressed in logic, rules or frames. It becomes much more difficult to understand how the model applies to particular circumstances. Although explanations (in such forms as 'which rules fired' lists) are feasible, they appear to be seldom made available.

With sixth-generation tools, an even greater barrier to understanding arises. A neural network embodies no formal model of a decision or even of a problem-domain. There is just a pile of data, to which mathematical processes are applied, giving rise to a set of inscrutable weightings, which are then applied to each new instance. There have been expressions of concern from many quarters about the delegation of decision-making to software whose behaviour is fundamentally unauditable (e.g. Roszak 1986, Dreyfus 1992, boyd & Crawford 2012, Clarke 2014b).

5. Impacts and Their Management

Given the uncertain quality of data and of decision processes, many inferences from 'big data' are currently being accorded greater credibility than they actually warrant. Inevitably, resources will be misallocated. Within corporations, the impact will ultimately be felt in lower return on investment, whereas in public sector contexts there will be negative impacts on public policy outcomes.

Where big data analytics are inappropriately applied to population inferencing and profile-construction, the harm that can arise includes not only resource misallocation but also unjustified discrimination for and against population segments.

When profiles generated by big data analytics are applied in order to generate suspects, on the other hand, the result is an obscure "predetermined characterisation or model of infraction" (Marx & Reichman 1984, p. 429), based on a 'probabilistic cause' rather than a 'probable cause' (Bollier 2010, pp. 33-34). This results in unjustified impositions because the costs are borne by individual people, perhaps just in the form of inconvenience but sometimes with financial or psychological dimensions. The lack of transparency relating to data and decision criteria results in mysterious and often undefendable accusations, which may lead to the unjust deprivation of rights. This represents a denial of natural justice and, in countries that are signatories to international conventions, a breach of the human right to information about "the nature and cause of the charge" (ICCPR 1966, at 14.3).
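Why profile-based suspect generation tends to produce mostly spurious suspects, as in Scenarios (1) and (6), follows from base-rate arithmetic. The sketch below applies Bayes' rule to invented figures; the prevalence, detection rate and false-positive rate are assumptions chosen only to show the shape of the problem.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Proportion of flagged cases that are genuine, by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Invented figures: 1 person in 10,000 is a genuine case; the profile flags
# 90% of genuine cases and wrongly flags 2% of everyone else.
ppv = positive_predictive_value(1 / 10_000, 0.90, 0.02)
print(f"{ppv:.2%} of flagged individuals are genuine cases")
```

On these assumptions, fewer than 1 in 200 of the people flagged is a genuine case, even though the profile would conventionally be described as highly accurate. Every other flagged individual bears the costs described above.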

The already-substantial literature about big data is remarkably lacking in discussion of the issues and the impacts outlined in this paper. Worse still, many papers that touch on these topics fail to reflect the accumulated understanding of data and decision quality. In an earlier phase, the notion of 'data quality mining' was essentially reduced to mere internal consistency within the data collection (Hipp et al. 2001, Luebbers et al. 2003, Brizan & Tansel 2006). Even balanced assessments of big data, such as Bollier (2010), fail to address the issues. Of seven articles in a Special Section of a leading information systems journal in December 2012 (MISQ 36, 4), not only was there no comprehensive treatment of quality factors, but there was almost no mention of such issues. The very limited guidance that exists relating to appropriate process merely addresses internal consistency checks, overlooking the data quality and information quality factors in Table 1, and omitting controls and audit (e.g. Guo 2013). Similarly, Marchand & Peppard (2013) propose five process guidelines, but omit any meaningful consideration of processes to assure quality of the data and of the decision processes to be applied to it.

Buhl & Heidemann's Editorial in Business and Information Systems Engineering (2013, pp. 66-67) included this clear statement: "Big Data's success is inevitably linked to ... clear rules regarding data quality. ... High quality data requires data to be consistent regarding time ..., content ..., meaning ..., and data that allow for unique identifiability ..., as well as being complete, comprehensible, and reliable". That call to arms has to date been almost entirely ignored. A survey undertaken in mid-2015 of over a score of Calls for Papers for Big Data Conferences identified no more than fleeting mentions of the topic, with more than half of the Calls completely overlooking it.

6. Conclusions

Given that big data harbours big problems as well as big opportunities, should authors be recommending measures that will enable the bad to be avoided, and the good achieved? Alternatively, can computing disciplines and professions wash their hands of these issues, secure in the belief that the responsibility lies elsewhere? Management disciplines study such matters in the abstract, and managers and executives take responsibility for decisions in the real world. So it is arguable that they are the ones who are obligated to concern themselves with quality assurance, risk assessment, and risk management.

Professional associations vary considerably in the extent to which they impose responsibilities on their members. For example, the Association for Information Systems (AIS) imposes obligations in relation to the conduct of research, but not to the impacts of its members' work (AIS 2015). The IEEE requires its members to agree "to accept responsibility in making decisions consistent with the safety, health, and welfare of the public, and to disclose promptly factors that might endanger the public or the environment" (IEEE 2006). The British Computer Society lists "Public Interest" first, and requires its members to "have due regard for public health, privacy, security and wellbeing of others and the environment" (BCS 2011). The Australian Computer Society declares "The Primacy of the Public Interest" including "matters of public health, safety and the environment" (ACS 2014). A stronger form is to be found in the ACM Code of Ethics and Professional Conduct (ACM 1992). The extracts in Table 2 make abundantly clear that these are not other people's problems, but are responsibilities of computing and information systems professionals and academics.

Table 2: Relevant Extracts from the ACM Code

1. Contribute to society and human well-being

When designing or implementing systems, computing professionals must attempt to ensure that the products of their efforts ... will avoid harmful effects to health and welfare (1.1)
One way to avoid unintentional harm is to carefully consider potential impacts on all those affected by decisions made during design and implementation (1.2)
Respect the privacy of others. This imperative implies ... that personal information gathered for a specific purpose not be used for other purposes without consent of the individual(s) (1.7)

2. Avoid harm to others

Any signs of danger from systems must be reported to those who have opportunity and/or responsibility to resolve them (2.5)
Computing professionals have a responsibility to share technical knowledge with the public by encouraging understanding of computing, including the impacts of computer systems and their limitations. This imperative implies an obligation to counter any false views related to computing (2.7)

3. Be honest and trustworthy

Computer professionals who are in decision making positions should verify that systems are designed and implemented to protect personal privacy and enhance personal dignity (3.5)

Codes of Conduct, and ethics more generally, seldom have volitional force. Their value is in providing a basis for evaluating behaviour. Computer science and information systems academics and professionals have direct responsibility in relation to the technical aspects of data collection, storage and access. Their responsibility in relation to data analysis is shared with other disciplines and professions; but not to the extent that our responsibility is extinguished. The human aspects inherent in big data risks can, on the other hand, only be addressed through risk assessment and risk management, by ensuring that business process design incorporates safeguards, compliance audits, and enforcement activities. Once again, computer science and information systems bear some of the responsibility to ensure that problems are identified, publicised, and addressed. We have moral responsibility, potentially translated by the courts into legal responsibility, to blow the whistle on hyperbole, and on undue reliance on technology.

How realistic are the scenarios presented in this paper? How reliable are the arguments advanced in the text about data quality and decision quality, and their impact on inferences, on actions based on them, and on outcomes? To the extent that they are realistic and reliable, the buoyant atmosphere surrounding big data needs to be tempered. Nuclear physicists and nuclear engineers alike cannot avoid responsibility in relation to the impacts of their work. The same applies to computer scientists and information systems academics, and to computing and information systems professionals. Our discipline and our profession are being culpably timid.

References

ACM (1992) 'ACM Code of Ethics and Professional Conduct' Communications of the Association for Computing Machinery, October 1992, at http://www.acm.org/about/code-of-ethics

Acquisti A. & Gross R. (2009) 'Predicting Social Security Numbers from Public Data' Proc. National Academy of Sciences 106, 27 (2009) 10975-10980

ACS (2014) 'Code of Professional Conduct' Australian Computer Society, April 2014, at https://acs.org.au/__data/assets/pdf_file/0014/4901/Code-of-Professional-Conduct_v2.1.pdf

AIS (2015) 'Code of Research Conduct' Association for Information Systems, 4 March 2015, at http://c.ymcdn.com/sites/ais.site-ym.com/resource/resmgr/Admin_Bulletin/AIS_Code_of_Research_Conduct.pdf

Anderson C. (2008) 'The End of Theory: The Data Deluge Makes the Scientific Method Obsolete' Wired Magazine 16:07, 23 June 2008, at http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory

BCS (2011) 'Code of Conduct' British Computer Society, 8 June 2011, at http://www.bcs.org/content/conMediaFile/393

Benson P.R. (2009) 'ISO 8000 Data Quality - The Fundamentals Part 1' Real-World Decision Support (RWDS) Journal 3, 4 (November 2009), at http://www.ewsolutions.com/resource-center/rwds_folder/rwds-archives/issue.2009-10-12.0790666855/document.2009-10-12.3367922336

Bollier D. (2010) 'The Promise and Peril of Big Data' The Aspen Institute, 2010, at http://www.ilmresource.com/collateral/analyst-reports/10334-ar-promise-peril-of-big-data.pdf

boyd D. & Crawford K. (2012) 'Critical Questions for Big Data' Information, Communication & Society, 15, 5 (June 2012) 662-679, DOI: 10.1080/1369118X.2012.678878, at http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878#.U_0X7kaLA4M

Brizan D.G. & Tansel A.U. (2006) 'A Survey of Entity Resolution and Record Linkage Methodologies' Communications of the IIMA 6, 3 (2006), at http://www.iima.org/CIIMA/8%20CIIMA%206-3%2041-50%20%20Brizan.pdf

Buhl H.U. & Heidemann J. (2013) 'Big Data: A Fashionable Topic with(out) Sustainable Relevance for Research and Practice?' Editorial, Business & Information Systems Engineering 2 (2013) 65-69, at http://www.bise-journal-archive.org/pdf/01_editorial_36315.pdf

Clarke R. (1991) 'A Contingency Approach to the Software Generations' Database 22, 3 (Summer 1991) 23-34, PrePrint at http://www.rogerclarke.com/SOS/SwareGenns.html

Clarke R. (1994) 'The Digital Persona and its Application to Data Surveillance' The Information Society 10,2 (June 1994) 77-92, PrePrint at http://www.rogerclarke.com/DV/DigPersona.html

Clarke R. (2014a) 'Promise Unfulfilled: The Digital Persona Concept, Two Decades Later' Information Technology & People 27, 2 (Jun 2014) 182-207, PrePrint at http://www.rogerclarke.com/ID/DP12.html

Clarke R. (2014b) 'What Drones Inherit from Their Ancestors' Computer Law & Security Review 30, 3 (June 2014) 247-262, PrePrint at http://www.rogerclarke.com/SOS/Drones-I.html

Clarke R. (2015) 'Big Data Quality: An Investigation using Quasi-Empirical Scenario Analysis' Proc. Bled eConference, 9 June 2015, PrePrint at http://www.rogerclarke.com/EC/BDSA.html

DHHS (2012) 'Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule' Department of Health & Human Services, November 2012, at http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html

Dreyfus H.L. (1992) 'What Computers Still Can't Do: A Critique of Artificial Reason' MIT Press, 1992

English L.P. (2006) 'To a High IQ! Information Content Quality: Assessing the Quality of the Information Product' IDQ Newsletter 2, 3, July 2006, at http://iaidq.org/publications/doc2/english-2006-07.shtml

Guo P. (2013) 'Data Science Workflow: Overview and Challenges' ACM Blog, 30 October 2013, at http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

Hipp J., Guntzer U. & Grimmer U. (2001) 'Data Quality Mining - Making a Virtue of Necessity' Proc. 6th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2001), pp. 52-57, at http://www.cs.cornell.edu/Johannes/papers/dmkd2001-papers/p5_hipp.pdf

Huh Y.U., Keller F.R., Redman T.C. & Watkins A.R. (1990) 'Data Quality' Information and Software Technology 32, 8 (1990) 559-565

ICCPR (1966) 'International Covenant on Civil and Political Rights' United Nations General Assembly, 16 December 1966, at http://www.ohchr.org/en/professionalinterest/pages/ccpr.aspx

IEEE (2006) 'Code of Ethics' IEEE, 2006, at http://www.ieee.org/about/corporate/governance/p7-8.html

Jacobs A. (2009) 'The Pathologies of Big Data' Communications of the ACM 52, 8 (August 2009) 36-44

Jagadish H.V., Gehrke J., Labrinidis A., Papakonstantinou Y., Patel J.M., Ramakrishnan R. & Shahabi C. (2014) 'Big data and its technical challenges' Communications of the ACM 57, 7 (July 2014) 86-94

Laney D. (2001) '3D Data Management: Controlling Data Volume, Velocity and Variety' Meta-Group, February 2001, at http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

LaValle S., Lesser E., Shockley R., Hopkins M.S. & Kruschwitz N. (2011) 'Big Data, Analytics and the Path From Insights to Value' Sloan Management Review (Winter 2011 Research Feature), 21 December 2010, at http://sloanreview.mit.edu/article/big-data-analytics-and-the-path-from-insights-to-value/

Luebbers D., Grimmer U. & Jarke M. (2003) 'Systematic Development of Data Mining-Based Data Quality Tools' Proc. 29th VLDB Conference, Berlin, Germany, 2003, at http://www.vldb.org/conf/2003/papers/S17P02.pdf

McAfee A. & Brynjolfsson E. (2012) 'Big Data: The Management Revolution' Harvard Business Review (October 2012) 61-68

Marchand D.A. & Peppard J. (2013) 'Why IT Fumbles Analytics' Harvard Business Review 91,4 (Mar-Apr 2013) 104-112

Marx G.T. & Reichman N. (1984) 'Routinising the Discovery of Secrets' Am. Behav. Scientist 27,4 (Mar/Apr 1984) 423-452

Mayer-Schonberger V. & Cukier K. (2013) 'Big Data: A Revolution That Will Transform How We Live, Work and Think' John Murray, 2013

Müller H. & Freytag J.-C. (2003) 'Problems, Methods and Challenges in Comprehensive Data Cleansing' Technical Report HUB-IB-164, Humboldt-Universität zu Berlin, Institut für Informatik, 2003, at http://www.informatik.uni-jena.de/dbis/lehre/ss2005/sem_dwh/lit/MuFr03.pdf

Oboler A., Welsh K. & Cruz L. (2012) 'The danger of big data: Social media as computational social science' First Monday 17, 7 (2 July 2012), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3993/3269

OECD (1980) 'Guidelines on the Protection of Privacy and Transborder Flows of Personal Data' OECD, Paris, 1980, mirrored at http://www.rogerclarke.com/DV/OECDPs.html

Ohm P. (2010) 'Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization' 57 UCLA Law Review 1701 (2010) 1701-1711, at http://www.patents.gov.il/NR/rdonlyres/E1685C34-19FF-47F0-B460-9D3DC9D89103/26389/UCLAOhmFailureofAnonymity5763.pdf

van der Pijl G. (1994) 'Measuring the strategic dimensions of the quality of information' Journal of Strategic Information Systems 3, 3 (1994) 179-190

Piprani B. & Ernst D. (2008) 'A Model for Data Quality Assessment' Proc. OTM Workshops (5333) 2008, pp 750-759

Rahm E. & Do H.H. (2000) 'Data cleaning: Problems and current approaches' IEEE Data Eng. Bull., 2000, at http://dc-pubs.dbs.uni-leipzig.de/files/Rahm2000DataCleaningProblemsand.pdf

Roszak T. (1986) 'The Cult of Information' Pantheon 1986

Rusbridge C., Burnhill P., Ross S., Buneman P., Giaretta D., Lyon L. & Atkinson M. (2005) 'The Digital Curation Centre: A Vision for Digital Curation' Proc. Conf. From Local to Global: Data Interoperability--Challenges and Technologies, Sardinia, 2005, pp. 1-11, at http://eprints.erpanet.org/archive/00000082/01/DCC_Vision.pdf

Schroeck M., Shockley R., Smart J., Romero-Morales D. & Tufano P. (2012) 'Analytics: The real world use of big data' IBM Institute for Business Value / Saïd Business School at the University of Oxford, October 2012, at http://www.ibm.com/smarterplanet/global/files/se__sv_se__intelligence__Analytics_-_The_real-world_use_of_big_data.pdf

Shanks G. & Corbitt B. (1999) 'Understanding Data Quality: Social and Cultural Aspects' Proc. 10th Australasian Conf. on Info. Syst., 1999

Shanks G. & Darke P. (1998) 'Understanding Data Quality in a Data Warehouse' The Australian Computer Journal 30 (1998) 122-128

Sweeney L. (2002) 'k-anonymity: a model for protecting privacy' International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10, 5 (2002) 557-570, at http://arbor.ee.ntu.edu.tw/archive/ppdm/Anonymity/SweeneyKA02.pdf

UKICO (2012) 'Anonymisation: managing data protection risk: code of practice' Information Commissioners Office, November 2012, at http://ico.org.uk/for_organisations/data_protection/topic_guides/~/media/documents/library/Data_Protection/Practical_application/anonymisation-codev2.pdf

Wand Y. & Wang R.Y. (1996) 'Anchoring Data Quality Dimensions in Ontological Foundations' Commun. ACM 39, 11 (November 1996) 86-95

Wang R.Y. & Strong D.M. (1996) 'Beyond Accuracy: What Data Quality Means to Data Consumers' Journal of Management Information Systems 12, 4 (Spring, 1996) 5-33

Wigan M.R. & Clarke R. (2013) 'Big Data's Big Unintended Consequences' IEEE Computer 46, 6 (June 2013) 46-53, PrePrint at http://www.rogerclarke.com/DV/BigData-1303.html

Author Affiliations

Roger Clarke is Principal of Xamax Consultancy Pty Ltd, Canberra. He is also a Visiting Professor in the Cyberspace Law & Policy Centre at the University of N.S.W., and a Visiting Professor in the Research School of Computer Science at the Australian National University.

Acknowledgements

This paper has benefited significantly from comments by the Section Editor and Editor, which resulted in re-structuring of the material and clarification of the argument.

Created: 28 August 2014 - Last Amended: 6 January 2016 by Roger Clarke