Roger Clarke's 'Big Data Quality Assurance'

Roger Clarke's Web-Site

© Xamax Consultancy Pty Ltd, 1995-2024

Big Data Quality Assurance

Version of 10 November 2015

For presentation at an international conference
organized by the Australian Centre for Cyber Security (ACCS)
at the Australian Defence Force Academy, Canberra - 16 November 2015

Roger Clarke **

Available under an AEShareNet licence or a Creative Commons licence.

This document is at http://www.rogerclarke.com/EC/BDQA.html

Here is the slide-set used to accompany the presentation

The quality of inferences drawn from data, big or small, is heavily dependent on the quality of the data and the quality of the processes applied to it. As big data starts to emerge from the laboratory, a framework is needed to enable quality assurance of security applications of data analytics. This paper outlines the challenges, and draws attention to the consequences of misconceived and misapplied projects. It presents key aspects of the necessary risk assessment and risk management approaches, and suggests opportunities for research.

1. Introduction
2. Challenges
3. Consequences
4. Strategies
5. Research Opportunities
6. Conclusions
References

1. Introduction

A variety of big data sources are available within the security area, including network traffic, open source intelligence, social media postings and, if the datafication of things and of people comes about, streams of data from eObjects (Manwaring & Clarke 2015). Although these sources may share the features of velocity, volume and variety, they diverge considerably in their other characteristics. The inadequacy of the original, cute triple-V 'definition' for the big data field is progressively giving way to a degree of realism, with various authors adding such features as Value and Veracity into the mix.

Big data promises a lot. But if fulfilling the promise were easy, there would be little need for research, nor for symposia. On the one hand, innovation is at risk of being stifled if accountants, auditors and lawyers begin exercising control too early in the process. On the other hand, there are advantages in ensuring that the scope of the research conducted into big data methods and applications is sufficiently broad.

This paper outlines the quality issues that need to be understood and managed if big data is to fulfil its promise. It highlights the consequences if quality issues are not addressed, and identifies some strategies and areas for research.

2. Challenges

This section considers the range of factors that give rise to risk in big data analytics. First, a number of categories of use are distinguished. Next, a set of data quality factors is identified, together with issues that arise from consolidation of multiple data sets and from data scrubbing activities. Finally, quality concerns arise in relation to data analytics processes.

2.1 Use Categories

As an initial step, it is important to distinguish several categories of use of big data, because they have materially different risk-profiles. As depicted in Table 1, the first distinction is between uses that focus on populations and sub-populations, on the one hand, and those that are concerned with individual instances within populations, on the other.

One purpose for which big data appears to be frequently used is the testing of hypotheses. A more exciting and hence more common depiction of big data, as for data mining before it, is the discovery of characteristics and relationships that have not previously been appreciated. Related to this, but different from it, is the construction of profiles for sub-populations based on statistical consistencies. Other uses of big data focus on individual instances. One example is the search for outliers, and needles in haystacks. Another is the search for 'instances of interest' such as anomalies, and those that have a match to a previously computed abstract profile.

Table 1: Functions of Big Data Analytics

After Clarke (2014)

Population Focus
- Hypothesis Testing
  Evaluation of propositions to see whether they are supported by the available data. The propositions may be predictions from theory, existing heuristics, or hunches
- Population Inferencing
  The drawing of inferences about the entire population of (id)entities, or about sub-populations. In particular, correlations may be drawn among particular attributes
- Profile Construction
  The identification of key characteristics of some category of (id)entities. (For example, attributes and behaviours of spear-phishing attacks or of 'drug mules' may exhibit statistical consistencies)

Individual Focus
- Outlier Discovery
  Statistical outliers are commonly disregarded, but this approach regards them instead as valuable needles in large haystacks, because they may herald a 'flex-point' or 'quantum shift'. (In Bollier 2010, p.18, the notion is attributed to Joichi Ito)
- Inferencing about Individual Instances
  The drawing of inferences about individual entities within the population. In particular, a search can be conducted for instances that exhibit patterns associated with a particular, previously computed profile, or with a profile generated from the data-set, producing a set of suspect (id)entities. Another example involves a person being inferred to have provided inconsistent information to two organisations, or to have exhibited behaviour in one context inconsistent with behaviour in another

2.2 Data and Data Quality

It's first necessary to reflect on the nature of data. Empirical data purports to represent a real-world phenomenon, but there are significant limitations on the extent to which the representation is effective. For example, only some data is collected, and only in respect of some aspects of the phenomenon. It may be compressed or otherwise pre-processed. Assuring the quality of data collection processes can be expensive, and hence the quality at time of collection may be low. For example, the frequency of calibration of measuring devices may be less than is desirable, and large volumes of data may need to be subject to sampling techniques.

A second fundamental matter is the reliability of the association of data with a particular real-world entity or identity. Great reliance is placed on (id)entifiers of various kinds, but these too are just particular instances of data, and hence suffer from the same quality issues. In most cases, they are also highly amenable to obfuscation and falsification.

For data to exhibit sufficient quality to support reliable decision-making, it needs to satisfy a range of requirements. Table 2 provides a checklist, drawing on literature such as Wang & Strong (1996), Shanks & Darke (1998) and Piprani & Ernst (2008), as summarised in Clarke (2016). The first group of quality factors, which can be assessed at the time of collection, include syntactical validity, the appropriateness of the association of the data with an (id)entity, and with a particular attribute of that (id)entity, together with the degree of accuracy and precision, and the data's temporal applicability.

The second group, termed in Table 2 'information quality factors', can only be judged at the time of use. They include the relevance of the data, its currency, its completeness, and its dependability. Judgements about the suitability of data, and about the impact of quality factors on decision-making, are dependent on the creation, maintenance and accessibility of metadata. The quality of data generally falls over time, most significantly because of changes in context. These issues are considered in greater depth in Clarke (2014).

Table 2: Quality Factors

Data Quality Factors
- Syntactical Validity
  Conformance of the data content with the domain on which the data-item is defined
- Appropriate (Id)entity Association
  A high level of confidence that the data-item is associated with the particular real-world identity or entity whose attribute(s) it is intended to represent
- Appropriate Attribute Association
  The absence of ambiguity about which real-world attribute(s) the data-item is intended to represent
- Appropriate Attribute Signification
  The absence of ambiguity about the state of the particular real-world attribute(s) that the data content is intended to represent
- Accuracy
  A high degree of correspondence of the data content with the real-world phenomenon that it is intended to represent, typically measured by a confidence interval, such as '+/-1 degree Celsius'
- Precision
  The level of detail at which the data content is captured, reflecting the domain on which valid contents for that data-item are defined, such as 'whole numbers of degrees Celsius'
- Temporal Applicability
  The absence of ambiguity about the date and time when, or the period of time during which, the data content represents or represented a real-world phenomenon. This is important in the case of volatile data-items such as total rainfall for the last 12 months, marital status, fitness for work, age, and the period during which an income-figure was earned or a licence was applicable
Information Quality Factors
- Theoretical Relevance
  A demonstrable capability of the data-item to make a difference to the decision-making process in which it is to be used
- Practical Relevance
  A demonstrable capability of the data content to make a difference to the decision-making process in which it is to be used
- Currency
  The absence of a material lag between a real-world occurrence and the recording of the corresponding data content
- Completeness
  The availability of sufficient contextual information that the data content is not liable to be misinterpreted
- Controls
  The application of business processes that ensure that the data quality and information quality factors have been considered prior to the data's use
- Auditability
  The availability of metadata that evidences the data quality and information quality factors
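
Several of the factors in Table 2 can be tested mechanically, provided that the domain of the data-item and its metadata have been recorded. The following is a minimal Python sketch under stated assumptions, not a definitive implementation. The `Reading` record, its field names, the temperature domain and the one-hour lag threshold are all hypothetical, chosen to echo the 'degrees Celsius' examples in the table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Reading:
    """Hypothetical data-item: a temperature reading plus its metadata."""
    entity_id: Optional[str]          # (id)entity the reading is associated with
    value: str                        # raw content as captured
    observed_at: Optional[datetime]   # when the phenomenon occurred
    recorded_at: Optional[datetime]   # when the content was recorded


def check_reading(r: Reading, max_lag: timedelta = timedelta(hours=1)) -> list[str]:
    """Return the names of the Table 2 factors that the reading fails."""
    failures = []
    # Syntactical validity: the assumed domain is whole degrees Celsius in [-90, 60].
    try:
        if not -90 <= int(r.value) <= 60:
            failures.append("syntactical validity")
    except ValueError:
        failures.append("syntactical validity")
    # This only detects a missing identifier; it cannot show that the
    # association with the real-world (id)entity is appropriate.
    if not r.entity_id:
        failures.append("(id)entity association")
    if r.observed_at is None:
        failures.append("temporal applicability")
    elif r.recorded_at is None or r.recorded_at - r.observed_at > max_lag:
        failures.append("currency")
    return failures
```

Checks of this kind address only the factors that can be judged from the data and its metadata. Relevance and completeness still have to be assessed at the time of use, against the decision at hand.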

2.3 The Consolidation of Data Collections

The preceding section outlined quality issues that arise within a single data collection. In those circumstances, it may be reasonable to expect that quality standards are defined, documented and understood. On the other hand, those assumptions may not hold when data is drawn from multiple sources.

When data-sets are consolidated, rules are applied to achieve inter-relation between digital identities that exist within different frames of reference. Care is needed to specify precise mappings between data-items across the two or more data-sets. However, there are circumstances in which this cannot be reliably achieved. Even where reliable matching or merger is feasible, it is all-too-common for limited attention to be paid to the risks of differing item-definitions, differing data-domain definitions and differing dates of applicability.
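
The three risks named above can be made explicit as a pre-consolidation check on the metadata of each pair of data-items that is to be mapped. The Python sketch below is illustrative only. The `ItemMetadata` structure and its fields are assumptions, and it presumes that such metadata exists and is trustworthy, which is frequently not the case.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ItemMetadata:
    """Hypothetical metadata describing one data-item in one data-set."""
    definition: str      # what the item is intended to represent
    domain: frozenset    # the permitted values
    valid_from: date     # start of the period of applicability
    valid_to: date       # end of the period of applicability


def consolidation_risks(a: ItemMetadata, b: ItemMetadata) -> list[str]:
    """List the reasons why two data-items should not be treated as equivalent."""
    risks = []
    if a.definition != b.definition:
        risks.append("differing item-definitions")
    if a.domain != b.domain:
        risks.append("differing data-domain definitions")
    # The periods of applicability are compatible only if they overlap.
    if a.valid_to < b.valid_from or b.valid_to < a.valid_from:
        risks.append("differing dates of applicability")
    return risks
```

An empty result does not establish that a mapping is safe. It indicates only that none of these three documented incompatibilities was detected.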

2.4 Data Scrubbing

Processes are applied to data-sets in order to address data quality issues. The terms 'cleansing' and 'cleaning' have come into vogue (Rahm & Do 2000, Müller & Freytag 2003), but the original notion of 'scrubbing' is more apt, because the extent to which cleanliness is actually achievable is highly varied and success is far from assured. Some of the deficiencies that data scrubbing seeks to address are missing values, syntactical errors in data content, syntactical differences among apparently comparable data-items, low quality at time of capture, degraded quality during storage and missing metadata.

The standard of the literature in this area is a cause for considerable concern. In principle, it is necessary to inspect and adapt data on the basis of some external authority. An example of this process is spell-checking of street and suburb-names, checks of the existence of street-numbers, and post-code substitution and interpolation. Unfortunately, authoritative, external sources are available for only a small proportion of data-items. Almost all of the literature deals with internal checks within data-sets and rule-based checks (e.g. Jagadish et al. 2014). While such approaches are likely to deliver some improvements, they also inevitably result in spurious interpolations and corrections.

Even more disturbing is the limited extent to which the literature discusses quality audits of the outcomes of data scrubbing processes. Proponents of the techniques blithely assume that a set of transformation rules that are derived at least in part, and often to a considerable extent, from the data-sets that are being manipulated, reliably improve the correspondence between the data and the real world phenomena that the data is presumed to represent. Reliance should not be placed on changes made by data scrubbing operations, unless and until the assumptions and the results have been subjected to reality tests.
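
One form that such a reality test could take is a comparison of values before and after scrubbing against a sample for which an authoritative external source is available. The Python sketch below is a hedged illustration. The function name, the dictionary-based record layout and the four outcome labels are assumptions, not an established audit method.

```python
def audit_scrubbing(before: dict, after: dict, reference: dict) -> dict:
    """Compare scrubbed values with an authoritative sample.

    Each argument maps a record key to a value. `reference` covers only
    those records for which an external authority is available.
    """
    counts = {"corrected": 0, "corrupted": 0, "still_wrong": 0, "unchanged_ok": 0}
    for key, truth in reference.items():
        was_ok = before[key] == truth
        is_ok = after[key] == truth
        if is_ok and not was_ok:
            counts["corrected"] += 1
        elif was_ok and not is_ok:
            counts["corrupted"] += 1    # a spurious 'correction'
        elif not is_ok:
            counts["still_wrong"] += 1
        else:
            counts["unchanged_ok"] += 1
    return counts
```

The `corrupted` count is the figure that internal consistency checks cannot reveal: values that were right before scrubbing and wrong afterwards.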

2.5 Decision Quality

A further set of issues arises in relation to big data analytics, and to the decisions that may be made based on the inferences drawn. Quite fundamentally, assurance is needed that the techniques applied to the data are appropriate, given the nature of the data. A common problem is the blind application of powerful statistical tools. Many of these assume that all data is on a ratio scale, whereas some or even all of the data may be only on cardinal, ordinal or merely nominal scales. Mixed-mode data is particularly challenging to analyse. Guidance in relation to the applicability of the various approaches to different categories of data can be difficult to find. The need for reflection on the risks involved is all-too-easily overlooked.
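
A simple guard against the blind application of statistical tools is to record the scale of each data-item and to refuse statistics that are not meaningful on that scale. The Python sketch below uses the conventional mapping from Stevens' typology of scales, which the paper does not cite. Treating the 'cardinal' scale mentioned above as an interval scale is an assumption of this sketch, and the function name is hypothetical.

```python
import statistics

# Statistics conventionally regarded as meaningful on each scale.
# Assumption: 'cardinal' is treated here as an interval scale.
PERMISSIBLE = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "cardinal": {"mode", "median", "mean", "stdev"},
    "ratio": {"mode", "median", "mean", "stdev", "geometric_mean"},
}


def summarise(values, scale, statistic):
    """Compute a summary statistic only if it is meaningful for the scale."""
    if statistic not in PERMISSIBLE[scale]:
        raise ValueError(f"{statistic} is not meaningful for {scale} data")
    # Each permitted name is a function in the standard statistics module.
    return getattr(statistics, statistic)(values)
```

For example, `summarise([1, 2, 2, 3, 5], "ordinal", "median")` returns 2, whereas requesting the mean of the same ordinal codes raises an error rather than silently producing a number.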

Uncertainties can arise in relation to the 'meaning' of data, at the syntactic and semantic levels, and in some contexts at the pragmatic level as well. Transparency in relation to the data relied upon, and the reasoning applied to it, can be very limited. This is particularly the case where the technique does not involve a humanly-understandable rational basis, such as rule-based domain models, and especially neural nets and machine learning.

3. Consequences

The conventional security model (Clarke 2015b) provides a basis for understanding the consequences of inadequate quality in big data and big data analytics. Risk assessment techniques commence by understanding stakeholders, the assets that they place value on, and the harm that can be done to those values. That lays the foundation for examining the threats and vulnerabilities that could give rise to harm, and the existing patterns of safeguards against harm arising.

Whereas there is an emergent literature that considers the potential impacts on individuals (e.g. boyd & Crawford 2012, Wigan & Clarke 2013), there is a shortage of papers looking at negative organisational impacts. From a data security perspective, the effects of big data activities are to create more copies, and to consolidate them. This creates honeypots, which attract attackers, and some attacks succeed. From a quality perspective, on the other hand, errors in data quality, information quality and analytics quality have negative impacts on return on investment and/or on public policy outcomes. There may also be opportunity costs, to the extent that resources are diverted to big data projects that, with perfect hindsight, could have been invested in alternative activities with higher return.

One of the scenarios presented in Clarke (2016) commences like this:

A government agency receives terse instructions from the government to get out ahead of the whistleblower menace, with Brutus, Judas Iscariot, Macbeth, Manning and Snowden invoked as examples of trusted insiders who turned. The agency increases the intrusiveness and frequency of employee vetting, and lowers the threshold at which positive vetting is undertaken. To increase the pool of available information, the agency exercises its powers to gain access to border movements, credit history, court records, law enforcement agencies' persons-of-interest lists, and financial tracking alerts. It applies big data analytics to a consolidated database comprising not only those sources, but also all internal communications, and all postings to social media gathered by a specialist external services corporation.

There is of course a chance that any untrustworthy insider might be discovered in such ways. There is a far higher chance, however, that the outliers that are discovered will be subjected to unjustified suspicion and discrimination. Moreover, the breach of trust in relation to all employees and contractors inevitably undermines morale within an organisation.
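
The imbalance between those two chances follows from the low base rate of genuine cases, and can be illustrated with Bayes' rule. In the Python sketch below, every figure is hypothetical and chosen purely for illustration; none is drawn from the scenario or from any real vetting programme.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability that a flagged individual is a genuine case (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical figures: 1 untrustworthy insider per 1,000 staff, an analytic
# that flags 90% of them, and that also flags 5% of everyone else.
ppv = positive_predictive_value(prevalence=0.001,
                                sensitivity=0.9,
                                false_positive_rate=0.05)
```

Under these assumptions, `ppv` is a little under 0.018. In a workforce of 10,000 that corresponds to about 9 genuine cases among roughly 500 people flagged, so the great majority of those placed under suspicion would be there unjustifiably.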

There is also the possibility that harm could come to the organisation from external quarters. Civil and criminal actions may seem unlikely, but public disquiet about intrusive data-handling and data analytics, amplified by media coverage, may be damaging to the reputation of the organisation and its stakeholders.

4. Strategies

While big data projects continue to be regarded as experimental, they may enjoy freedom from corporate governance constraints. There might nonetheless be good reasons for applying some tests to big data analytics activities, even while they are still inside the laboratory. One reason is that some inferences may escape, and be adopted uncritically by a senior executive or an operational manager, with unpredictable effects. Another is that, as laboratory techniques and results begin to be released into the enterprise, they will be checked by hard-bitten realists who are sceptical about tools that they don't understand.

One approach that is likely to deliver value is early recognition of quality issues, and awareness and training activities to sensitise participants to the risks arising from them. A more substantial approach involves formal risk assessment in advance of deployment, and the implementation of appropriate data quality, information quality and decision quality assurance safeguards.

Post-controls are also at least advisable, and in many contexts are essential. For example, organisational filters can be applied, to ensure that inferences are not blindly converted into action without reality checks being applied first. In addition, audits can be performed after early uses of each technique in order to assess its impact in the real world. Table 3 offers some more fully articulated suggestions.

Table 3: A Framework for Big Data Quality Assurance

Reproduced from Clarke (2014)

Incorporate applications of 'big data' within the organisation's risk assessment and risk management framework
Incorporate applications of 'big data' within the organisation's data quality assurance framework
Ensure that the organisation's data quality framework addresses the data and information quality factors identified in Table 2
Ensure that data collections are not consolidated unless:
- they satisfy threshold data quality tests
- their purposes, their quality and the meanings of relevant data-items are compatible
- relevant legal, moral and public policy constraints are respected
Ensure that, where sensitive data is involved, particularly personal data, anonymisation techniques are applied, such that the data is not re-identifiable
Ensure that, where data scrubbing operations are undertaken:
- they are undertaken within the context of the organisation's data quality assurance framework
- they involve external reference-points, and are not limited to internal consistency checks
- their accuracy and effectiveness are audited, particularly where they are based on internal consistency checking
- the results are not used for decision-making unless the audits demonstrate that the results satisfy threshold data quality tests
Ensure that inferencing mechanisms are not relied upon to make decisions, unless the applicability of those mechanisms in respect of the data in question has been subjected to independent review and they have been found to be suitable
Ensure that, when 'big data' is applied to decision-making:
- the criteria of relevance, meaning, and transparency of decision mechanisms are all satisfied
- the results are audited, including by testing against known instances
- the outcomes are subjected to post-implementation assessment, including through transparency arrangements and complaints mechanisms
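
The requirement that results be audited 'by testing against known instances' can be sketched as a comparison of the set of flagged (id)entities with instances whose true status has been independently established. The Python example below is a minimal illustration. The function names and the thresholds are assumptions, and precision and recall are only one possible choice of audit measure.

```python
def audit_against_known_instances(flagged: set, known_positive: set,
                                  known_negative: set) -> dict:
    """Score an inferencing run against instances whose status is known."""
    hits = flagged & known_positive
    false_alarms = flagged & known_negative
    judged = len(hits) + len(false_alarms)
    return {
        # Share of the known genuine instances that were flagged.
        "recall": len(hits) / len(known_positive) if known_positive else None,
        # Share of the flagged, known-status instances that were genuine.
        "precision": len(hits) / judged if judged else None,
    }


def passes_threshold(scores: dict, min_recall: float, min_precision: float) -> bool:
    """Gate on audit results; thresholds are set by the organisation."""
    if scores["recall"] is None or scores["precision"] is None:
        return False    # no known instances means the audit is inconclusive
    return scores["recall"] >= min_recall and scores["precision"] >= min_precision
```

Flagged instances whose status is unknown are deliberately excluded from both measures, because the audit can say nothing about them.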

5. Research Opportunities

The outline of quality factors provided above suggests that research is needed not only into how data can be manipulated in order to provide new insights, but also into how the risks arising from data manipulation can be managed. Table 4 lists research topics that have emerged during the analysis undertaken above.

Table 4: Research Opportunities

Indicators and Contra-Indicators for particular data analytic techniques
Scenario Analyses
Case Studies
Data Scrubbing against external reference-points
Quality Audit Techniques for data scrubbing and for inferencing
Transparency Mechanisms for rule-based, neural-net and machine-learning analytics
Integration into QA, TRA and SRMP processes
Cognitive Load Management incl. anomaly definition, filtering, clustering and prioritisation

An important research technique in leading/bleeding-edge situations is the case study. Whereas a set of vignettes or scenarios can provide a degree of insight (Clarke 2015a), case studies can deliver in-depth understanding. Another area of opportunity is in methods research, where studies can be undertaken of the ways to embed data and process quality into risk assessment and risk management frameworks.

In circumstances in which considerable volumes of output will be generated - such as anomaly detection in computer networks - experiments can be conducted on alternative ways of presenting the data-glut and avoiding cognitive overload. Hence another potentially fruitful area is the analysis of categories, in order to refine the notion of 'anomaly', and to prioritise and cluster the anomalies that are detected.

6. Conclusions

Big data offers promise in a variety of security contexts. But risks need to be recognised, and managed. This requires the rediscovery of existing knowledge about data, information and process quality, and efforts to devise effective and efficient safeguards against harmful misapplication of data and analytic techniques. The undertaking needs to be progressively brought within organisations' existing risk assessment and risk management frameworks. Research programs need to encompass quality analysis and assurance.

References

Bollier D. (2010) 'The Promise and Peril of Big Data' The Aspen Institute, 2010, at http://www.ilmresource.com/collateral/analyst-reports/10334-ar-promise-peril-of-big-data.pdf

boyd D. & Crawford K. (2012) 'Critical Questions for Big Data' Information, Communication & Society, 15, 5 (June 2012) 662-679, DOI: 10.1080/1369118X.2012.678878, at http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878#.U_0X7kaLA4M

Clarke R. (2014) 'Quality Factors in Big Data and Big Data Analytics' Working Paper, Xamax Consultancy Pty Ltd, September 2014, at http://www.rogerclarke.com/EC/BDQF.html

Clarke R. (2015a) 'Quasi-Empirical Scenario Analysis and Its Application to Big Data Quality' Proc. 28th Bled eConference, Slovenia, June 2015, at http://www.rogerclarke.com/EC/BDSA.html

Clarke R. (2015b) 'The Prospects of Easier Security for SMEs and Consumers' Computer Law & Security Review 31, 4 (August 2015) 538-552, at http://www.rogerclarke.com/EC/SSACS.html

Clarke R. (2016) 'Big Data, Big Risks' Forthcoming, Information Systems Journal, January 2016, at http://www.rogerclarke.com/EC/BDBR.html

Jagadish H.V., Gehrke J., Labrinidis A., Papakonstantinou Y., Patel J.M., Ramakrishnan R. & Shahabi C. (2014) 'Big data and its technical challenges' Communications of the ACM 57, 7 (July 2014) 86-94

Manwaring K. & Clarke R. (2015) 'Surfing the third wave of computing: a framework for research into eObjects' Computer Law & Security Review 31, 5 (October 2015) 586-603, at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2613198

Müller H. & Freytag J.-C. (2003) 'Problems, Methods and Challenges in Comprehensive Data Cleansing' Technical Report HUB-IB-164, Humboldt-Universität zu Berlin, Institut für Informatik, 2003, at http://www.informatik.uni-jena.de/dbis/lehre/ss2005/sem_dwh/lit/MuFr03.pdf

Piprani B. & Ernst D. (2008) 'A Model for Data Quality Assessment' Proc. OTM Workshops (5333) 2008, pp 750-759

Rahm E. & Do H.H. (2000) 'Data cleaning: Problems and current approaches' IEEE Data Eng. Bull., 2000, at http://dc-pubs.dbs.uni-leipzig.de/files/Rahm2000DataCleaningProblemsand.pdf

Shanks G. & Darke P. (1998) 'Understanding Data Quality in a Data Warehouse' The Australian Computer Journal 30 (1998) 122-128

Wang R.Y. & Strong D.M. (1996) 'Beyond Accuracy: What Data Quality Means to Data Consumers' Journal of Management Information Systems 12, 4 (Spring, 1996) 5-33

Wigan M.R. & Clarke R. (2013) 'Big Data's Big Unintended Consequences' IEEE Computer 46, 6 (June 2013) 46-53, PrePrint at http://www.rogerclarke.com/DV/BigData-1303.html

Author Affiliations

Roger Clarke is Principal of Xamax Consultancy Pty Ltd, Canberra. He is also a Visiting Professor in the Cyberspace Law & Policy Centre at the University of N.S.W., and a Visiting Professor in the Research School of Computer Science at the Australian National University. He is a longstanding Board member of the Australian Privacy Foundation (APF), including as Chair 2006-14, and was a Board member of Electronic Frontiers Australia (EFA) 2000-05, and a Director of the Internet Society of Australia 2010-15, including as Secretary 2012-15.

Created: 23 October 2015 - Last Amended: 10 November 2015 by Roger Clarke - Site Last Verified: 15 February 2009