Thursday, August 21, 2008

Data quality 1: Data Governance

I work in the dirty end of town, as far as data goes. All kinds of rubbish filters down from a variety of data sources into a data warehouse, from which I'm typically called on to extract more value than it contains. Why would you expect that data to be complete and correct when you don't even maintain a data dictionary - or any metadata repository or documentation at all? From large enterprises to small businesses, the organisations I've experienced have rarely gone far beyond lip service on this.

That's only one reason data is dirty. Another: it's only as good as its source. Data from salespeople - lead information, for example - is notoriously haphazard. Conversely, bank account data is likely to be pretty clean - the balance is, anyway - although peripheral information is often less than pristine.
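To make that concrete, here's a minimal sketch of the kind of first-pass profiling I mean, using pandas. The records and column names (lead_id, email, region, est_value) are entirely hypothetical - whatever your source system holds, the questions are the same: how much is missing, how much is duplicated, how much is simply invalid?

```python
import pandas as pd

# Hypothetical lead records as they might arrive from a CRM extract;
# the column names are illustrative, not from any particular system.
leads = pd.DataFrame({
    "lead_id":   [101, 102, 103, 103, 104],
    "email":     ["a@example.com", "n/a", None, None, "b@example"],
    "region":    ["NSW", "nsw", "VIC", "VIC", ""],
    "est_value": [5000, -1, 12000, 12000, None],
})

# A first-pass profile: how much of each column is actually usable?
profile = pd.DataFrame({
    "nulls":    leads.isna().sum(),
    "null_pct": (leads.isna().mean() * 100).round(1),
    "distinct": leads.nunique(),
})
print(profile)

# Simple validity checks - crude, but enough to flag the obvious rubbish.
dupes = leads.duplicated(subset="lead_id").sum()
bad_email = (~leads["email"].fillna("").str.contains(r"@.+\.", regex=True)).sum()
bad_value = (leads["est_value"].fillna(-1) < 0).sum()
print(f"duplicate lead_ids: {dupes}, suspect emails: {bad_email}, "
      f"negative/missing values: {bad_value}")
```

Even a profile this crude tells you whether the data can bear the weight you're about to put on it.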


Rule 1) The data is only as good as its source. Don't stretch the data beyond its original purview unless you're willing to work on it first.
Rule 2) The state of the data directly relates to how much it's used - and how much someone cares about it.

For example, in putting together aggregate information for resource planning, I find the reliability of sales information is poor - but much better for those specific elements of data on which the salesperson hangs a KPI. (Even then, they may have their own reasons for misrepresenting reality.)
And people may well have procedures set out that guide their workflow, and thus govern data at the point of entry. But people often do whatever they need to get the job done, which doesn't always mean following procedure strictly - only closely enough to satisfy the next person in the data stream. If that data - which purports to be clean - is then used in aggregate, the numbers may not add up.

Rule 3) Officially documented procedures don't always guarantee the health of the data.
Rule 4) If the data is to be used for a purpose outside the current day-to-day processing, it should be tested first - see the sketch below.
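By way of illustration for Rule 4, here's a minimal sketch of the kind of reconciliation test I mean - re-deriving an aggregate from the detail rows and comparing it with what a downstream report claims. Again, the tables and figures are hypothetical.

```python
import pandas as pd

# Hypothetical data: operational detail rows, and a separately
# maintained summary that management reporting relies on.
detail = pd.DataFrame({
    "branch": ["A", "A", "B", "B", "B"],
    "amount": [100.0, 250.0, 80.0, None, 120.0],  # a NULL that day-to-day use tolerates
})
reported = pd.DataFrame({
    "branch": ["A", "B"],
    "total":  [350.0, 300.0],
})

# Re-derive the aggregate from the detail and compare it with
# what the downstream report claims.
recalc = detail.groupby("branch", as_index=False)["amount"].sum()
check = reported.merge(recalc, on="branch")
check["diff"] = check["total"] - check["amount"]
mismatches = check[check["diff"].abs() > 0.005]
print(mismatches)  # branch B is out by 100: the NULL nobody cared about
```

A mismatch like branch B's doesn't mean anyone did anything wrong at the point of entry - the NULL was harmless for day-to-day processing - but it makes the aggregate worthless until it's investigated.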
And so on - much more could be said, though it may sound obvious if you already manage your data well. Business analysis from the dirty end can unearth business process improvements (or enforcements). But systemic improvement needs to come from the C level.


Data Governance, defined broadly, is “a system of decision rights and accountabilities” which describes roles and processes for the use of data. More here.

Steve Adler, IBM's international Director of Data Governance, gave a stimulating talk earlier this year on data governance issues. He began with the concept of toxic (data) content: content that leads to poor, uninformed decision-making. Two examples he gave were: a) the origins of the sub-prime crisis in poor-quality risk/lending data; and b) the US administration's influence campaign on the Iraq war, seeding news outlets with purportedly independent expert sources (see here). In both cases, the information was tainted because the sources weren't verified as pristine.


Adler's presentation (here) is valuable to any organisation with an important stake in the quality of its data. The solutions he recommends are predicated on an organisation caring about its data and having the resources to look after it. Either the will or the resourcing may be lacking, whether intentionally or through negligence. But at the very least, an IT manager should understand the issues. Otherwise, they may promise the earth on a simple-sounding programme, only to find the deliverables mandate costly data cleansing projects spanning technical analysis, business analysis, documentation and procedural change - not to mention the political will and influence needed to make improvements stick.

Adler details a theoretical internal market for data whereby user demand "sets the internal price". That often happens already - but very much on an ad hoc basis. If it were more formalised, the IT budget - especially for data maintenance - would grow to the point where the users' data could be looked after to the extent that they have an interest in it and are willing to pay for it. This would, of course, necessitate good cooperation with the business areas that source the data - the answers are always ultimately at the business end of the company. Hence it is quite unviable to run good data governance without C-level buy-in.


IBM originated the concept of Data Governance back in 2005; however, it has since spread beyond the vendor-specific (see here). It may sound bureaucratic - the antithesis of a free market - but the above illustrates the point well: if a free market is desired, paradoxical as it sounds, it requires governance.


The general principles apply equally everywhere; however, a full implementation is best suited to the enterprise level, as smaller organisations would find it harder to commit dedicated resources on an ongoing basis. For those below enterprise level, there are still plenty of ways to grasp the nettle and improve the framework for effective use of data.


Further reading:

Wikipedia: brief description of Data Governance

Steve Adler's blog

Adler on the Data Governance Council

The Data Governance Institute

Mike2.0 (an open source information management methodology) on Data Governance

Mike2.0 giving a context for Data Governance




