Accelerating Data Profiling Efforts

By Jim Van de Water

Some companies limit data profiling to a tsunami of SQL queries written by analysts. This non-scalable approach consumes a great deal of time and is tedious, uninspiring work for a skilled analyst. Most importantly, it does not enable the collective insight of data profiling reviews. For that, we need an accelerator and a quorum of in-house experts.

Vendors have come to market with toolsets that make it virtually inexcusable to run massive manual SQL checks to profile data. These data quality analysis accelerators provide a consistent set of data heuristics at the click of a mouse. The tools offer an ad hoc view of the data that is both broad and deep. Here are just a few of the items that can be validated: unique values, domain and range of values, default values, data types, field formats, outliers, codeset validity, presence of nulls, blank data, and invalid characters.
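
To make those heuristics concrete, here is a minimal sketch of the column-level checks such a tool automates, written in Python with pandas. The file name and column handling are illustrative assumptions, not any particular vendor's API:

```python
# A minimal sketch of automated column-level profiling, assuming pandas.
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Compute basic data quality heuristics for one column."""
    non_null = series.dropna()
    profile = {
        "dtype": str(series.dtype),
        "row_count": len(series),
        "null_count": int(series.isna().sum()),
        "blank_count": int((series.astype(str).str.strip() == "").sum()),
        "unique_values": int(non_null.nunique()),
    }
    # Range checks only make sense for numeric fields.
    if pd.api.types.is_numeric_dtype(series):
        profile["min"] = non_null.min()
        profile["max"] = non_null.max()
    return profile

# Hypothetical usage against an assumed source extract.
df = pd.read_csv("customers.csv")
report = {col: profile_column(df[col]) for col in df.columns}
```

A commercial profiler runs dozens of these checks across every column at once; the point is that none of them should ever be hand-written SQL again.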

The data profiling tools are the accelerator, but the real value comes from the meeting of minds at data profiling review sessions.

Data profiling review sessions need a quorum of business and IT participants. The analyst(s) who wrote the source-to-target mappings and data requirements need to attend. Invite business SMEs who know and use the data regularly. A QA representative should attend to clarify issues, log them in an issue tracker or enterprise quality tool, and track their resolution over the following days.

Sessions are generally guided by the analyst. The source-to-target mappings and business requirements documents should be close at hand for reference. The most useful review sessions have live connections to the data profiling tool and to the data sources. Questions that arise during the review can be explored in more detail by drilling into the data profiler and/or by querying the source data directly.

Send out the link to the data quality profiling results at least a day before the review session so that everyone has a chance to do reasonability checks. Distribute a checklist of generic data quality pointers; it will direct attention to key fields such as primary keys and the fields required by business reports, and it helps ensure a consistent approach across the effort.
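
One way to picture that checklist is as a small set of pre-session reasonability checks. The sketch below, assuming pandas, uses hypothetical key and field names (order_id, order_date, customer_id) purely for illustration:

```python
# A hedged sketch of generic checklist items encoded as reasonability checks.
import pandas as pd

def reasonability_checks(df: pd.DataFrame, primary_key: str,
                         required: list[str]) -> list[str]:
    """Return findings worth raising at the review session."""
    findings = []
    if df[primary_key].duplicated().any():
        findings.append(f"Primary key '{primary_key}' has duplicate values.")
    if df[primary_key].isna().any():
        findings.append(f"Primary key '{primary_key}' contains nulls.")
    for col in required:
        pct_null = df[col].isna().mean() * 100
        if pct_null > 0:
            findings.append(f"Required field '{col}' is {pct_null:.1f}% null.")
    return findings

# Hypothetical usage: field names are assumptions, not from a real project.
df = pd.read_csv("orders.csv")
issues = reasonability_checks(df, primary_key="order_id",
                              required=["order_date", "customer_id"])
```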

Witnessing the collaborative thinking in these sessions is fascinating. Each participant comes to the meeting with a unique point of view on the project inputs and outputs. The data profiling results are parsed to answer questions, which in turn generate new questions. A quick exploration back into the source system provides some immediate answers. The data mappings and business requirements are validated. New business rules are proposed on the spot to resolve observed issues. The roundtable discussion can be rich and fruitful. This is one forum where the whole is certainly greater than the sum of the parts.

Automated data profiling has become so effortless that we need to consider checking data quality at many points in the development lifecycle – during source analysis, when source data is landed, and after business rules are applied. The data profiling exercise need not be limited to single files or tables. Many profiling tools have inter-table profiling capabilities that can help validate referential integrity and find orphan keys.
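
To illustrate what inter-table profiling amounts to, here is a minimal orphan-key check as an anti-join in pandas. The table and column names (orders, customers, customer_id) are assumptions for the sake of the example:

```python
# A minimal sketch of inter-table profiling: find orphan foreign keys.
import pandas as pd

orders = pd.read_csv("orders.csv")        # child table with a customer_id FK
customers = pd.read_csv("customers.csv")  # parent table with customer_id PK

# Orphans: child rows whose key has no match in the parent table.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(f"{len(orphans)} orphan rows violate referential integrity")
```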

Analysts should be analyzing data profiling output, not writing and running endless queries. The delivery team reaches a new level of collaboration when the tools and processes that enable data profiling become part of the team mindset.
