Data Warehouse Project Work Effort Estimation…and Why Hope is Not a Strategy

By Gregory Lampshire

I talk with many clients around their data warehouse programs. In some cases, projects run longer and cost more than expected. This is true for clients who have mature data warehouse development process as well as those with new capabilities. Why is estimation so hard? Let’s ask that question in the context of healthcare analytics and figure out why counting on “hope” is a bad way to run your data warehouse program.

The Path to Reliable Estimates is not Well-Defined

One reason that estimation is hard is because it is difficult to specify analytical requirements. Yes, I know it’s fairly easy to specify that you want a report that runs a cross-tab between admission types and paid claim amounts. But let’s look a little deeper at what’s necessary to get a valid report–the analytics underlying it. What definition do you use for admission types? Are they the same definitions across all of the groups? How do you calculate paid claim amounts–what do you exclude or include? Do you have all the data? While data governance standards can help you answer these questions, sometimes it’s the next level of detail that makes it difficult to define a path from a report request to the development effort to create that report–that analytical artifact.

The path to reliable estimation is not always well-defined because analytics, using best practices, tries to be predictive. When you use predictive modeling techniques, however, you do not always know what variables are important. For example, when you are forecasting utilization rates, do you want to take last year’s utilization rates and bump them up 10 percent? Why 10 percent? Maybe other factors are influential. Are patient demographics changing? Are providers cost-shifting procedures to achieve higher reimbursement rates in one area versus another? Perhaps governmental factors, such as the increase in Medicaid enrollees, is driving up utilization for basic services.

Because you cannot specify everything, you will need to address basic utilization forecasting before you start. That means you have to play with the data, understand data quality issues, understand proxies for measures, understand the information content of specific datasets. In other words, you do not know what you need and how you will need it until you solve the problem. That’s why it’s a not a tidy, well-defined path to trustworthy analytics underlying your estimates.. At the end of the day, however, predictive modeling that takes into account various factors is still a better way to forecast than guessing or hopeful thinking.

Just because something is not well-defined does not mean you should give up or just hope that the answer you give is the right one. As we know, “hope” is not a great way to run a company or a program. Here I’ll share some of my thoughts on a more studied approach to estimating.

Use Top-Down Estimation

The most basic approach to estimation that most companies use is top-down estimation. It comes in two forms.

The first form is by saying, “hmmm….my management judgment says it will take 4 weeks to gather requirements, 3 weeks for design and architecture, 6 weeks of build and unit test, 3 weeks of systems integration testing, 2 weeks of UAT and 1 week to deploy. Okay…that’s about right.” That’s management judgment at work. This approach is actually fairly good but is highly variable and not easily reproduced consistently across different managers. This method is good at capturing organizational dynamics–for example, the business is busy this month, so requirements will take longer, or the development team is busy on another release, so they need 2x the amount of time they would normally need. So it’s good and proper to use this form of top-down estimation.

The second form of top-down estimation uses look-alike comparsions. Project X was really HARD and it took 6 months. Project Y is of the same order of complexity, so therefore it’s 6 months as well +/- 2 weeks. This approach takes into account high-level, structural complexity. For example, I have to master the members (uniquing). or I have to build an organizational view of providers, or I must consider some other aspect of complexity that can be identified at a high level. This is a good way to capture estimates for this type of information.

But top-down estimation is not enough. While it can be good at times and eerily accurate, it’s not always reliable, scalable across many projects and managers, or consistent. Hoping that top-down answers are right is not a strategy. A strategy is a purposeful creation that management should address with purposeful action. In some cases, these top-down estimates are guesses, and guesses are not why I pay managers to work for me. I do not expect perfect estimates, but I expect estimates that I can learn from organizationally, and continually refine during the life of the project. I need estimates that make me better even if they are wrong today.

Use Bottom-Up Estimation

Bottom-up estimation is the second major approach to creating work estimates. Bottom-up estimates do involve management judgment and analytical models and estimation of complexity. But they do so in a reproducable, structured way:

What goes into a bottom-up estimate?

  • How many tables are being staged from a source to a staging database, then to an operational data store, then to a data warehouse?
  • How many ETL programs must be written assuming the organization’s standard architecture and design for ETL?
  • How many sessions with users are needed to gather requirements?
  • How many reports are to be created? Are they well understood?
  • How many new servers must be procured, installed and configured?
  • How many data model changes, dimensions or facts are needed?
  • …other factors in your estimating model

Work units represent fundamental steps in data warehousing needed to properly move data from one place to the other in a way that satisfies business requirements. There are many different levels of bottom-up estimation, but generally you need these types of details, in some form, to create a large spreadsheet to perform estimation. For each item, find a count or use a parametric model to convert the count into work effort. Those conversion factors represent your productivity factors.

You cannot always get all of the data in the exact form you want. However, you can detail your estimates and submodels in the bottom-up calculation. If you don’t have the number of tables in the sources, you can estimate based on organizational average, say 50 or 20 or 30. You can model the average number of attributes per table. You can state your assumption for the number of requirements and reports. All of this can be stated and most importantly, recorded, in a spreadsheet or estimating tool that captures the assumption and allows you to change it as needed, throughout the duration of the project.

It‘s the ability to be explicit with your assumptions and use an explicit model that makes a good bottom-up estimate.

Are bottom-up estimates sometimes wrong? You bet. But through a few iterations, the models can be become significantly better more quickly. In addition, you can do things you could not do before, such as engage multiple vendors and understand their assumptions, use a disciplined number of development steps, the SDLC, built into the estimate, which again forces standardization across potentially diverse sets of suppliers.

Putting it All Together

So both top-down and bottom-up estimation are needed and useful. BOTH must be performed to arrive at an estimate that can be triangulated from different points of view. Somewhere in the middle is the work estimate you can use to start your project. Hope is not a strategy when it comes to data warehouse projects–use analytics to help you estimate and get better over time–in a way, that’s the whole point of analytics.

Putting it together also means that you need to run the top-down and bottom-up model across all projects at the same time. This allows you to identify duplication or other types of common costs that can be more efficiently solutioned.

How good can your estimates be? For small-to-large and simple-to-complex projects you can achieve good estimates with 5-10% of actuals. I’ve always considered a good estimate to be one that is within 5% of the actuals, especially for analytical projects. Your target tolerance may vary but you’ll be using a methodology that employs the analytics you are building for others.

What strategies do you use to develop reliable estimates for your data warehousing project work efforts? Let me hear from on this interesting topic.

This entry was posted in Blog and tagged , , , , , , , . Bookmark the permalink.


Contact Us

Post Categories