Showing posts with label Arrays. Show all posts
Showing posts with label Arrays. Show all posts

Monday, December 01, 2014

Selecting Random Transactions - Generating Random Numbers for Auditing

I taught a class recently and one of the students wanted to know how to randomly sample transactions from a list for auditing. He phrased the question this way: "How can I generate a list of random transaction numbers and make sure they don't repeat". Immediately, I thought of a really easy way to answer the question. But to give the literal answer to that question, it takes more effort.

Speaking with the accountants, they let me know that the highest value transactions should be investigated. Also, I should note that when selecting a random sample of transactions from a list, the subset of transactions should be a representative sample of the entire list. Statistics can be used to figure out how many transactions are enough to gather a representative sample, but this article will focus on how to generate the list of unique random numbers once you know how many you need.

The Easy Method

The good thing about the easy method is that it can be done quickly and easily, it just doesn't directly create a list of unique random transaction numbers. List the transactions in an Excel Worksheet, then create a new column to the right of the data set. In the first cell at the top of the column, type the column label as "Random Number". In the cell below type the following formula: =RAND() then hit enter. A random number between 0 and 1 is generated and every time you hit the F9 key or change a cell on the Worksheet, a new random number between 0 and 1 will appear. If you copy and paste this formula down to the end of the transaction list, you can assign a random number to each transaction. At this point, you should copy this new column then paste as values so that the numbers won't change. Now sort the list of transactions based on these random values. You can take the first n rows to create a random subset of transactions.

The Hard Method

Now, to answer the question directly, some formula magic is required. In fact, an array formula is needed to create a unique list of random transaction numbers. To start with, we need a lower and upper boundary of transaction numbers. Type “Lower Bound” in cell C1 and 70 in cell D1. Type “Upper Bound” in cell C2 and 79 in cell D2.

Place a label in cell A1 that says "Unique Random Numbers". Now, in cell A2, type the following formula: =RANDBETWEEN(D1,D2) then press enter. This will generate a random number without decimal places from 70 to 79. Pressing the F9 key or changing a cell on the worksheet will generate another random number within the transaction number boundaries.









Now, an array formula needs to be written in cell A3 that respects the transaction number range, is aware of what is in the cell above, and chooses a random number from the available remaining numbers. The magic formula is:
 =LARGE(ROW(INDIRECT($D$1&":"&$D$2))*NOT(COUNTIF(A$2:A2,ROW(INDIRECT($D$1&":"&$D$2)))),RANDBETWEEN(1,($D$2-$D$1)-ROW(A3)+3))
When entering the formula, make sure to hold down Ctrl+Shift while pressing the Enter key. This is a Ctrl+Shift+Enter array formula.

To understand the formula better, it should be explained step-by-step, so start with:
ROW(INDIRECT($D$1&":"&$D$2)).
When the numbers from D1 and D2 are placed in this formula, the result is each number from 70 to 79 in an array: {70,71,72,73,74,75,76,77,78,79}. We want to select a random number from this array, but not the number that appears above in cell A2, so this array needs to be multiplied by an array of 1’s and 0’s.

COUNTIF(A$2:A2,ROW(INDIRECT($D$1&":"&$D$2))) is the next part of the formula and is tricky to understand. The expandable range to search is A$2:A2, which contains one random number between 70 and 79. The second part of the COUNTIF performs a search for all 10 numbers between 70 and 79. Only one of these 10 numbers can be found in cell A2. If A2 contains the number 79, then this COUNTIF formula will return an array of counts that look like {0,0,0,0,0,0,0,0,0,1}. This array shows us what has been eliminated by the previously generated random number, but we want to know which numbers are still available for selection, so put this COUNTIF inside a NOT function to reverse the logic and switch the 0’s and 1’s to {1,1,1,1,1,1,1,1,1,0}.

=ROW(INDIRECT($D$1&":"&$D$2))*NOT(COUNTIF(A$2:A2,ROW(INDIRECT($D$1&":"&$D$2)))) multiplies the two arrays together. In this example, it would look like {70,71,72,73,74,75,76,77,78,0} because 79 is already selected in cell A2. The next step is to select the first or second or third or maybe the ninth largest number in this array. Place this formula inside a LARGE function to choose, let’s say, the fourth largest number from the remaining numbers: =LARGE(ROW(INDIRECT($D$1&":"&$D$2))*NOT(COUNTIF(A$2:A2,ROW(INDIRECT($D$1&":"&$D$2)))),4)
To get the LARGE formula working, make sure to hold down Ctrl+Shift while pressing the Enter key.

Of course we don’t just want the fourth largest number, we want to randomly choose from the 9 remaining numbers. Replace the 4 with RANDBETWEEN(1,9). Make sure to press Ctrl+Shift+Enter. This formula works well for cell A3, but when it gets copied down to cell A4, there will only be 8 numbers to choose from, so the RANDBETWEEN(1,9) needs to change to RANDBETWEEN(1,($D$2-$D$1)-ROW(A3)+3). The 9 has been replaced with a formula that calculates the difference between the upper and lower bound then subtracts the current row and adds back 3. Ctrl+Shift+Enter this formula.

If you want to follow the action inside the formula using Excel, select cell A3 then click on the Formulas tab and the Evaluate Formula button. Next, press the Evaluate button repeatedly to show the step-by-step calculation.

With cell A3 selected, copy the cell then select A4:A7 and paste the formula. The $ symbols were carefully placed in the formula to make sure that the formula works when copied down. You should always have six unique numbers between 70 and 79. You can change which six numbers appear by pressing the F9 key.












Once you are confident that there will never be a duplicate number, change the values in cells D1 and D2 to a range of transaction numbers. Also, copy and paste the formula in cell A3 down for as many unique random numbers as you want.

If you have any questions for future articles, would like to know more about using Excel, Access Databases, VBA, Charts, need a dashboard, data model, database solution or analytics training, please contact Scott Cuthbertson:

Email: scottjcuthbertson@gmail.com
LinkedIn: http://ca.linkedin.com/in/scottcuthbertson
Blog: http://theanalyticspro.blogspot.ca/

Scott Cuthbertson BBA

Scott is an Analytics Professional who has worked with over 100 clients in the retail and financial services industries. He has helped organizations improve their labour scheduling, inventory replenishment, performance management, organization design and pricing. He has been a practitioner of analytics since childhood, using data models to win Hockey Pools. His expertise lies in developing data models, reporting, algorithms and solutions for any organization that wants to use their data to improve decision making. He also teaches at the CMA Professional Development Institute.

© Scott Cuthbertson 2014 Blog: Words From The Analytics Pro



Tuesday, January 14, 2014

Ctrl+Shift+Enter - Welcome To The Wild World Of Excel Arrays

One of the most obscure and least documented features in Excel is also one of the most powerful. If you have not yet encountered array calculations in Excel, I’ll let you in on the secret. An array formula is one that performs multiple calculations in one cell. Here is a simple example:

In order to calculate the value of inventory, the Cost should be multiplied by the Units On-Hand for each Item. Once the four multiplications have taken place, the results should be summed.

A common approach to this calculation is to add a new column to perform the multiplications. The formula in cell D2 is =B2*C2, in cell D3 is =B3*C3, and so on.

After each row’s Cost Value is calculated, a SUM Function adding column D will total $97.00. This result was obtained in two steps and used five cells, one for each Item and one for the SUM. If there is more data, this approach will use more cells to calculate a result. The accumulation of enough of these intermediate calculations can bloat a data model to a size that is difficult to open or email.

An Array Formula can achieve the same result of $97.00 by performing this calculation in one step using one cell. The Array Formula is: =SUM(B2:B5*C2:C5)

Each of the four cells in column B is multiplied by the corresponding cell in column C. The four multiplications are then enclosed in a SUM Function to add the four results. The only problem is that after typing in this formula and pressing the Enter key, a #VALUE! error appears. Since this formula is an array calculation, it needs to be entered using a special combination of keys: Ctrl+Shift+Enter. Hold Ctrl and Shift down while pressing Enter, doing this will enclose the formula in curly brackets. This is Excel’s method of showing that it understands that an array operation is performed. The curly brackets are never typed in, they result from entering an array formula with Ctrl+Shift+Enter.

There is another Array solution for this calculation that does not require Ctrl+Shift+Enter, the SUMPRODUCT Function. You can see in the function ScreenTip that an array is expected. =SUMPRODUCT(B2:B5,C2:C5) and then Enter will multiply corresponding cells then add the results. Ctrl+Shift+Enter was required in the previous example because the SUM function is not expecting arrays to be multiplied inside.

This example is a simple introduction to the existence of array calculations. The real benefit of an array formula is in situations where there are thousands of Items. The array formula can be performed in one cell without the need for thousands of intermediate calculations. This results in smaller data models and can help avoid 30-60 Megabyte file sizes.

Also, array formulae open up amazing new possibilities…

Where is the MAXIF Function? Where is the MINIF Function?

A common gateway to the world of array calculations is when you want to retrieve the maximum value based on a condition, a MAXIF function. Even though Excel has provided functions for SUMIF, COUNTIF and even AVERAGEIF, there is no MAXIF. Here is an example to show how an array formula can simulate MAXIF until the day when Microsoft adds the MAXIF function to Excel.

This data set stores the sales value for each salesperson on each day through the year. Imagine the table starts in cell A1 and ends in cell C2000. What is the largest daily sales value for Tara? The solutions I see most often are:

1) Manually sort or filter the data to calculate the MAX Function for the “Tara” records. This is a good solution for a quick one-time calculation but is not ideal for calculating the MAX for each salesperson.

2) In column D, add an IF Function for each row to only display sales values where the salesperson is “Tara” then use the MAX Function on the new column. This solution is not much better than the first one.

3) Use a PivotTable to calculate the MAX value for Tara and the other salespeople. This is a good solution and has much more flexibility than the first two.

4) Use an array formula for maximum flexibility.

I prefer formula-based solutions because they automatically recalculate when cell inputs change. Here is the array formula to calculate Tara’s maximum daily sales (don’t forget Ctrl+Shift+Enter):

=MAX(IF($B$2:$B$2000=E2,$C$2:$C$2000,""))

IF($B$2:$B$2000=E2,$C$2:$C$2000,"") will perform 1,999 IF calculations, each comparing a value in column B to “Tara” in cell E2. Wherever the column B value is “Tara”, the corresponding value in column C is returned, otherwise the empty string “” is returned. Once all these IF calculations are finished, they are enclosed inside a MAX Function which will extract the largest value. The spreadsheet has been spared the use of 1,999 cells.




















This formula can easily be copied and pasted downwards to the other salespeople. The formula can also be easily modified to determine the MIN, MEDIAN or QUARTILE of Tara’s sales values. Any summary function can be used to quickly understand the data.

One caution though, since almost 2,000 calculations are happening in one cell, the array formula takes a little longer to calculate than a simple formula, but only one cell was required. It is not advisable to take this array calculation and copy and paste it thousands of times, you will have wandered into the realm of databases at that point.

Knowing how to use arrays will take your data models to the next level of efficiency. Array calculations work in different versions of Excel and can help you create better data models. There are many more applications of these ideas, especially those simulating database functionality on a small scale in Excel.

If you are interested in learning more about Excel Array Calculations, let me show your organization how it can improve its Excel models. For any analytics training or analytical solutions, please contact:

Email: scottjcuthbertson@gmail.com
LinkedIn: http://ca.linkedin.com/in/scottcuthbertson
Blog: http://theanalyticspro.blogspot.ca/

Scott Cuthbertson BBA

Scott is an Analytics Professional who has worked with over 100 clients in the retail and financial services industries. He has helped organizations improve their labour scheduling, inventory replenishment, performance management, organization design and pricing. He has been a practitioner of analytics since childhood, using data models to win Hockey Pools. His expertise lies in developing data models, reporting, algorithms and solutions for any organization that wants to use their data to improve decision making. He also teaches at the CMA Professional Development Institute.

© Scott Cuthbertson 2014 Blog: Words From The Analytics Pro