Monday, 18 February 2013

Count Distinct equivalent in Excel

Sometimes MicroSoft try to gently tell me off for using the wrong product to do a job.
Sometimes I respond in a very adolescent fashion: "But that's not FAIR! It's not my fault!"
This, sad to say, is one of those times.
I have been requested to generate a report describing how many accesses each of our ALevel courses have had since the start of September. I don't have access to the SQL databases - this was a choice made by the IT team which I can understand and appreciate as if I knacker it, they have to repair it. That being said, it causes me grief at times like this.
The alternative method is to use the reports system in Moodle.
  •  In Logs you can produce a report of all activity in a course on a specific day, or over the whole of time. 
  •  In Course Participation, you can look back over a specific period of time, but only for each individual resource or activity in the course, not the whole course at once.
We have 75 courses making up our ALevels (they've been broken down into units, and there are some Enrichment programmes included - we do not deliver 75 ALevel subjects!) so I really want to do as little processing as possible.
I used the Logs option and exported the data to Excel sheets (alternative options: print to page, print to ODS, download as text).
At this point I should probably have imported the data from the spreadsheets into a database, but I did not. This was partially because I need to be able to distribute the data and while many people here understand the layout and structure of a spreadsheet, they do not all feel as comfortable with databases.
Instead, I combined all the downloaded sheets into a workbook, deleted everything older than September 1st 2012 and removed all instances of my name (my presence skews the data quite badly).
My next action explains the purpose of this post: I wanted to count the total number of hits in each course which is fairly simple (=counta(SheetName!D:D) - Note I wanted it to count everything in the column as columns are variable height. I will need to subtract one to remove the column headings, but no biggie) and I also wanted to count the number of users who had accessed the course.
This was more complicated - if I was using SQL I could simply CountDistinct but MicroSoft in their wisdom, have not made it that simple.
However, there is a very elegant mathematical solution:
=SUMPRODUCT((DataRange<>"")/COUNTIF(DataRange,DataRange&""))
Isn't it lovely?
In summary, it checks the data range for duplicate values (discounting blank cells as non-values) and creates an array describing the number of times each item appears. Each of these is then converted to a fractional value in a beautifully cunning twist so that the sum of all their occurances is equal to 1. The total of these is then taken to give the number of distinct values.
Now, I owe my thanks to one of the ALevel Maths teachers for sending me an example of almost exactly this, but I'm not sure where he got it from.
However, for a much better explanation of how it works, check out: here and here
There is still a lot to be done in terms of pointing to the correct data range 75 times, but it simply adds an incentive to learn the VBA to do that job for me.