Wednesday, March 2, 2011

NWN Modding Statistical Analysis: Part 2

Some of the most interesting studies I've ever done resulted in my discovery that the initial assumptions that led me to the study were wrong. This is one such case.

Last post looked at download and voting rates for NWN2 modules. However, I also wanted to discuss module scores, as it seemed to me that scores were increasing over time. This occurred in NWN1, and many explanations were posited. One of the most common was that as time progressed, modules were getting better and more sophisticated; therefore, the rising scores actually did reflect rising quality. Wouldn't the same phenomenon apply to NWN2?

I admit that my initial thought was "no" for three reasons. First, the community-made content for NWN1 increased at a much more rapid rate than it has for NWN2. It is undoubtedly true that some of the community-made tilesets far exceeded those developed by Bioware for the initial game, and the same would apply to a host of creature models, placeables, and so on. NWN2, on the other hand, hasn't seen nearly the same amount of content. Therefore, newer NWN2 modules use the same tilesets and monsters that older ones did and so don't look decidedly better as a result.

Second, NWN2 started from a higher standard than did NWN1. Very few people just played around with the toolset and released a half-baked mod for NWN2 whereas several dozen did that within the first few months of NWN1's release. Therefore, there wasn’t as far to go up in terms of quality.

Finally, one of the earliest NWN2 modules I played was Zach Holbrook's "Harp and Chrysanthemum" – a module I shall dwell on quite a bit in this post – and with absolutely no false modesty here, I firmly believe that module, released in December of 2007, is every bit as high quality as TMGS, although shorter. Therefore, the Module of the Year for 2007 is not substantially different in terms of quality from one of the top mods by vote score of 2010.

Therefore, I set out to see if I could discover whether module scores were really increasing, and if so, why. I took the top 50 modules from the NW Vault – that is, everything from the first page. The only one I removed was "Mysteries of Westgate," which checked in at #46, because it really wasn't a "community-made" mod. Otherwise, everything from the front page was included. I then graphed (1) the current score as of February 28 vs. the release month and (2) the module rank vs. the release month. These graphs are below.

As can be seen, I did denote which of the modules were not traditional adventure modules – things like character generators or OC add-ons – but I included them on the graphs anyway. Now, looking at the first graph, it is clear that scores are generally increasing over time. The equation represents the linear "best fit" for the scores of only the traditional adventure modules. Theoretically, an adventure mod good enough to make the Vault's first page that is released today (x = 0) should get a score of 9.67, whereas one from the beginning of 2008 (x = 38) would get 9.32. So the average adventure module has seen its score increase by 0.35 in a little more than three years. Although I didn't include the equation here, the increase is virtually the same if you include the non-adventure modules, although the overall scores are a bit higher.
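The fit equation itself isn't reproduced in the text, but it can be reconstructed from the two quoted points. Here is a minimal sketch, assuming a straight line through (x = 0, score 9.67) and (x = 38, score 9.32), with x counting months back from February 2011; the slope is derived from the post's numbers, not from the actual Vault data:

```python
def predicted_score(months_ago: float) -> float:
    """Linear best-fit score implied by the two quoted points:
    9.67 for a first-page adventure mod released today (x = 0),
    9.32 for one from the beginning of 2008 (x = 38)."""
    intercept = 9.67
    slope = (9.32 - 9.67) / 38  # roughly -0.0092 points per month
    return intercept + slope * months_ago

print(round(predicted_score(0), 2))   # 9.67 – released today
print(round(predicted_score(38), 2))  # 9.32 – released at the start of 2008
```

The 0.35-point gap over 38 months works out to about 0.11 points of apparent inflation per year for first-page adventure modules.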

The information is shown a bit differently in the second graph. Here, the module rank is on the y-axis, so only one point can have any given y-value. Every y-value from 1 to 50 is represented here with the exception of the missing #46 ("Mysteries of Westgate's" rank). Again, the modules toward the left of the graph – meaning they're newer – generally have a lower (better) rank. As can be seen, none of the eleven modules released before December 2007 currently has a rank better than 33rd, while none of the twelve modules released in 2010 or later is ranked worse than 35th. Clearly the newer modules tend to be doing better.

As an aside, another major trend jumps out at me, especially looking at the second graph. The non-adventure modules tend to score better at every stage of NWN2's life than their adventure mod counterparts. This trend started early. Looking at the modules released in 2007, "Harp and Chrysanthemum" is far and away the best performing adventure mod, and it's still ranked an amazing #14. However, "Bishop's Romance," released a mere two weeks before H&C, is ranked #6. Even today, of the top 10 modules on the Vault, an overwhelming EIGHT – including all of the top FOUR – cannot be considered adventure modules. These eight are "Lute Hero", "Romance Pack for the OC", "Halloween", "The Heist at the NW Lights Casino", "Bishop's Romance", "NWN2 OC Makeover", "SOZ Holiday Expansion", and "DM101 for NWN2" (a new addition to the top 10). The only two modules to buck the trend are "Planescape: Shaper of Dreams" at #5 and TMGS at #7.

This trend, at least, should be easy to explain. Niche modules are apt only to be downloaded by people looking for those types of modules. People looking for said modules are likely to be favorably impressed by them. After all, who would download "Bishop's Romance" except someone looking to romance Bishop? And if that's what you're after, I'm sure the entry does a sterling job and so it will garner high marks. There's very little room to misinterpret what you'll get with such an entry.

To be quite honest, even "Planescape: Shaper of Dreams" and TMGS are on the niche side of adventure modules. The Planescape setting isn't for everyone, and I imagine I've lost quite a few downloads from players who have no interest in playing a cleric, much less a Tyrran specifically. Even the Ravenloft setting of "Misery Stone" at #11 might make it a niche module.

But getting back to the main point, it seems pretty clear that some vote inflation is occurring. Now the question is why. My initial thought was that a different kind of player predominated in the early days of NWN2. This player was in the community only because NWN2 was the "new" thing at that point. They bought the game, played through the campaign, downloaded everything that came out, easily had their attention diverted elsewhere, dumped a module if they weren't grabbed by it in 10 minutes, and then left the community as soon as Mass Effect or Dragon Age became the new "new" thing. Now – four years later – we’re only left with the truly dedicated, those people who appreciate D&D and well-developed stories, and their votes aren’t being as diluted by the more transient masses.


My test for this hypothesis was to look at the votes of some older modules. The Vault registers the date on which each vote was cast, so we can look at voting trends on the same module over time. If the prior paragraph were really true, even established modules should see their votes trending upwards over time.

My perfect test case for my hypothesis was the oft-mentioned "Harp and Chrysanthemum." H&C came out in December 2007, and it has the most votes of any NWN2 module on the Vault at 505. So I logged every vote H&C has, split them by month, and then calculated both the score for that particular month and the aggregate for all votes up to that point. For example, in December of 2007, H&C had 108 votes that averaged 9.55 for that month. At that point, the module score was (theoretically) 9.55, but because the Vault discards the top and bottom 10% of votes when computing the score, the actual score at that point was 9.63. In January of 2008, there were another 60 votes that averaged 9.62 for just this batch of 60. However, the aggregate score of the 168 votes through that point, after discarding the top and bottom 10%, was 9.65. The complete chart is given below.
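The Vault's scoring as described above is essentially a trimmed mean. A minimal sketch, assuming the 10% cutoff is simply truncated to a whole number of votes per end (the Vault's exact rounding rule is a guess here):

```python
def vault_score(votes):
    """Approximate the Vault's score: sort the votes, drop the top and
    bottom 10% (truncated to a whole count per end), average the rest."""
    ordered = sorted(votes)
    k = int(len(ordered) * 0.10)  # votes trimmed from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

# Hypothetical December batch of 108 votes (the real H&C ballots aren't public):
december = [10] * 54 + [9] * 54
print(round(vault_score(december), 2))  # → 9.5
```

Note how the trim makes the published score differ from the raw monthly average, which is exactly the 9.55-vs-9.63 gap described above.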

As can be seen, the hypothesis that H&C's votes are generally rising doesn't hold up. Every month since the beginning, the aggregate score has been between 9.63 and 9.66. In the 14 months since January 2010, six months have had monthly averages below the aggregate and six months have had averages above it. Two months – May and October of 2010 – had no votes at all and so have no data point associated with them. To confirm, I looked at the average scores for each year. In 2007, H&C had a score of 9.63. For 2008, it was 9.67. In 2009, it was 9.64. In 2010, it was also 9.64. Thus far in 2011, it's 9.83, but that only includes five votes, two of which (the high and the low) must be thrown out according to the Vault algorithm, so this is thus far statistically insignificant.

I wondered if H&C was a unique case due to its overwhelming early and enduring popularity, so I looked for something a little newer that still had a fair number of votes. I saw that "Asphyxia" was released in April of 2008, has 179 votes, and is currently ranked #26. The data will come in a bit, but the long and the short of it is that the same trend is apparent. The total score for "Asphyxia" for 2008 was 9.49. In 2009, it was 9.51. In 2010, it was 9.39. Thus far in 2011, it's 9.5, but that's only one vote.

Splitting Asphyxia's votes into quartiles, the first 45 votes covering from April 1, 2008 to May 15, 2008, averaged 9.48. The second group of 45 votes, covering until July 30, 2008, averaged 9.54. The third group of 45 votes, covering until January 2, 2009, averaged 9.54. The final 44 votes averaged 9.48. Again, no substantial increase over time, whether that time is defined in years or votes.

So I looked for a third module. The first two I had looked at had the vast bulk of their votes cast before 2010, and I wanted one that had a substantial number of its votes cast both before and after Dragon Age's release. My idea was that that might be the game that took away all the great unwashed I discussed earlier. I therefore looked for something that released around mid-2009, say April through August. There are surprisingly few that fit that single criterion. "Dark Waters (Full)", "ZORK", "Last of the Danaan", "Serene", and "Lolthanchwi" are it. I removed "Dark Waters" because it is a re-release of a couple of already existing modules. "ZORK" only has 26 votes. Among the last three, all have roughly 5400 downloads, but "Serene" has 115 votes compared to "Last of the Danaan's" 69 and "Lolthanchwi's" 51. Additionally, 32 of these votes, or about 27.8%, come after January 1, 2010. This is about the same percentage as the others, but the sample size is larger. So "Serene" was my choice.

The votes for "Serene" for 2009 averaged 9.32. The 2010 votes averaged 9.49. The six 2011 votes to date average 9.35. This module came closest to proving my hypothesis (because the 2011 vote total is still statistically insignificant). However, I'm still not convinced. Even though I picked "Serene" because of its closer proximity to Dragon Age, I'm still left with the obvious fact that the first two modules also garnered a number of votes after January 1, 2010, and they didn't show the same pattern. In fact, H&C has had 31 votes since New Year's Day, 2010, which compares favorably with "Serene’s" 32 votes, although this represents a much smaller percentage of H&C's overall total. Also, a closer look will reveal much more scatter in the month-to-month votes for "Serene." Breaking down the voting further, it started out with a high peak, then came down sharply, and then slowly rose a bit during the time of Dragon Age's release. It's just not the pattern I would have expected if my original hypothesis were correct.

Finally, I did the same analysis for TMGS out of my own interest. Because it's only been out seven months, I don't think anything interesting can be gleaned from those numbers, but for the record, its score has been coming down. The monthly aggregate scores for the four modules are given below. Again, note that these are only the monthly aggregates calculated as I described above.

So this analysis sort of shot down my thought that we would see score inflation for all modules, not just the newer ones. There are, of course, problems with the preceding analysis, so I'm not 100% sure I was wrong yet. It is possible that people are less likely to vote on a module that already has a bunch of votes. There may be the thought that their vote doesn't count when there are 500 others, unlike when there are only 10-15 votes. People may also vote with the crowd. If they know that H&C has a score of 9.65, they may approach voting from a standpoint of deciding whether it should be a 9.5 or a 9.75, even if they would vote a 10 or an 8 if they did not see the score beforehand. To get a much better idea of voting patterns, we'd need to look at the behavior of individual voters, and that gets into psychological factors, not to mention I have no interest in calling attention to individuals here. To get the best idea, we'd have to somehow wipe everyone's memory, re-release H&C today, and see what it would score. That, of course, is impossible.

From my own standpoint, my thoughts on voting have changed recently. When I first played and reviewed H&C in December 2007, I said it was "in the 9.00 range and probably higher" by the new NWN2 voting standards. Right now, I say it's a 10.00 because it's pretty obviously near the pinnacle, if not actually the pinnacle, of NWN2 mods. By the time I reviewed "Trinity" in March of 2010, I didn’t mention a score, but I was pretty ready to call it a 9.75 if not a 10.00. Today, I'd go all the way and say it's a 10.00. So I guess I would admit to some voting inflation of my own... if I voted. I bet there are a number of people who feel the same way.

So another post and I've solved few of my original questions. I am convinced vote inflation is occurring, but I assumed it was occurring universally, even for older modules. That doesn't appear to be the case. What I'm still not sure about is the reason.


Nemorem said...

One theory: With so few modules coming out, the players that are still around are appreciative of anything new and are rewarding new mods with higher scores.

Interesting (and detailed!) analysis.

Kamal said...

From the graphs it looks like there's generally a slight uptrend in votes before settling in, as three of the four you graphed do that. I do think there's a good chunk of "everyone else gave this an x, so I should be close" scoring.

Why might average module scores go up over time, with newer ones getting higher scores? Experience with the toolset, and experience seeing what others have done and shown can be done, could be reasons; earlier builders have set a bar for newer modules to try to jump over in terms of area design complexity, or scripting, or whatnot.

If I want to see how builder y did z in their mod I can open things up and take a look. Without having to reinvent the wheel of doing z, I am freed up to implement x.

And builders have released more and more systems for things. With my next module I didn't have to script lights coming on at night; Scripted Lighting System from the Vault gives me a package I just drop in, throw the relevant variables on the relevant objects, and I'm done. Likewise, Hellfire of RWS let me get my hands on an unreleased GUI lockpicking system.

Things like that have freed me up to do things like implement weaponry that players can hide so NPCs do not see it, and make that a useful thing for players in my next work.

Those types of things combine to make more and more advanced systems available to builders, and more "trickery" to show players. So maybe in an early module you just go kill some orcs, but later works take advantage of systems that allow the orcs to only come out at night. A player will increase their score in recognition of the additional work involved in that.

I think another reason might be that newer builders are better able to push NWN2 graphically. Newer, faster PCs mean a builder can throw in more complicated visuals, since the computers can handle them. Graphics are always going to win some points.