Like many bloggers, I’m interested in whether my blogs are being read. Most blogging platforms, including Blogger and WordPress, have some kind of interface for looking at your blog’s statistics. When I looked at the stats for my site Protists in Singapore, it seemed that there was some kind of pattern to the numbers. I wanted to investigate further, but with the tools available on the WordPress dashboard, there’s not much analysis that one can do.
Fortunately, it is possible to export pageview data from a WordPress blog by sending a query to the WordPress stats server; this forum post explains how. The only drawback is that API keys are being phased out by WordPress.com and are no longer issued with new accounts.
I set up my WordPress account before the change, so I was able to use this method to obtain my pageview statistics. The output is a .csv file with two columns, ‘date’ and ‘views’. I imported this file into the statistical computing environment R for further analysis. What’s great about R is that it’s freely available, and there are packages for many kinds of specialized analysis, plus plenty of tutorial material floating around the web. I referred to a few of these: a time-series analysis intro from the University of Göttingen (pdf), and lecture notes from the University of Bristol.
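Importing the export into R is straightforward. Here’s a minimal sketch: the column names ‘date’ and ‘views’ match the export described above, but the function name and the choice of `frequency = 7` (so that one “cycle” of the series is one week) are my own assumptions.

```r
# Load an exported WordPress stats .csv into a daily time series.
# The file has two columns: 'date' (YYYY-MM-DD) and 'views' (daily totals).
load_views <- function(path) {
  stats <- read.csv(path, stringsAsFactors = FALSE)
  stats$date <- as.Date(stats$date)
  # frequency = 7 treats one week as one seasonal cycle,
  # which suits the weekly pattern examined below
  ts(stats$views, frequency = 7)
}
```

Usage would then be something like `views <- load_views("pageviews.csv")`, where the file name is whatever you saved the export as.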
My pageview data is binned by day: each ‘views’ number is the total views on a given day. Put together, the series covers more than 600 days, i.e. more than a year and a half of the blog’s existence.
As you can see, there is a small but consistent pattern of traffic, with two spikes around week 40 and week 60. The pattern looks noisy but possibly periodic. But first, is there any long-term trend? I smoothed the time series with a filter, i.e. a running average over 21 days, and overlaid this smoothed line on the original plot.
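The smoothing step can be sketched with R’s `stats::filter`. The simulated `views` series here is a placeholder standing in for the real export; the 21-day window matches the running average described above.

```r
# Sketch of a centred 21-day running average; simulated placeholder data.
set.seed(42)
views <- ts(rpois(630, lambda = 20), frequency = 7)
# Each point is replaced by the mean of itself and the 10 days on
# either side; the first and last 10 points become NA.
smoothed <- stats::filter(views, rep(1/21, 21), sides = 2)
plot(views, col = "grey", ylab = "Pageviews per day")
lines(smoothed, col = "red", lwd = 2)
```

`sides = 2` centres the window on each day; `sides = 1` would instead average over the preceding 21 days only.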
Another exploratory tool is the correlogram, i.e. a plot of autocorrelation against lag. When two random variables are correlated, they are not independent: their values are related to each other in some way. In autocorrelation, we deal with pairs of points along our time series, spaced a fixed interval (the “lag”) apart. For example, if we calculate autocorrelation for a lag of 7 days, we are asking whether the values in the series show any dependence on the values from one week before. By plotting the correlation coefficients for a range of lag values, we can identify dependencies in time.
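In R, the correlogram comes from a single call to `acf()`. The data below are simulated with a hypothetical weekend dip (the real series is the export loaded earlier), and `lag.max = 28` covers four weeks of lags.

```r
# Correlogram sketch: acf() on a daily series, out to a 28-day lag.
# Simulated data with an assumed weekend dip stand in for the real export.
set.seed(1)
weekly <- rep(c(5, 5, 30, 30, 30, 30, 30), 90)  # Sat/Sun low, weekdays high
views <- weekly + rpois(length(weekly), lambda = 10)
# A plain vector keeps the lag axis in days rather than weeks
acf(views, lag.max = 28, main = "Correlogram of daily pageviews")
```

The blue dashed lines that `acf()` draws are the approximate 95% bounds at ±1.96/√n, which is what the next paragraph refers to.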
In this correlogram, the blue dashed lines mark the 95% confidence interval; values outside the interval are statistically significant (i.e. the probability of getting such a value by chance alone is only 1 in 20). Some features:

- Autocorrelation at lag = 0 is 1, which is always true: values are perfectly correlated with themselves!
- There is significant ACF at lags of 1 and 2 days, suggesting that the previous day’s traffic has some predictive value for the next day’s web traffic.
- There are significant “peaks” at 1 week, 2 weeks, 3 weeks (etc.), suggesting some sort of weekly pattern in the pageviews.

A look at the raw data shows that weekends have lower traffic than weekdays. Why would people be browsing my protist blog on weekdays?
The answer is revealed by the top search engine terms that bring people here:
| Search term | Views |
|---|---|
| why are protists important | 68 |
| what are protists | 55 |
| flagellar movement in euglena animation | 48 |
People want to know “what are protists”, and want to find pictures of particular species. My suspicion is that most of my visitors are students who have to look up these organisms for school work. That would explain the weekly pattern of web traffic, and also the search terms that bring them here!