State of page view stats


[Resending as plain text]

I maintain a compacted monthly version of the page view stats, starting
with Jan 2010 (not an official WMF project).
This is to preserve our page view counts for future historians (compare
the Twitter archive at the Library of Congress).
It could also be used to resurrect which was very popular.
Alas, the author has vanished and does not reply to requests, and we don’t have
the source code.

I just applied for storage on dataset1 or dataset2, and will publish the
monthly files (each < 2 GB) asap.

Each day I download 24 hourly files and compact these into one daily file.
Each month I compact the daily files into one monthly file.
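
The daily step could be sketched roughly as follows. This is not the actual aggregator script, only a minimal illustration. It assumes each hourly line has four space-separated fields (wiki code, title, count, bytes sent); the function name merge_hourly and its input shape are made up for this example.

```python
from collections import defaultdict


def merge_hourly(hourly_lines):
    """Merge hourly page count lines into {(wiki, title): {hour: count}}.

    hourly_lines maps an hour (0..23) to an iterable of text lines.
    Assumed line layout (not confirmed here): 'wiki title count bytes'.
    The bytes field is read but dropped, as in the compacted files.
    """
    merged = defaultdict(dict)
    for hour, lines in hourly_lines.items():
        for line in lines:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip lines that do not match the assumed layout
            wiki, title, count, _bytes_sent = parts
            per_hour = merged[(wiki, title)]
            per_hour[hour] = per_hour.get(hour, 0) + int(count)
    return dict(merged)
```

Each title then appears once per day as a key, with its hourly counts attached, instead of once in every hourly file.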

Major space saving: a monthly file with all hourly page views is 8 GB;
with only articles with 5+ page views per month it is even less than 2 GB.

This is because each page title occurs once instead of up to 24*31 times,
and ‘bytes sent’ field is omitted.
All hourly counts are preserved, prefixed by day number and hour number.
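
As an illustration of how one compact line per title could be built, here is a small sketch using the one-character day and hour coding described in the file header quoted in this message. The function names are hypothetical, and the monthly total is simply the sum of the counts (the real script may extrapolate when input hours are missing).

```python
def encode_counts(hourly):
    """Encode {(day, hour): count} as 'AV1,IQ1,...'.

    day is 1..31 and hour is 0..23; each is mapped to one character
    starting at 'A', so day 27 becomes '[' and day 31 becomes '_'.
    """
    parts = []
    for (day, hour), count in sorted(hourly.items()):
        if count <= 0:
            continue  # only hours with requests are stored
        day_char = chr(ord("A") + day - 1)
        hour_char = chr(ord("A") + hour)
        parts.append(f"{day_char}{hour_char}{count}")
    return ",".join(parts)


def monthly_line(wiki, title, hourly):
    """Build one line: wiki code, title, monthly total, hourly counts."""
    total = sum(hourly.values())  # simplification: no extrapolation
    return f"{wiki} {title} {total} {encode_counts(hourly)}"


line = monthly_line(
    "aa.b",
    "File:Broom_icon.svg",
    {(1, 21): 1, (9, 16): 1, (15, 19): 1, (17, 1): 1, (25, 19): 1, (30, 10): 1},
)
print(line)
```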

Here are the first lines of one such file, which also describe the format:

Erik Zachte (on wikibreak till Sep 12)

# Wikimedia article requests (aka page views) for year 2010, month 11
# Each line contains four fields separated by spaces
# - wiki code (subproject.project, see below)
# - article title (encoding from original hourly files is preserved to maintain proper sort sequence)
# - monthly total (possibly extrapolated from available data when hours/days in input were missing)
# - hourly counts (only for hours where indeed article requests occurred)
# Subproject is language code, followed by project code
# Project is b:wikibooks, k:wiktionary, n:wikinews, q:wikiquote, s:wikisource, v:wikiversity, z:wikipedia
# Note: suffix z added by compression script: project wikipedia happens to be sorted last in files, so add this suffix to fix sort order
# To keep hourly counts compact and tidy both day and hour are coded as one character each, as follows:
# Hour 0..23 shown as A..X                            convert to number: ordinal (char) - ordinal ('A')
# Day  1..31 shown as A.._  27=[ 28=\ 29=] 30=^ 31=_  convert to number: ordinal (char) - ordinal ('A') + 1
# Original data source: Wikimedia full (=unsampled) squid logs
# These data have been aggregated from hourly pagecount files at, originally produced by Domas Mituzas
# Daily and monthly aggregator script built by Erik Zachte
# Each day hourly files for previous day are downloaded and merged into one file per day
# Each month daily files are merged into one file per month
# This file contains only lines with monthly page request total greater/equal 5
# Data for all hours of each day were available in input
aa.b File:Broom_icon.svg 6 AV1,IQ1,OT1,QB1,YT1,^K1
aa.b File:Wikimedia.png 7 BO1,BW1,CE1,EV1,LA1,TA1,^A1
aa.b File:Wikipedia-logo-de.png 5 BO1,CE1,EV1,LA1,TA1
aa.b File:Wikiversity-logo.png 7 AB1,BO1,CE1,EV1,LA1,TA1,[C1
aa.b File:Wiktionary-logo-de.png 5 CE1,CM1,EV1,TA1,^N1
aa.b File_talk:Commons-logo.svg 9 CE3,UO3,YE3
aa.b File_talk:Incubator-notext.svg 60
aa.b MediaWiki:Ipb_cant_unblock 5 BO1,JL1,XX1,[F2
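
For anyone who wants to read these files back, a decoder could look roughly like this. It is a sketch based only on the header above, not part of the aggregator script. The name parse_line is made up, and it assumes titles contain no spaces (they appear with underscores in the sample lines). A line without a fourth field is accepted and yields no hourly counts.

```python
import re

# One item is: day char ('A'..'_'), hour char ('A'..'X'), decimal count.
_ITEM_RE = re.compile(r"([A-_])([A-X])(\d+)")


def parse_line(line):
    """Return (wiki, title, total, {(day, hour): count}) for one data line."""
    fields = line.rstrip("\n").split(" ")
    wiki, title, total = fields[0], fields[1], int(fields[2])
    hourly = {}
    if len(fields) > 3 and fields[3]:
        for item in fields[3].split(","):
            match = _ITEM_RE.fullmatch(item)
            if match is None:
                raise ValueError(f"unexpected hourly item: {item!r}")
            day = ord(match.group(1)) - ord("A") + 1   # 1..31
            hour = ord(match.group(2)) - ord("A")      # 0..23
            hourly[(day, hour)] = int(match.group(3))
    return wiki, title, total, hourly


wiki, title, total, hourly = parse_line(
    "aa.b MediaWiki:Ipb_cant_unblock 5 BO1,JL1,XX1,[F2"
)
print(wiki, title, total, hourly)
```

Here the last item, [F2, decodes to day 27, hour 5, with 2 requests, and the four items add up to the monthly total of 5.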

Wikitech-l mailing list