Re: Tracking external link statistics over time


Re: Tracking external link statistics over time

Lars Aronsson
On 11/25/2011 10:20 PM, Lars Aronsson wrote:
> This query compiles a list of external links (from the main namespace)
> grouped by website:
>
> select count(*), trim(leading 'www.' from substring_index(substring_index(el_to, '/', 3),'/',-1)) as site
>    from externallinks, page
>    where el_from = page_id and page_namespace=0
>    group by 2 having count(*) > 10 order by 1;

In the last few days, I've learned many things, such as
1) using the rrdb instead of userdb servers,
2) using SGE to submit jobs that require a particular database cluster,
3) getting a list of databases in a cluster, looping over them,
4) including the File, Author, Index, and Portal namespaces.

This is going great. Looping over all of the 516 databases in
the s3 cluster takes 17 minutes, or an average of 2 seconds
per query. This is very reasonable. The 17 databases in s2
take 47 minutes or 151 seconds per query, which is okay.
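
For reference, here is a minimal sketch of such a loop using DBI; the
host name, the login setup and the database-name pattern are simplified
placeholders, not my actual job script:

#!/usr/bin/perl
# Minimal sketch of the per-database loop, using DBI.  The host name,
# credentials file and database-name pattern are simplified placeholders.
use strict;
use warnings;
use DBI;

my $host = 'sql-s3-rr';    # placeholder for one replica server
my $dbh  = DBI->connect(
    "DBI:mysql:host=$host;mysql_read_default_file=$ENV{HOME}/.my.cnf",
    undef, undef, { RaiseError => 1 });

# Placeholder pattern: the replicated wiki databases end in "_p".
my $dbs = $dbh->selectcol_arrayref(q{SHOW DATABASES LIKE '%wiki\_p'});

for my $db (@$dbs) {
    my $rows = $dbh->selectall_arrayref(qq{
        SELECT count(*),
               trim(LEADING 'www.' FROM
                    substring_index(substring_index(el_to, '/', 3), '/', -1)) AS site
          FROM $db.externallinks, $db.page
         WHERE el_from = page_id AND page_namespace = 0
         GROUP BY 2 HAVING count(*) > 10
         ORDER BY 1
    });
    print "== $db ==\n";
    printf "%8d %s\n", @$_ for @$rows;
}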

But enwiki is too large to execute this query within the allowed
30 minutes. Even a simple "select count(*) from externallinks;"
timed out on me. For Wikimedia Commons, the query times
out when I try to include external links found in the File (6)
namespace, but if I only count the article (0) namespace
(where there are very few external links), it produces a fine
result.

Does the toolserver give me any other options to get statistics
on the enwiki database? I could load page.sql and
externallinks.sql into my own MySQL instance, now that
these dumps are produced on a monthly schedule. Is that
my best option? Can I set up my own MySQL for this on the
toolserver? Or should I move the project to my own server?

So far, I output the result to text files. Later I might want
to store it in a database, but this is no hurry.


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se



_______________________________________________
Toolserver-l mailing list ([hidden email])
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette

Re: Tracking external link statistics over time

Jan Luca
Have you read the part in the wiki about "SLOW_OK" [1]?

[1]
https://wiki.toolserver.org/view/Databases#Slow_queries_and_the_query_killer
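
As I understand that page, you tag a long query with a SLOW_OK comment
so the query killer gives it more time. A minimal sketch (the host name
and login setup below are only placeholders):

#!/usr/bin/perl
# Sketch: the same count, tagged with a SLOW_OK comment so the query
# killer gives it more time.  The host name is only a placeholder.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    "DBI:mysql:database=enwiki_p;host=enwiki-p.rrdb.toolserver.org;"
    . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
    undef, undef, { RaiseError => 1 });

my ($n) = $dbh->selectrow_array(
    'SELECT /* SLOW_OK */ count(*) FROM externallinks');
print "$n external links\n";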


Best regards,
Jan


Re: Tracking external link statistics over time

Platonides
In reply to this post by Lars Aronsson
On 29/11/11 18:49, Lars Aronsson wrote:
> In the last few days, I've learned many things, such as
> 1) using the rrdb instead of userdb servers,
If you were to store the data in the server, you would need a userdb server.

(...)
>
> Does the toolserver give me any other options to get statistics
> on the enwiki database?

You could run it in ranges, and then merge all of the results.
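
A sketch of what I mean, slicing on page_id and merging the partial
counts in a hash (host name, login setup and slice size are arbitrary
placeholders):

#!/usr/bin/perl
# Sketch of the range idea: run the grouping query in page_id slices so
# each piece finishes within the time limit, then merge the partial
# counts in a hash.  Host, credentials and slice size are arbitrary.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    "DBI:mysql:database=enwiki_p;host=enwiki-p.rrdb.toolserver.org;"
    . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
    undef, undef, { RaiseError => 1 });

my ($max_id) = $dbh->selectrow_array('SELECT max(page_id) FROM page');
my $step = 1_000_000;
my %count;

for (my $lo = 0; $lo <= $max_id; $lo += $step) {
    my $rows = $dbh->selectall_arrayref(q{
        SELECT trim(LEADING 'www.' FROM
                    substring_index(substring_index(el_to, '/', 3), '/', -1)) AS site,
               count(*)
          FROM externallinks JOIN page ON el_from = page_id
         WHERE page_namespace = 0
           AND page_id BETWEEN ? AND ?
         GROUP BY 1
    }, undef, $lo, $lo + $step - 1);
    $count{ $_->[0] } += $_->[1] for @$rows;
}

for my $site (sort keys %count) {
    printf "%8d %s\n", $count{$site}, $site if $count{$site} >= 10;
}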


> I could load page.sql and
> externallinks.sql into my own MySQL instance, now that
> these dumps are produced on a monthly schedule. Is that
> my best option? Can I set up my own MySQL for this on the
> toolserver? Or should I move the project to my own server?

That's the wrong solution. It might be faster to process the SQL file
with string tools (unlikely, since you have to match page ids with
namespaces), but using a user-level mysqld for that is complete overkill.


Re: Tracking external link statistics over time

Lars Aronsson
On 11/29/2011 11:18 PM, Platonides wrote:
> It might be faster to process the sql file
> with string tools (unlikely, having to match pageids with namespaces)
> but using a user level mysqld for that is completely overkill.

I succeeded in doing this with 120 lines of Perl. First I parse page.sql,
where page_id and page_namespace are the first two columns, setting
up a bitmap to indicate which page_ids are in the relevant
namespaces (article, File, Author, Creator, Index, Museum, Portal).

Then I parse externallinks.sql, where el_from (= page_id) and el_to
(= the URL) are the first two columns. I check if el_from is in my
bitmap and cut out the domain name from the URL, counting its
occurrences in a hash. Finally, I sort the hash keys to get the
output in alphabetic order, and print only those domains that have
10 links or more (the long tail of less popular domains is of no
interest to me).
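
A condensed sketch of that approach (not the actual 120-line script; the
namespace numbers below are wiki-specific and only illustrative):

#!/usr/bin/perl
# Condensed sketch of the approach described above, not the actual script.
# Usage: dumplinks-sketch.pl xxwiki-page.sql xxwiki-externallinks.sql
use strict;
use warnings;

# Namespaces to include; the numeric ids differ per wiki, so adjust.
my %want_ns = map { $_ => 1 } (0, 6, 100, 102, 104, 106);
my ($page_sql, $links_sql) = @ARGV;

# Pass 1: page.sql -- page_id and page_namespace are the first two columns.
# Mark the wanted page_ids in a bit vector to keep the memory footprint small.
my $bitmap = '';
open my $pg, '<', $page_sql or die "$page_sql: $!";
while (<$pg>) {
    next unless /^INSERT INTO/;
    while (/\((\d+),(\d+),'/g) {
        vec($bitmap, $1, 1) = 1 if $want_ns{$2};
    }
}
close $pg;

# Pass 2: externallinks.sql -- el_from (= page_id) and el_to (= the URL)
# are the first two columns.  Count the domain of each URL in a hash.
my %count;
open my $el, '<', $links_sql or die "$links_sql: $!";
while (<$el>) {
    next unless /^INSERT INTO/;
    while (/\((\d+),'((?:[^'\\]|\\.)*)'/g) {
        my ($from, $url) = ($1, $2);
        next unless vec($bitmap, $from, 1);
        next unless $url =~ m{^[a-z]+://([^/\s']+)}i;   # scheme://host only
        (my $site = lc $1) =~ s/^www\.//;
        $count{$site}++;
    }
}
close $el;

# Domains with at least 10 links, in alphabetic order.
for my $site (sort keys %count) {
    printf "%8d %s\n", $count{$site}, $site if $count{$site} >= 10;
}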

For both commonswiki and enwiki, this runs in 12 minutes and the
RAM footprint on wolfsbane stays under 250 MB. It takes far longer
to download the database dumps to the toolserver. It would make
sense to run this on dumps.wikimedia.org, as part of the dump
process.

For the wikis I have tested, the output looks very similar to
what I got from looking at the replicated database, except that
all external links to WMF sites seem to have been removed from
the SQL dumps.

Based on the database dump of frwiki-20111123, I got:

  350570 toolserver.org
   90505 culture.gouv.fr
   52081 legifrance.gouv.fr
   51837 imdb.fr
   50189 akas.imdb.com
   46754 books.google.fr
   38654 ncbi.nlm.nih.gov
   38184 recensement.insee.fr
   36028 catalogueoflife.org
   35382 insee.fr

Based on the live, replicated frwiki_p database:

  352260 toolserver.org
  101619 commons.wikimedia.org
   90281 culture.gouv.fr
   82110 en.wikipedia.org
   52161 legifrance.gouv.fr
   52026 imdb.fr
   50379 akas.imdb.com
   46860 books.google.fr
   38715 ncbi.nlm.nih.gov
   38197 recensement.insee.fr


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se




Re: Tracking external link statistics over time

Lars Aronsson
In reply to this post by Jan Luca
On 11/29/2011 10:11 PM, Jan Luca wrote:
> Have you read the part in the wiki about "SLOW_OK" [1]?
>
> [1]
> https://wiki.toolserver.org/view/Databases#Slow_queries_and_the_query_killer

Yes, I have. But I would still risk being shut down when replag
is high. That's a factor I have no control over, and I don't
think I could build a reliable service on that. Now I have a
solution that parses the SQL dumps, which I can use for the largest
databases (commons, enwiki) and for historic dumps, and the live
database queries work fine for all other databases.


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se




Re: Tracking external link statistics over time

Platonides
In reply to this post by Lars Aronsson
TL;DR: Look for protocol-relative links.

> For those wikis I have tested, the output looks very similar to
> what I got from looking at the replicated database, except that
> all external links to WMF sites seem to have been removed from
> the SQL dumps.


That's interesting. There are such links in the dumps:

gzip -d <
/mnt/user-store/dumps/frwiki-group-writable/frwiki-20111123-externallinks.sql.gz
| grep -oP ".{17}commons.wikimedia.org[^']*" | more

> (572952,'http://commons.wikimedia.org/wiki/Image:Georges_Seurat_043.jpg
> (572952,'http://commons.wikimedia.org/wiki/Image:Tour_Eiffel_exposition_universelle_1889.jpg
> (603411,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Image:Nokia_N-Gage.jpg
> (150418,'http://commons.wikimedia.org/wiki/Category:The_Yorck_Project
> (564205,'http://commons.wikimedia.org/wiki/Category:Icons
> (446364,'http://commons.wikimedia.org/wiki/Image:Viaduc_Millau_France.PNG
> (446364,'http://commons.wikimedia.org/wiki/Image:La2-demis-france.png
> (339872,'http://commons.wikimedia.org/wiki/Category:Flags_of_municipalities_in_the_canton_of_Vaud
> (339872,'http://commons.wikimedia.org/wiki/Category:Flags_of_municipalities_in_the_canton_of_Geneva
> (339872,'http://commons.wikimedia.org/wiki/Category:Maps_of_municipalities_in_the_canton_of_Geneva
> (339872,'http://commons.wikimedia.org/wiki/Accueil
> (518924,'http://commons.wikimedia.org/wiki/Special:Search?search=tintagel
> (488187,'http://commons.wikimedia.org/skins-1.5/common/metadata.js
> (342102,'http://commons.wikimedia.org/wiki/Category:Photography
> (547524,'http://commons.wikimedia.org/wiki/Special:Contributions/Pereubu
> (601588,'http://commons.wikimedia.org/wiki/Main_Page
> (601588,'http://commons.wikimedia.org/w/index.php?title=Image:Uz%C3%A8s_7.jpg&action=edit
> (366318,'http://commons.wikimedia.org/wiki/Category:Chinese_stroke_order
> (208246,'http://commons.wikimedia.org/wiki/Category:SVG_flags
> (208246,'http://commons.wikimedia.org/wiki/Category:Rolleiflex
> (350993,'http://commons.wikimedia.org/wiki/Paris
> (232033,'http://commons.wikimedia.org/wiki/Category:Bolivia
> (488180,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Image:IMac_G3_slot_loading.jpg
> (488180,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Image:Nokia_N-Gage.jpg
> (209441,'http://commons.wikimedia.org/wiki/Accueil
etc.

Although few of them seem to come from NS_MAIN:

sql frwiki_p "select page_namespace from page where page_id IN (0$(gzip
-d < frwiki-20111123-externallinks.sql.gz | grep --line-buffered -oP
"([0-9]+)(?=,'http://commons.wikimedia.org[^']*)" | uniq | head -500 |
sed s/^/,/g ))"

The trick seems to be that most of them will be present as
//commons.wikimedia.org, not http://commons.wikimedia.org.

Talk pages will have more URL-copying, while articles will use templates
like {{Autres projets}}.

> sql frwiki_p "select page_namespace from page where page_id IN (0$(gzip -d < frwiki-20111123-externallinks.sql.gz | grep --line-buffered -oP "([0-9]+)(?=,'//commons.wikimedia.org[^']*)" | uniq | head -500 | sed s/^/,/g ))"
does show many NS_MAIN entries.

I suspect you are only taking into account http: and https: links in
your Perl script, while your SQL query counts all of them.

> select count(*) from externallinks join page on (el_from = page_id) where el_to like '//commons.wikimedia.org/%' and page_namespace = 0;
> +----------+
> | count(*) |
> +----------+
> |   101439 |
> +----------+


> select count(*) from externallinks join page on (el_from = page_id) where el_to like 'http://commons.wikimedia.org/%' and page_namespace = 0;
> +----------+
> | count(*) |
> +----------+
> |      399 |
> +----------+

101439 + 399 = 101838 ~= 101619 which is your db result.



Re: Tracking external link statistics over time

Lars Aronsson
On 11/30/2011 06:13 PM, Platonides wrote:
> I suspect you are only taking into accoutn http: and https: links with
> your perl script, while all of them with your sql query.

Yes, the protocol-relative links ("//commons") were the entire
explanation. Thanks! I do count ftp:, http:, https:, and news:
and now also the naked slashes ("//").

It seems fr.wikipedia uses these more heavily than the other wikis.

My code is ~la2/dumplinks.pl for those who are logged in.
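
For those without a login, here is a sketch of the kind of URL pattern
this needs; it is not the exact code from dumplinks.pl:

#!/usr/bin/perl
# Sketch: cut the domain out of el_to, accepting ftp:, http:, https:,
# protocol-relative ("//") and news: links.  Not the exact dumplinks.pl code.
use strict;
use warnings;

sub site_of {
    my ($url) = @_;
    if ($url =~ m{^(?:(?:ftp|https?)://|//)([^/\s']+)}i) {
        (my $site = lc $1) =~ s/^www\.//;
        return $site;
    }
    return lc $1 if $url =~ m{^news:([^/\s']+)}i;   # news: has no "//" part
    return undef;
}

print site_of('//commons.wikimedia.org/wiki/Accueil'), "\n";   # commons.wikimedia.org
print site_of('http://www.example.org/page'), "\n";            # example.org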


--
   Lars Aronsson ([hidden email])
   Aronsson Datateknik - http://aronsson.se


