About Statistical Data from Query of Multiple Duplicates

About Statistical Data from Query of Multiple Duplicates

Marco Antonio
Hi folks.

This is my first time using this mailing list, so if this is not the right place to ask this kind of question, please let me know how I should proceed.


Question
I have downloaded a lot of mathematics-related pages from the MediaWiki API. Some of them are duplicates of the same article, differing only in their titles: a different way of naming the same subject, a single letter that differs between them, and so on.

One example that I can show you right away is: 
  • "Adição_de_segmentos", and
  • "Adição_de_Segmentos", 
both written in Portuguese (my native language). The only difference between the titles is the case of the letter "s". As I was testing the URLs, it seems they are both the same article, with the different links redirecting to the official title.

With those kinds of duplicates in mind, when I started to analyse the view statistics of a specific article, I expected to receive data with the following structure:

  • The old (deprecated) titles would hold views until some day X, after which there would be nothing further to count and show;
  • The up-to-date titles would have data starting from day X up to the last day I want to analyse.

Nothing too crazy to expect from the database. But that is not what happened. Plenty of articles are still receiving views even though they all redirect to another article. At first I thought people were reaching the article's content through different links available on search engines such as Google, so all views must be independent from one another. The problem is, after testing different Google searches for the same Wikipedia article, I can only find the up-to-date articles, not the old ones.

  1. How is this possible? 
  2. More importantly for me, are all accesses to the deprecated articles made by bots, or through old links still available on old pages of other sites? 
  3. Are the view counts for the different titles of an article independent? 
  4. If so, how could I track all possible accesses to a particular subject in order to create an effective study of it?

Anyway, this is (if I remember well) the fourth time I'm trying to get a proper answer to my question, and I'm hoping I'll get it soon.

Thanks!


Marco Antonio


Undergraduate student in Pure Mathematics at USP | Science Communicator





_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: About Statistical Data from Query of Multiple Duplicates

Brian Wolff
There are a variety of reasons why someone might view a redirected title:
* following a link that still uses the old title (either internal or external);
* typing the old name exactly into the search bar;
* typing the old name into the address bar.

--
Brian

On Thursday, February 6, 2020, Marco Antonio <[hidden email]> wrote:

Re: About Statistical Data from Query of Multiple Duplicates

Marco Antonio
This more or less answers my first and third questions, but the second and fourth are still an issue for me.




On Fri, Feb 7, 2020 at 12:04 AM Brian Wolff <[hidden email]> wrote:

Re: About Statistical Data from Query of Multiple Duplicates

Brian Wolff
2. No.
4. You would have to figure out all the redirects and sum them. The API allows you to fetch the list of redirects. Another option is the redirect table available from the database dumps at download.wikimedia.org.

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews/Redirects might also be helpful (it's a bit old, but I assume it's still accurate).
--
Bawolff
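
Putting that suggestion together, a rough Python sketch of the workflow might look like the following. The endpoint paths and parameter names are my reading of the public Wikimedia APIs (the Action API's prop=redirects module and the Analytics per-article pageviews REST endpoint), so verify them against the current documentation; the actual HTTP fetching is left out.

```python
from urllib.parse import quote

# Assumed endpoints -- check against the current API documentation.
ACTION_API = "https://pt.wikipedia.org/w/api.php"
PAGEVIEWS_TEMPLATE = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "pt.wikipedia/all-access/user/{title}/daily/{start}/{end}"
)

def redirect_query_params(title):
    # Action API parameters (sent to ACTION_API) that list every redirect
    # pointing at `title`; rdlimit=max asks for the full list.
    return {
        "action": "query",
        "titles": title,
        "prop": "redirects",
        "rdlimit": "max",
        "format": "json",
    }

def pageviews_url(title, start, end):
    # Per-article pageviews REST URL; the title must be percent-encoded,
    # and start/end are YYYYMMDD day strings.
    return PAGEVIEWS_TEMPLATE.format(
        title=quote(title, safe=""), start=start, end=end
    )

def total_views(responses):
    # Sum the daily `views` fields across the pageview responses fetched
    # for the canonical title and each of its redirects.
    return sum(
        item["views"]
        for resp in responses
        for item in resp.get("items", [])
    )
```

The idea is simply: resolve the canonical title, list its redirects, fetch pageviews for every title in that set, and sum them to get the per-subject total the fourth question asks about.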

Re: About Statistical Data from Query of Multiple Duplicates

Marco Antonio
I see.
Thanks for your answer and suggestions.

I think it would be beneficial (in terms of the structure of the data set), and more reasonable, to keep a separate count for the content itself alongside the counts for the individual article titles. That could be useful, for example, for outsiders like us to measure how people are reaching Wikipedia articles, whether through a Google search or through old links, with a margin of error of course.

Anyway, thanks again!




On Fri, Feb 7, 2020 at 4:43 AM Brian Wolff <[hidden email]> wrote: