[RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

MZMcBride-2
Gabriel Wicke wrote:
>A heavily-used content API will perform better and use less resources
>when it is cacheable. This will become more important over time, so I
>believe it is worth spending a small amount of effort on now.

Sure, I think everyone agrees that a heavily used Web resource will
perform better with caching. I'm just not sure futzing around with path
names is the best way to try to ensure sustainable cacheability.

Is there a breakdown of what in a typical MediaWiki API request takes the
most time or uses the most resources (i.e., profiling a local request)? I
imagine there are multiple caching opportunities at other layers that
don't rely on path name, but it's difficult to say where you might see the
most gains without further data.

MZMcBride



_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Tim Starling-2
In reply to this post by Jon Robson
On 17/09/13 13:59, Jon Robson wrote:
> I would suggest taking a look at the number of 404s caused by people trying
> to access pages without the wiki prefix.... This would be interesting data
> to go alongside this interesting proposal...

There are lots of different sorts of 404s, so it's necessary to do
some filtering. For example:

* double-slashes, due to bug 52253
* sitemap.xml
* Apple touch icons
* bullet.gif in various directories
* vulnerability scanning, e.g. xmlrpc.php
* BlueCoat verify/notify, as described in
<http://www.webmasterworld.com/search_engine_spiders/3859463.htm>
* Serial numbers like http://en.wikipedia.org/B008NAYASM .

I filtered out everything with a dot or slash in the prospective
article title, as well as the BlueCoat URLs and the UAs responsible
for serial number URLs. To simplify analysis, I took log lines from
the English Wikipedia only.

Most of the remaining log entries were search engine crawlers, so I
took those out too.
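
The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the LogEntry shape, the BlueCoat path prefixes, the serial-number user agent and the crawler markers are all hypothetical placeholders.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    path: str        # request path, e.g. "/abolition"
    user_agent: str


# All three constants are placeholders, not values taken from the post.
BLUECOAT_PREFIXES = ("verify-", "notify-")
SERIAL_NUMBER_UAS = {"ExampleSerialClient/1.0"}
CRAWLER_MARKERS = ("bot", "spider", "crawler")


def is_candidate(entry: LogEntry) -> bool:
    """Return True if a 404 looks like someone asking for an article in the root."""
    title = entry.path[1:]  # drop only the leading slash
    # A dot or slash rules out sitemap.xml, icons, bullet.gif, xmlrpc.php
    # and double-slash URLs in one go.
    if not title or "." in title or "/" in title:
        return False
    if title.startswith(BLUECOAT_PREFIXES):
        return False
    if entry.user_agent in SERIAL_NUMBER_UAS:
        return False
    ua = entry.user_agent.lower()
    return not any(marker in ua for marker in CRAWLER_MARKERS)
```

Applying is_candidate to each sampled 404 log line would leave only the entries that plausibly come from a person typing an article name directly after the host name.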

The result was 149 log entries at a 1/1000 sample rate, for the week
of September 8-14, implying a request rate of about 639,000 per month.
This is about 0.006% of the English Wikipedia's page view rate.
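
The extrapolation can be checked with a little arithmetic. A minimal sketch, assuming a 30-day month (the post does not say which month length was used):

```python
SAMPLE_RATE = 1000      # logs are sampled at 1/1000
SAMPLED_ENTRIES = 149   # entries left after filtering, for one week
DAYS_IN_WEEK = 7
DAYS_IN_MONTH = 30      # assumption, not stated in the post

# Scale the sample back up to the full request stream.
weekly_requests = SAMPLED_ENTRIES * SAMPLE_RATE
monthly_requests = weekly_requests * DAYS_IN_MONTH / DAYS_IN_WEEK

print(weekly_requests)          # 149000
print(round(monthly_requests))  # 638571
```

Rounded to the nearest thousand, that matches the figure of about 639,000 per month.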

The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html

-- Tim Starling



Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

C. Scott Ananian
Note also that zhwiki (and others?) profitably uses the first part of the
path to do variant selection.

https://zh.wikipedia.org/wiki/User:Cscott uses the wiki default variant (if
logged in, uses the variant from the user's preferences)
https://zh.wikipedia.org/zh-hans/User:Cscott
https://zh.wikipedia.org/zh-hk/User:Cscott
etc. use the specified variant.
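
A rough sketch of how the first path segment could drive variant selection. This is only an illustration of the idea, not MediaWiki's actual routing code; the function name is made up, and the variant set lists only the two codes from the examples above.

```python
from typing import Optional, Tuple

# Only the variants shown in the examples; a real wiki would have more.
KNOWN_VARIANTS = {"zh-hans", "zh-hk"}


def parse_article_path(path: str) -> Tuple[Optional[str], str]:
    """Split "/<prefix>/<title>" into (variant, title).

    A variant of None means "use the wiki default, or the logged-in
    user's preference".
    """
    prefix, _, title = path.lstrip("/").partition("/")
    if prefix == "wiki":
        return None, title
    if prefix in KNOWN_VARIANTS:
        return prefix, title
    raise ValueError(f"unrecognised path prefix: {prefix!r}")
```

With this shape, /wiki/User:Cscott yields (None, "User:Cscott") and /zh-hk/User:Cscott yields ("zh-hk", "User:Cscott"). Dropping the /wiki/ prefix entirely would make the first segment ambiguous between a variant code and an article title.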

I have a dream to eventually enable
https://en.wikipedia.org/en-gb/Football
in a similar fashion.
  --scott

Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Jon Robson
In reply to this post by Tim Starling-2
Thanks Tim for running those numbers. That seems to suggest the URL
structure works for most cases.

--
Jon Robson
http://jonrobson.me.uk
@rakugojon


Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Matthew Flaschen-2
In reply to this post by Daniel Friesen-2
On 09/17/2013 05:59 AM, Daniel Friesen wrote:
> Side topic https://en.wiktionary.org/w/r/t is messed up: " To check for
> "r/t" on Wikipedia, see: //en.wikipedia.org/wiki/r/t
> <https://en.wikipedia.org/wiki/r/t>"

Good catch, filed: https://bugzilla.wikimedia.org/show_bug.cgi?id=54357

Matt Flaschen


Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Tim Starling-2
In reply to this post by Jon Robson
On 20/09/13 03:04, Jon Robson wrote:
> Thanks Tim for running those numbers. That seems to suggest the URL
> structure works for most cases.

I think the request rate for actual articles in the root is very, very
low. And if you look at the paste I gave earlier:

http://paste.tstarling.com/p/uhtFqg.html

there's reason to think that the amount of traffic that comes from
naive readers typing URLs and expecting an article is much smaller
than even 149k per week. A naive user would be more likely to type a
URL starting with a lower-case letter, and if you take those entries,
and filter out the obvious client bugs and typos, that leaves only 39
log entries. If we filter out some more log entries that are unlikely
search terms for Wikipedia articles ("enregistrement-audio-musique",
"is", "unlimited_data_plan", etc.), that leaves maybe 30.
http://paste.tstarling.com/p/KWuHif.html

Of these, only 12 actually correspond to Wikipedia articles or redirects:

abolition
addicting_games
apple_inc
carnaval
dreamshade
facade
girls
insidious
karthik
online_coupons
snam
walkabout

So the number of naive readers actually helped by our 404 Refresh to
/wiki/ is probably closer to 12k per week than 149k per week.

Personally, I think the refresh is annoying, since it makes it much
more difficult to correct typos in manually-typed URLs. If you
actually meant to type some non-article URL like a CSS resource, and
make a typo which causes it to hit the refresh, the URL you typed is
erased from your browser's address bar and history, making correction
of the typo much more difficult. Maybe we should just include a link
to the search page, rather than redirect or refresh.
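
To make the two behaviours concrete, here is a minimal sketch of a 404 response builder. It is not the production 404 handler: the function name, the refresh delay, the page text and the use of Special:Search as the link target are all assumptions for illustration.

```python
from html import escape
from urllib.parse import quote


def build_404(path: str, use_refresh: bool = False, delay_seconds: int = 5):
    """Return (status, headers, body) for a not-found request."""
    title = path.lstrip("/")
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if use_refresh:
        # The behaviour complained about above: the client is sent on to
        # /wiki/<title> and the typed URL disappears from the address bar.
        headers["Refresh"] = f"{delay_seconds}; url=/wiki/{quote(title)}"
    # The alternative: leave the URL alone and just offer a search link.
    search_url = "/wiki/Special:Search?search=" + quote(title)
    body = (
        "<p>Page not found.</p>"
        f'<p><a href="{escape(search_url)}">Search for "{escape(title)}"</a></p>'
    )
    return 404, headers, body
```

With use_refresh left off, the mistyped URL stays in the address bar and history, so it can still be corrected by hand.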

-- Tim Starling



Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

MZMcBride-2
Tim Starling wrote:
>Personally, I think the refresh is annoying, since it makes it much
>more difficult to correct typos in manually-typed URLs. If you
>actually meant to type some non-article URL like a CSS resource, and
>make a typo which causes it to hit the refresh, the URL you typed is
>erased from your browser's address bar and history, making correction
>of the typo much more difficult. Maybe we should just include a link
>to the search page, rather than redirect or refresh.

Mark Ryan redesigned the 404 page in 2009 and specifically removed the
meta refresh tag (cf. <https://bugs.wikimedia.org/17316#c0>).

The redesigned page eventually got deployed, but the client-side refresh
very sneakily moved from the HTML output to a Refresh header (cf.
<https://bugs.wikimedia.org/35052#c0>).

Neither bug is resolved, if anyone is interested in helping out. :-)

MZMcBride




Re: [RFC]: Clean URLs- dropping /w/index.php?title=..

Gabriel Wicke-3
In reply to this post by Jon Robson
On 09/19/2013 10:04 AM, Jon Robson wrote:
> Thanks Tim for running those numbers. That seems to suggest the URL
> structure works for most cases.

It certainly confirms that search engines link to working links, and
users typing URLs manually are rare and (eventually) learn to prefix
/wiki/. I am not that convinced that the current number of 404s says
that much about the user-friendliness or aesthetics of different URL
schemes, but that is beside the point (and subjective).

I see /w/index.php?title=.. as the more important clean-up, which is why
the RFC is only about that aspect.

Gabriel


Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Jon Robson
In reply to this post by Tim Starling-2
On 19 Sep 2013 18:23, "Tim Starling" <[hidden email]> wrote:
>
> On 20/09/13 03:04, Jon Robson wrote:
> > Thanks Tim for running those numbers. That seems to suggest the URL
> > structure works for most cases.
>
> I think the request rate for actual articles in the root is very, very
> low.

I agree. Sorry, I guess my message wasn't very clear: I meant that the
"existing" URL structure works :)


Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Gryllida
In reply to this post by Tyler Romeo
On Tue, 17 Sep 2013, at 7:51, Tyler Romeo wrote:
> On Mon, Sep 16, 2013 at 6:12 PM, Gabriel Wicke <[hidden email]> wrote:
>
> > * use simple action urls
> >   https://en.wikipedia.org/Foo?action=history instead of
> >   https://en.wikipedia.org/w/index.php?title=Foo&action=history
> >
>
> This already works.
>

I would be concerned about whether this feature works properly in wikilinks: [[Main Page?action=history|Foo]] produces a broken red link.


Re: [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

Matthew Flaschen-2
On 09/27/2013 06:03 AM, Gryllida wrote:
> I would be concerned about whether this feature works properly in wikilinks: [[Main Page?action=history|Foo]] produces a broken red link.

So does:

[[/w/index.php?title=Main Page|Foo]]

Neither would be expected to work.  Anything to the left of the pipe in
your example is considered a page title.  I don't think anything about
wikilink parsing (or any parsing) is proposed to change.
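
As a toy illustration of that rule (this is not MediaWiki's parser, just a sketch with a made-up helper name), splitting a wikilink at the first pipe shows why the query string ends up inside the title:

```python
from typing import Tuple


def split_wikilink(link: str) -> Tuple[str, str]:
    """Split "[[target|label]]" into (target, label).

    Everything before the first pipe is treated as the page title;
    without a pipe, the title doubles as the label.
    """
    inner = link.strip()
    if inner.startswith("[[") and inner.endswith("]]"):
        inner = inner[2:-2]
    target, sep, label = inner.partition("|")
    return (target, label) if sep else (target, target)


print(split_wikilink("[[Main Page?action=history|Foo]]"))
# ('Main Page?action=history', 'Foo')
```

The target is looked up as a page called "Main Page?action=history", which does not exist, hence the red link.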

Matt Flaschen


