need a URL encoding expert

need a URL encoding expert

Ryan Kaldari-2
Could someone knowledgeable about URL encoding take a look at this pull
request? Thanks!
https://github.com/wikimedia/DeadlinkChecker/pull/26/files
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: need a URL encoding expert

bawolff
HTML4 recommended that people use ; instead of & to separate URL
parameters, to avoid conflicts with entity references. However, AFAIK
most web servers don't support this (I think it's mostly some Java
things that do). See
https://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2
Modern HTML5 abandoned this recommendation, AFAIK.
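
As a rough illustration of what "supporting" the ; separator means on
the server side, here is a minimal Python sketch of a parser that accepts
either separator. The function name parse_pairs and the sample query
strings are made up for this example; it is not taken from the pull
request or from any particular server.

```python
import re
from typing import List, Tuple
from urllib.parse import unquote_plus


def parse_pairs(query: str) -> List[Tuple[str, str]]:
    """Split a query string on either '&' or ';' and decode each pair."""
    pairs = []
    for chunk in re.split(r"[&;]", query):
        if not chunk:
            continue  # skip empty chunks, e.g. from "a=1&&b=2"
        key, _, value = chunk.partition("=")
        # Decode only after splitting, so an encoded %3B stays part of a value.
        pairs.append((unquote_plus(key), unquote_plus(value)))
    return pairs


print(parse_pairs("a=1&b=2"))  # the usual '&' form
print(parse_pairs("a=1;b=2"))  # the ';' form suggested by HTML 4
```

Both calls yield the same list of pairs. A server that only splits on &
would instead see a single parameter a with the value "1;b=2".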

--
Brian

On Fri, Oct 20, 2017 at 8:07 PM, Ryan Kaldari <[hidden email]> wrote:
> Could someone knowledgeable about URL encoding take a look at this pull
> request? Thanks!
> https://github.com/wikimedia/DeadlinkChecker/pull/26/files

Re: need a URL encoding expert

Ryan Kaldari-2
The main issue I'm not sure about here is the use of ; as a query string
initiator (rather than a query string parameter separator). This use of
semicolons is completely non-standard, AFAIK, but it looks like there are
some web servers that are actually using it this way. Comments at the pull
request itself would be most useful (rather than by email).

On Fri, Oct 20, 2017 at 1:17 PM, bawolff <[hidden email]> wrote:

> HTML4 recommended that people use ; instead of & to separate URL
> parameters, to avoid conflicts with entity references. However, AFAIK
> most web servers don't support this (I think it's mostly some Java
> things that do). See
> https://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2
> Modern HTML5 abandoned this recommendation, AFAIK.

Re: need a URL encoding expert

Stas Malyshev
Hi!

> The main issue I'm not sure about here is the use of ; as a query string
> initiator (rather than a query string parameter separator). This use of
> semicolons is completely non-standard, AFAIK, but it looks like there are
> some web servers that are actually using it this way. Comments at the pull
> request itself would be most useful (rather than by email).

Generally, servers are free to parse the local part of the URL as
they like. After all, many servers using REST treat something like
/user/2/name as essentially a query string, even though / is a path
separator. Nothing prevents other servers from adopting a scheme like
user;2;name instead, or any other way of parsing the local path.

https://tools.ietf.org/html/rfc3986#section-3.4 clearly states that
the query is delimited by "?". This means the ";" parts of these URLs
are path components, as per the RFC (section 3.3):

   Aside from dot-segments in hierarchical paths, a path segment is
   considered opaque by the generic syntax.  URI producing applications
   often use the reserved characters allowed in a segment to delimit
   scheme-specific or dereference-handler-specific subcomponents.  For
   example, the semicolon (";") and equals ("=") reserved characters are
   often used to delimit parameters and parameter values applicable to
   that segment.  The comma (",") reserved character is often used for
   similar purposes.  For example, one URI producer might use a segment
   such as "name;v=1.1" to indicate a reference to version 1.1 of
   "name", whereas another might use a segment such as "name,1.1" to
   indicate the same.

So, a specific application can treat path components the same way as
query components, but they are still path components. My reading of the
RFC is also that ";" is a reserved character, and as such should not be
URL-encoded. Indeed, the path BNF includes sub-delims without encoding,
and sub-delims include ";". However, I am not sure I understand the
other part of the patch, where it plays with the query string.
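
To make this concrete, here is a small Python sketch using the standard
urllib.parse module. The URL is a made-up example shaped like the ones
under discussion, not one taken from the pull request.

```python
from urllib.parse import quote, urlparse, urlsplit

# Hypothetical URL for illustration only.
url = "http://example.org/display.w3p;page=0;query=Id%3A1"

# urlsplit() follows the RFC 3986 component split: no "?" means no query,
# so everything after the host is path.
parts = urlsplit(url)
print(parts.path)         # /display.w3p;page=0;query=Id%3A1
print(repr(parts.query))  # ''

# urlparse() additionally splits ";params" off the last path segment
# (the older RFC 2396 style); the query is still empty.
parsed = urlparse(url)
print(parsed.path)    # /display.w3p
print(parsed.params)  # page=0;query=Id%3A1

# quote() percent-encodes ";" and "=" unless they are declared safe.
print(quote("/name;v=1.1"))              # /name%3Bv%3D1.1
print(quote("/name;v=1.1", safe="/;="))  # /name;v=1.1
```

The last two lines show the encoding question: a checker that runs the
path through a default encoder will turn ";" into %3B, which a server
using ";" as a delimiter may then treat as a different resource.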

--
Stas Malyshev
[hidden email]


Re: need a URL encoding expert

Gergo Tisza
In reply to this post by Ryan Kaldari-2
It would be easier to comment if either the pull request or this thread
explained what the patch is trying to do (XY problem etc) but in general:

- the URI spec defines ? as the separator character between the path and
the query part, but it doesn't say much about what the path and query parts
*are* (except that the path specifies a resource in a hierarchical manner,
and the query specifies it in a non-hierarchical manner). So a web
application using something other than ? to separate the query part will
violate the spirit of the spec but probably won't experience any problems.
I haven't heard of anything doing that before; this display.w3p thing seems
like some super-obscure web framework only used by Australian and
Singaporean government web pages.

- the URI spec does not say anything about the contents of the query part.
It specifies :/?#[]@!$&'()*+,;= as the set of reserved characters, so those
(apart from #, which starts the fragment) are the only sane choices for
separating sub-arguments (as everything else might get percent-encoded by
the browser, but reserved characters are guaranteed to be preserved). The
choice of & as argument separator and = as key-value separator is a common
convention, used by some standards such as
application/x-www-form-urlencoded, but a web application is free to choose
something else in theory. In practice I think only very old and fringe ones
do.

- the URI spec allows parameters in path segments (sometimes called matrix
parameters). https://www.w3.org/DesignIssues/MatrixURIs.html has some
examples. The older URI RFC, 2396, prescribed the semicolon as the
parameter separator; RFC 3986 allows any reserved character, but in
practice it's usually a semicolon. These are used a fair bit in RESTish
URLs; Angular uses them, for example. When only the last path segment has
parameters, the URL has the same structure as the one in the pull request.
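
A minimal Python sketch of how an application might read matrix
parameters from each path segment, assuming the common semicolon
convention. The function name matrix_params and both URLs are
hypothetical, chosen only to illustrate the structure described above.

```python
from urllib.parse import unquote, urlsplit


def matrix_params(url):
    """Return [(segment_name, {param: value}), ...] for each path segment."""
    result = []
    for segment in urlsplit(url).path.split("/"):
        if not segment:
            continue  # skip the empty piece before the leading "/"
        name, *params = segment.split(";")
        pairs = {}
        for param in params:
            key, _, value = param.partition("=")
            # Decode after splitting so an encoded %3B is not a separator.
            pairs[unquote(key)] = unquote(value)
        result.append((unquote(name), pairs))
    return result


# Parameters on several segments:
print(matrix_params("http://example.org/map;lat=50;long=20/marker;color=red"))
# Parameters only on the last segment, as in the URLs under discussion:
print(matrix_params("http://example.org/display.w3p;page=0;query=foo"))
```

In the second call the whole ";page=0;query=foo" tail is attached to the
single segment display.w3p, which is why such a URL can look like it has
a query string even though it contains no "?".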



On Fri, Oct 20, 2017 at 1:07 PM, Ryan Kaldari <[hidden email]>
wrote:

> Could someone knowledgeable about URL encoding take a look at this pull
> request? Thanks!
> https://github.com/wikimedia/DeadlinkChecker/pull/26/files