garbage characters show up when fetching wikimedia api


garbage characters show up when fetching wikimedia api

Trung Dinh
Hi all,
I have an issue when trying to parse data fetched from the Wikipedia API.
This is the piece of code that I am using:
api_url = 'http://en.wikipedia.org/w/api.php'
api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'

f = urllib2.Request(api_url, api_params)
print ('requesting ' + api_url + '?' + api_params)
source = urllib2.urlopen(f, None, 300).read()
source = json.loads(source)

json.loads(source) raised the following exception: "Expecting , delimiter: line 1 column 817105 (char 817104)"

I tried to use source.encode('utf-8') and some other encodings, but none of them helped.
Is there any workaround for this issue? Thanks :)
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: garbage characters show up when fetching wikimedia api

MZMcBride-2
Trung Dinh wrote:

>Hi all,
>I have an issue when trying to parse data fetched from the Wikipedia API.
>This is the piece of code that I am using:
>api_url = 'http://en.wikipedia.org/w/api.php'
>api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'
>
>f = urllib2.Request(api_url, api_params)
>print ('requesting ' + api_url + '?' + api_params)
>source = urllib2.urlopen(f, None, 300).read()
>source = json.loads(source)
>
>json.loads(source) raised the following exception: "Expecting , delimiter: line 1 column 817105 (char 817104)"
>
>I tried to use source.encode('utf-8') and some other encodings, but none of them helped.
>Is there any workaround for this issue? Thanks :)

Hi.

Weird, I can't reproduce this error. I had to import the "json" and
"urllib2" modules, but after doing so, executing the code you provided
here worked fine for me: <https://phabricator.wikimedia.org/P3009>.

You probably want to use 'https://en.wikipedia.org/w/api.php' as your
end-point (HTTPS, not HTTP).

As far as I know, JSON is always encoded as UTF-8, so you shouldn't need
to encode or decode the data explicitly.

The error you're getting generally means that the JSON was malformed for
some reason. It seems unlikely that MediaWiki's api.php is outputting
invalid JSON, but I suppose it's possible.
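One way a well-behaved API can still yield this error is a response that gets cut off mid-stream. A short self-contained sketch (the payload here is invented for illustration) of how truncation produces exactly this class of json.loads failure:

```python
import json

# A well-formed API-style payload, and the same payload cut off
# partway through, as a dropped connection or truncated read would leave it.
full = '{"query": {"recentchanges": [{"title": "Foo"}, {"title": "Bar"}]}}'
truncated = full[:40]

json.loads(full)  # parses fine

try:
    json.loads(truncated)
except ValueError as e:
    # The message points at the offset where parsing broke down,
    # much like "Expecting , delimiter: line 1 column ..." above.
    print(e)
```

The huge column number in the reported error is consistent with this: the parser chewed through ~800 KB of valid JSON before hitting the spot where the data stops making sense.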

Since you're coding in Python, you may be interested in a framework such
as <https://github.com/alexz-enwp/wikitools>.

MZMcBride




Re: garbage characters show up when fetching wikimedia api

Brad Jorsch (Anomie)
On Thu, May 5, 2016 at 6:16 PM, MZMcBride <[hidden email]> wrote:

> The error you're getting generally means that the JSON was malformed for
> some reason. It seems unlikely that MediaWiki's api.php is outputting
> invalid JSON, but I suppose it's possible.
>

There is https://phabricator.wikimedia.org/T132159 along those lines,
although it's not an API issue.

I note that the reported issue is with list=recentchanges, the output of
which (even at a constant timestamp offset) could easily change with page
deletion or revdel.


--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: garbage characters show up when fetching wikimedia api

Antoine Musso-3
In reply to this post by Trung Dinh
Le 05/05/2016 21:56, Trung Dinh a écrit :

> I have an issue when trying to parse data fetched from the Wikipedia API.
> This is the piece of code that I am using:
> api_url = 'http://en.wikipedia.org/w/api.php'
> api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'
>
> f = urllib2.Request(api_url, api_params)
> print ('requesting ' + api_url + '?' + api_params)
> source = urllib2.urlopen(f, None, 300).read()
> source = json.loads(source)
>
> json.loads(source) raised the following exception: "Expecting , delimiter: line 1 column 817105 (char 817104)"
>
> I tried to use source.encode('utf-8') and some other encodings but they all didn't help.
> Do we have any workaround for that issue ? Thanks :)

The error is due to the response not being valid JSON.

Can you have your script write the failing content to a file and share
it somewhere? For example via https://phabricator.wikimedia.org/file/upload/

There is a very thin chance that the servers/caches actually garble the
tail of some content. I have seen some related discussion about that
earlier this week.
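That capture step can be wrapped around the parse itself. A sketch (the function name and dump path are hypothetical) that saves the raw bytes whenever decoding fails, so the exact failing payload can be uploaded:

```python
import json

def parse_or_dump(raw_bytes, dump_path="failing_response.json"):
    # Try to parse the response; on failure, write the untouched bytes
    # to a file for inspection, then re-raise the original error.
    try:
        return json.loads(raw_bytes.decode("utf-8"))
    except ValueError:
        with open(dump_path, "wb") as f:
            f.write(raw_bytes)
        raise
```

The raw bytes are written, not the decoded text, so whatever garbage the server sent is preserved verbatim for inspection.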


--
Antoine "hashar" Musso



Re: garbage characters show up when fetching wikimedia api

Trung Dinh
In reply to this post by MZMcBride-2
Guys,

Thanks so much for your prompt feedback.
Basically, what I am doing is to keep sending the request based on date and
time until we reach another day.
Specifically, what I have is something like:

api_url = 'http://en.wikipedia.org/w/api.php'
date = '20160504022715'

while True:
    api_params = ('action=query&list=recentchanges&rclimit=5000&rctype=edit'
                  '&rcnamespace=0&rcdir=newer&format=json'
                  '&rcstart={date}'.format(date=date))
    f = urllib2.Request(api_url, api_params)
    source = urllib2.urlopen(f, None, 300).read()
    source = json.loads(source)
    # advance `date` past the timestamp of the last change fetched

Given the above code, I am encountering a weird situation. In the query,
if I set rclimit to 500, it runs normally. However, if I set rclimit
to 5000 as in my previous email, I see the error. I know that for
recentchanges rclimit should be set to 500. But is there anything
particular about the values of rclimit that could lead to broken
JSON?
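For what it's worth, the API's continuation mechanism avoids hand-advancing rcstart altogether. A sketch of that loop with the HTTP call stubbed out (fake_fetch and its two canned pages are invented for illustration; a real client would send the returned rccontinue value back to api.php):

```python
import json

def fake_fetch(params):
    # Stand-in for the HTTP request; serves two canned pages.
    if "rccontinue" not in params:
        return json.dumps({
            "query": {"recentchanges": [{"title": "A"}, {"title": "B"}]},
            "continue": {"rccontinue": "20160504022715|123"},
        })
    return json.dumps({"query": {"recentchanges": [{"title": "C"}]}})

def all_recent_changes(fetch):
    params = {"action": "query", "list": "recentchanges", "format": "json"}
    while True:
        data = json.loads(fetch(params))
        for rc in data["query"]["recentchanges"]:
            yield rc
        # The API includes a "continue" object until results are exhausted;
        # feeding it back into the next request resumes where we left off.
        if "continue" not in data:
            break
        params.update(data["continue"])

titles = [rc["title"] for rc in all_recent_changes(fake_fetch)]
print(titles)  # ['A', 'B', 'C']
```

With continuation, each request stays under the limit and no changes are skipped or double-counted between pages, which hand-incrementing a timestamp cannot guarantee.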

On 5/5/16, 11:16 PM, "Wikitech-l on behalf of MZMcBride"
<[hidden email] on behalf of [hidden email]>
wrote:

>Trung Dinh wrote:
>>Hi all,
>>I have an issue when trying to parse data fetched from the Wikipedia API.
>>This is the piece of code that I am using:
>>api_url = 'http://en.wikipedia.org/w/api.php'
>>api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'
>>
>>f = urllib2.Request(api_url, api_params)
>>print ('requesting ' + api_url + '?' + api_params)
>>source = urllib2.urlopen(f, None, 300).read()
>>source = json.loads(source)
>>
>>json.loads(source) raised the following exception: "Expecting , delimiter: line 1 column 817105 (char 817104)"
>>
>>I tried to use source.encode('utf-8') and some other encodings but they
>>all didn't help.
>>Do we have any workaround for that issue ? Thanks :)
>
>Hi.
>
>Weird, I can't reproduce this error. I had to import the "json" and
>"urllib2" modules, but after doing so, executing the code you provided
>here worked fine for me: <https://phabricator.wikimedia.org/P3009>.
>
>You probably want to use 'https://en.wikipedia.org/w/api.php' as your
>end-point (HTTPS, not HTTP).
>
>As far as I know, JSON is always encoded as UTF-8, so you shouldn't need
>to encode or decode the data explicitly.
>
>The error you're getting generally means that the JSON was malformed for
>some reason. It seems unlikely that MediaWiki's api.php is outputting
>invalid JSON, but I suppose it's possible.
>
>Since you're coding in Python, you may be interested in a framework such
>as <https://github.com/alexz-enwp/wikitools>.
>
>MZMcBride
>
>
>

Re: garbage characters show up when fetching wikimedia api

Marius Hoch
In reply to this post by Trung Dinh
Hi,

that sounds like https://phabricator.wikimedia.org/T133866.

Cheers

Marius

On 05.05.2016 21:56, Trung Dinh wrote:

> Hi all,
> I have an issue when trying to parse data fetched from the Wikipedia API.
> This is the piece of code that I am using:
> api_url = 'http://en.wikipedia.org/w/api.php'
> api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'
>
> f = urllib2.Request(api_url, api_params)
> print ('requesting ' + api_url + '?' + api_params)
> source = urllib2.urlopen(f, None, 300).read()
> source = json.loads(source)
>
> json.loads(source) raised the following exception: "Expecting , delimiter: line 1 column 817105 (char 817104)"
>
> I tried to use source.encode('utf-8') and some other encodings but they all didn't help.
> Do we have any workaround for that issue ? Thanks :)

Re: garbage characters show up when fetching wikimedia api

MZMcBride-2
In reply to this post by MZMcBride-2
MZMcBride wrote:
>The error you're getting generally means that the JSON was malformed for
>some reason. It seems unlikely that MediaWiki's api.php is outputting
>invalid JSON, but I suppose it's possible.

I left a note on the Phabricator task that Marius linked to:
<https://phabricator.wikimedia.org/T133866#2272654>.

It seems api.php end-points really are outputting garbage characters in
some cases, though it remains unclear which layer is to blame. :-/
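When chasing this kind of corruption, the parse error itself can point at the suspect bytes. A small sketch (the corrupted payload is invented, with NUL bytes standing in for the garbage characters) that prints the region around the failure offset:

```python
import json

# A payload with stray NUL bytes where garbage characters crept in.
bad = '{"ok": true, "items": [1, 2 \x00\x00 3]}'
try:
    json.loads(bad)
except ValueError as e:
    pos = getattr(e, "pos", None)  # set on json.JSONDecodeError
    if pos is not None:
        # Show a window of text around where the parser gave up.
        print(repr(bad[max(0, pos - 10):pos + 10]))
```

Dumping that window from the 800 KB response discussed above would show immediately whether the bad region is mangled UTF-8, truncation, or injected bytes.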

MZMcBride


