[MediaWiki-l] How to convert WikiText to Plain Text


[MediaWiki-l] How to convert WikiText to Plain Text

Nikhil Prakash
Hi there,

I'm searching for an efficient way to convert the WikiText of the
downloaded data dumps (in XML) to plain text. I basically need the plain
text of each and every revision of Wikipedia articles.

It would therefore be very helpful if you could point me to a library
or some piece of code (even a bunch of regexes) that converts WikiText
to plain text. BTW, I write my code in Python!

Thanks.
_______________________________________________
MediaWiki-l mailing list
To unsubscribe, go to:
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: How to convert WikiText to Plain Text

Erik Bernhardson
You can source that from the cirrussearch dumps, which contain the text
already cleaned up. The Python looks something like:

import json
from itertools import zip_longest
import requests
import zlib

def get_gzip_stream(url):
    with requests.get(url, stream=True) as res:
        # 16 + MAX_WBITS tells zlib to expect a gzip header.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        # Yield raw bytes; decoding happens per complete line so a
        # multibyte UTF-8 character split across chunks can't break it.
        for data in res.iter_content(chunk_size=64 * 1024):
            yield d.decompress(data)

def decode_lines(stream):
    buf = b''
    for data in stream:
        buf += data
        # A chunk may contain several newlines; emit every complete line.
        while b'\n' in buf:
            line, buf = buf.split(b'\n', 1)
            yield json.loads(line)
    if buf:
        yield json.loads(buf)


def pair_up_lines(lines):
    # The dump alternates an action line and a source line per page,
    # so group the stream into (metadata, document) pairs.
    return zip_longest(*([iter(lines)] * 2))

url = ('https://dumps.wikimedia.org/other/cirrussearch/20180723/'
       'enwiki-20180723-cirrussearch-content.json.gz')
stream = get_gzip_stream(url)
stream = decode_lines(stream)
stream = pair_up_lines(stream)

for meta, doc in stream:
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])

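The `zip_longest` trick above is what pairs each metadata line with the
document line that follows it; in isolation the idiom looks like:

```python
from itertools import zip_longest

def pair_up_lines(lines):
    # The same iterator object appears twice, so zip_longest consumes
    # items 0 and 1 as the first pair, 2 and 3 as the second, and so on.
    return zip_longest(*([iter(lines)] * 2))

pairs = list(pair_up_lines(['meta1', 'doc1', 'meta2', 'doc2']))
# pairs == [('meta1', 'doc1'), ('meta2', 'doc2')]
```

An odd-length stream pads the final pair with None, which is why
zip_longest is used rather than zip.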


On Tue, Jul 24, 2018 at 7:23 AM Nikhil Prakash <[hidden email]>
wrote:

>  Hi There,
>
> I'm searching for some efficient way to convert the WikiText of the
> downloaded data dumps(in XML) to plain text. I basically need plain text of
> each and every revision of Wikipedia articles.
>
> Therefore, it would be very helpful if you can tell me about some library
> or some piece of code(bunch of regex) to convert WikiText to Plain Text.
> BTW, I write my code in Python!
>
> Thanks.