Question about summary regex in api for ML dataset


Daniel Kramer
Hi, 

I'm trying to build a dataset pairing summaries with full text bodies for automatic text summarization models.

I was looking at the online api for retrieving the summary of a page, so I could recreate it in my Spark code for parsing wiki dumps. Specifically, I was looking at the regex in: https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/ApiQueryExtracts.php;012b89e966edf20834f0e551a66fbb4ebfd185cd$210

$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';

With section marker start filled in:

$regexp = '/^(.*?)(?=' . \1\2 . ')/s';

However, when I plug that expression into an online tester (regex101.com), it reports an error: "\2 This token references a non-existent or invalid subpattern".

I am wondering if this is a bug or if I'm using it incorrectly?

The alternative branch is taken when plaintext is set to false - that's for parsing HTML, correct, and so not applicable to the XML in the wiki dumps?

Thanks for your help,
Dan Kramer

_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: Question about summary regex in api for ML dataset

Platonides
Those \1 and \2 are literal bytes (characters 0x01 and 0x02), not backreferences. You would do:

$regexp = '/^(.*?)(?=\x01\x02)/s';

But those bytes are not present in the original wikitext; they are set
by ExtractFormatter:

    $html = preg_replace(
        '/\s*(<h([1-6])\b)/i',
        "\n\n" . self::SECTION_MARKER_START . '$2' . self::SECTION_MARKER_END . '$1',
        $html
    );
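Since Dan is replicating this in Spark, here is a minimal Python sketch of the same two steps - inserting the markers and then taking everything before the first one. The marker values \x01\x02 and \x02\x01 are assumed from the TextExtracts source quoted in this thread:

```python
import re

# Assumed marker constants (ExtractFormatter::SECTION_MARKER_START / _END)
SECTION_MARKER_START = "\x01\x02"
SECTION_MARKER_END = "\x02\x01"

def mark_sections(html):
    """Mirror of the preg_replace above: prefix each <h1>-<h6> tag with
    SECTION_MARKER_START + heading level + SECTION_MARKER_END."""
    return re.sub(
        r"\s*(<h([1-6])\b)",
        lambda m: "\n\n" + SECTION_MARKER_START + m.group(2) + SECTION_MARKER_END + m.group(1),
        html,
        flags=re.I,
    )

def lead_section(marked_html):
    """Equivalent of '/^(.*?)(?=\\x01\\x02)/s': everything before the first marker."""
    m = re.match("(.*?)(?=" + re.escape(SECTION_MARKER_START) + ")", marked_html, flags=re.S)
    return m.group(1) if m else marked_html
```

For example, mark_sections("<p>Intro.</p><h2>History</h2>") yields the intro followed by the marker bytes and the heading, and lead_section then returns just the intro.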

Best regards


PS: These are section names, not edit summaries.


Re: Question about summary regex in api for ML dataset

Daniel Kramer
Hi Platonides,

Thanks so much for your reply. 

That makes a lot more sense. Unfortunately, I can't seem to find section names as elements in the XML schema (https://www.mediawiki.org/xml/export-0.10.xsd). Do you have any recommendations for parsing the intro section out of the XML dumps? I'm trying to avoid parsing HTML or querying the API, because I'm using Cloud9's wiki XML reader to process the XML dumps in Spark.

Thanks again,
Dan



Re: Question about summary regex in api for ML dataset

Platonides
Hello Daniel

I'm afraid I'm not sure what you are trying to do. What exactly do you want to extract? The section names, the introduction sections (section 0), something different... ?

Kind regards


Re: Question about summary regex in api for ML dataset

Daniel Kramer
I'm trying to extract the section 0 text in full (hopefully section 0 is the "summary" of the page), and extract the rest of the article as another string.

The Cloud9 library I'm using can give me html from the xml dump, so I'm working on replicating the regex patterns for the section markers. Unless you think there's a better way to get the section 0 text from the xml? 

Thanks,
Dan



Re: Question about summary regex in api for ML dataset

Platonides
Are you sure you are getting html from the XML and not wikitext?

Assuming you are working with wikitext, and want everything up to the first heading (handwaving cases like a section heading emitted by a template), you could break at the first line matching /^(={2,6})[ \t]*(.+?)[ \t]*\1\s*$/m (see the function Parser::doHeadings below).
In practice, splitting at "\n==" will give you the right result on 99% of articles.

If the library is really giving you html, it's even easier: split the html at the first <h[1-6]> tag.
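The wikitext split described above can be sketched in a few lines of Python (this is my sketch, not MediaWiki code; it makes no attempt to handle headings produced by templates):

```python
import re

# A wikitext heading line: == Title ==, === Title ===, ... up to ======
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1\s*$", re.M)

def split_lead(wikitext):
    """Return (section 0, remainder), splitting at the first heading line."""
    m = HEADING_RE.search(wikitext)
    if not m:
        return wikitext, ""
    return wikitext[:m.start()], wikitext[m.start():]
```

The cruder "\n==" split mentioned above amounts to wikitext.split("\n==", 1).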

Note that the wikitext will contain a lot of non-textual markup (templates, tables, wikitext formatting, references...) that you'd need to clean up before applying your models.
However, other projects have done this in the past (sorry, I have no links to them), so I would either do a very basic cleanup or reuse what others have built.

Best regards

===============================
        public function doHeadings( $text ) {
            for ( $i = 6; $i >= 1; --$i ) {
                $h = str_repeat( '=', $i );
                // Trim non-newline whitespace from headings
                // Using \s* will break for: "==\n===\n" and parse as <h2>=</h2>
                $text = preg_replace( "/^(?:$h)[ \\t]*(.+?)[ \\t]*(?:$h)\\s*$/m", "<h$i>\\1</h$i>", $text );
            }
            return $text;
        }
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/parser/Parser.php$1672


Re: Question about summary regex in api for ML dataset

Erik Bernhardson
I don't know if it helps you, but the cirrussearch dumps contain the opening text (the text before the first section header) broken out into plain text. These dumps are limited to the current (as of the time of the dump) version of each article, with no historical data. The dumps themselves are lines of json, so not too hard to parse.
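A minimal sketch of reading such a dump, assuming the JSON-lines layout where index-action lines alternate with document lines; the field names ("title", "opening_text", "text") are my assumption based on the CirrusSearch document format:

```python
import json

def iter_docs(path):
    """Yield (title, opening_text, text) tuples from a cirrussearch content dump.

    Assumes alternating lines: an index-action line ({"index": {...}})
    followed by the article document itself.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if "index" in doc:  # action line carries no article content
                continue
            yield doc.get("title"), doc.get("opening_text"), doc.get("text")
```

The same per-line json.loads works unchanged as a map step over the dump file in Spark.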


The cirrusbuilddoc property of the api is roughly the same format as the dumps.

