Data decay? Concerning graph

thingles
This email isn't really a bug report, but I’m seeing a pattern in
Semantic MediaWiki that sure is worrisome. I tend to think that
WikiApiary is pushing some boundaries for Semantic MediaWiki, so I fully
expect I might be seeing some behavior that hasn't been seen before, or
perhaps hasn't been monitored closely before.

As background, all 19,000+ websites on WikiApiary are assigned to a
segment. The segment is simply the Page ID mod 15 (a relatively even
distribution across segments 0-14).

[[Has bot segment::{{ #expr: {{PAGEID}} mod {{WikiApiary:Bot segments}}
}}]]
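
For readers who want to see what that expression produces, here is a minimal Python sketch of the same arithmetic. It assumes that {{WikiApiary:Bot segments}} evaluates to 15, as described above, and it uses made-up, contiguous page IDs purely for illustration (real page IDs on the wiki are not contiguous).

```python
from collections import Counter

BOT_SEGMENTS = 15  # assumed value of {{WikiApiary:Bot segments}}


def bot_segment(page_id, segments=BOT_SEGMENTS):
    """Mirror the wikitext expression: {{PAGEID}} mod {{WikiApiary:Bot segments}}."""
    return page_id % segments


def segment_counts(page_ids, segments=BOT_SEGMENTS):
    """Count how many pages fall into each segment."""
    return Counter(bot_segment(pid, segments) for pid in page_ids)


# Hypothetical page IDs 1..19000, only to show the shape of the distribution.
counts = segment_counts(range(1, 19001))
for segment in sorted(counts):
    print(segment, counts[segment])
```

With 15 segments the possible values are 0 through 14, and for evenly spread page IDs the per-segment counts differ by at most one.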

I care about how these segments are balanced because the bots use them
to do work, so I have munin graph the count of websites in each segment
every 5 minutes. This has been happening for a while, and you can see
the graphs here:

http://db.thingelstad.com/munin/thingelstad.com/db.thingelstad.com/wikiapiary_segments.html

Now, take a moment to look at the monthly one.

The craziness that happened in Week 22 seems to have been the result of
some issue in the master branch. I’m sorry to say I didn’t do a good job
of tracking which commits I moved between, but something started dropping
SMW data like crazy and an update to the newest master fixed it. (Does
composer keep a log that would tell me?)

However, I’m more concerned when I look at the weekly one. Note the
behavior in weeks 19, 20 and 21: the graph jumps up and then gradually
decays over the entire week. There is NO behavior in WikiApiary that would
justify that pattern. It is worth noting that I have a cron job that
runs SMW_refresh every weekend. That is when you see the graph correct
back up.

This looks like there is some gradual decay in semantic data that is
naturally occurring, and then getting corrected by the refresh. (This
might also explain why sometimes websites just stop collecting data in
WikiApiary for no known reason, a bug I've tried fruitlessly to track
down in my code.)

I know everyone has concerns about the data store. It lacks unit tests
and all. This behavior, combined with the never-diagnosed duplicates
problem, makes me worry there are numerous issues at the heart of SMW
that need to be ferreted out.

Note, if you are curious about how this data is collected, you can see
this wiki page:

https://wikiapiary.com/wiki/WikiApiary:Munin

The only valid reason for the counts in a segment going down is an
operator marking them as inactive, and that cannot explain the decay in
these graphs week over week.

--
  Jamie Thingelstad
  [hidden email]

_______________________________________________
Semediawiki-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/semediawiki-devel
Re: Data decay? Concerning graph

nischay nahata
Isn't the PAGEID part of MW itself? In that case I guess you meant that values for your SMW property 'Has bot segment' might be getting lost?
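
One way to test that reading is to compare what the page ID alone predicts with what SMW currently reports. The sketch below is only an illustration: the two inputs (the set of active page IDs and a mapping of page ID to the stored 'Has bot segment' value) are hypothetical and would have to be fetched from MediaWiki and from an SMW query respectively, which is not shown here.

```python
def missing_segment_values(active_page_ids, stored_segment_by_page, segments=15):
    """Return page IDs whose stored segment is absent or differs from page_id % segments."""
    missing = []
    for page_id in sorted(active_page_ids):
        expected = page_id % segments
        if stored_segment_by_page.get(page_id) != expected:
            missing.append(page_id)
    return missing


# Hypothetical data: page 32 exists in MediaWiki but has no stored property value.
active = {30, 31, 32, 47}
stored = {30: 0, 31: 1, 47: 2}
print(missing_segment_values(active, stored))
```

If the list grows during the week and empties after the weekend refresh, that would point at property values being lost rather than at PAGEID changing.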


On Tue, Jun 3, 2014 at 3:21 AM, Jamie Thingelstad <[hidden email]> wrote:
--
Cheers,

Nischay Nahata

Re: Data decay? Concerning graph

James HK
Hi,

This is not an analysis of the mentioned issue but an explanation as
to how updates of a wikipage will end up in the SMW-store.

## Hooks

MediaWiki provides several hooks that SMW uses to identify, parse and
filter the information relevant to it. The hooks required for the storage
process are InternalParseBeforeLinks, LinksUpdateConstructed,
NewRevisionFromEditComplete, and ParserAfterTidy.

## Page edit/save
After someone or something (a bot) edits and saves a page, the
InternalParseBeforeLinks hook [1] is called. In SMW, this hook is
responsible for parsing the "raw" text of the wikipage using the
InTextAnnotationParser. The InTextAnnotationParser converts links
like [[Foo::bar]] into an internal representation and removes any
SMW-specific syntax ([[ :: ]]) from the text so that the wikipage
displays a simple "bar".
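
As a rough illustration of that step (not the actual InTextAnnotationParser, which is PHP and handles far more syntax), the Python sketch below pulls [[Property::value]] pairs out of raw text and leaves only the value behind. The regular expression and the function name are assumptions made for this example.

```python
import re

# Matches the simple form [[Property::value]] only; other SMW syntax is ignored here.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")


def extract_annotations(wikitext):
    """Return (display_text, annotations) for a piece of raw wikitext."""
    annotations = []

    def replace(match):
        prop, value = match.group(1).strip(), match.group(2).strip()
        annotations.append((prop, value))
        # Only the value remains in the text that is displayed.
        return value

    return ANNOTATION.sub(replace, wikitext), annotations


text, data = extract_annotations("A page about [[Foo::bar]].")
print(text)  # A page about bar.
print(data)  # [('Foo', 'bar')]
```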

At this point, the data are not yet stored by SMW; instead, the
collected data [2] are carried along in the ParserOutput object to
enable post-processing after an edit/save page process.

Each time MediaWiki executes Parser::parse(), the
InternalParseBeforeLinks hook is fired and a status
('smw-semanticdata-status' as Page property [3]) is set to distinguish
between SMW relevant and non-relevant edits.

## Predefined properties
NewRevisionFromEditComplete is called once a revision has been created and
adds predefined properties such as "Modification date". Again,
at this point data are not stored and only updates to the ParserOutput
object are carried out.

## Sortkey / Category
The ParserAfterTidy hook is used to update the ParserOutput object with
sortkey and category information, as these are not available earlier
in the process.

## Store update
LinksUpdateConstructed is one of three places in MW where the
collected data (in form of [2]) are retrieved from the ParserOutput
object to initiate a StoreUpdate.

## Page purge
In some circumstances (depending on the configuration) it is desirable to
purge the content of a page together with its semantic data (in
earlier versions of SMW that was not possible). The ParserAfterTidy
hook is used for this as well, since LinksUpdateConstructed can't be
used (it is only triggered on a page save). This makes the
ParserAfterTidy hook the second place that can initiate a StoreUpdate,
but only in the case of "&action=purge".

## Data rebuild
When data are scheduled for a rebuild, each selected page will trigger
an UpdateJob [4].

At the time of its execution, the UpdateJob finds the most recent
revision using the ContentParser and parses its "raw" text (and by
doing so runs through the InternalParseBeforeLinks,
NewRevisionFromEditComplete, and ParserAfterTidy hooks) to create a
ParserOutput object.

The created ParserOutput is used to retrieve the SemanticData
container, followed by a StoreUpdate.

UpdateJob is the third place to set off a StoreUpdate (of course you
could trigger a LinksUpdate as the refreshLinksJob does but that's
another discussion).
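
To summarize the flow described above, here is a toy Python model. It is not SMW code: the class and function names are invented for illustration, the data values are placeholders, and for brevity parse() applies all three collecting steps on every path, which the description above only states explicitly for the UpdateJob. The point it tries to show is that the hooks only add data to a ParserOutput, and that exactly three triggers write to the store.

```python
from dataclasses import dataclass, field


@dataclass
class ParserOutput:
    """Toy stand-in for MediaWiki's ParserOutput: it carries data, it stores nothing."""
    semantic_data: dict = field(default_factory=dict)


class Store:
    """Toy stand-in for the SMW store; logs which trigger caused each update."""

    def __init__(self):
        self.rows = {}
        self.log = []

    def update(self, page, output, trigger):
        self.rows[page] = dict(output.semantic_data)
        self.log.append((page, trigger))


def parse(annotations):
    """Collect data on a ParserOutput, as the hooks do, without touching the store."""
    output = ParserOutput()
    output.semantic_data.update(annotations)                   # InternalParseBeforeLinks
    output.semantic_data["Modification date"] = "<timestamp>"  # NewRevisionFromEditComplete
    output.semantic_data["sortkey"] = "<sortkey>"              # ParserAfterTidy (placeholder key)
    return output


def on_page_save(store, page, annotations):
    store.update(page, parse(annotations), "LinksUpdateConstructed")


def on_page_purge(store, page, annotations):
    store.update(page, parse(annotations), "ParserAfterTidy (action=purge)")


def run_update_job(store, page, annotations):
    store.update(page, parse(annotations), "UpdateJob")


store = Store()
on_page_save(store, "Example wiki", {"Has bot segment": 7})
on_page_purge(store, "Example wiki", {"Has bot segment": 7})
run_update_job(store, "Example wiki", {"Has bot segment": 7})
for page, trigger in store.log:
    print(page, "<-", trigger)
```

In this model, a property value can only disappear from the store if one of the three triggers runs with a ParserOutput that no longer contains it, which may be a useful way to narrow down where to look.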

PS: A general note, WikiApiary currently runs MediaWiki 1.23.0-rc.1
(which because of [0] should not be used in production).

[0] https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/212

[1] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/includes/src/MediaWiki/Hooks/InternalParseBeforeLinks.php

[2] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/includes/SemanticData.php

[3] https://www.mediawiki.org/wiki/Manual:Page_props_table

[4] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/includes/src/MediaWiki/Jobs/UpdateJob.php

Cheers
