How to track all the diffs in real time?


How to track all the diffs in real time?

Maximilian Klein
Hello Researchers,

I've been playing with the Recent Changes Stream Interface recently, and have started trying to use the API's "action=compare" to look at every diff of every wiki in real time. The goal is to produce real-time analytics on the content that's being added or deleted. The only problem is that it will really hammer the API with lots of reads, since it doesn't have a batch interface. Can I spawn multiple network threads and do 10+ reads per second forever without the API complaining? Can I warn someone about this and get a special exemption for research purposes?
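
Concretely, the per-edit lookup I have in mind looks something like this (a rough Python sketch, assuming the requests library; the revision IDs would come from each stream event's revision.old/revision.new fields):

    import requests

    API = "https://en.wikipedia.org/w/api.php"  # one such endpoint per wiki

    def fetch_diff(old_rev, new_rev):
        # Ask the API for MediaWiki's own rendering of one diff.
        params = {
            "action": "compare",
            "fromrev": old_rev,
            "torev": new_rev,
            "format": "json",
        }
        resp = requests.get(API, params=params, timeout=30)
        resp.raise_for_status()
        # The rendered diff is an HTML table fragment under compare["*"].
        return resp.json()["compare"]["*"]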

The other thing to do would be to use "action=query" to get the revisions in batches and do the diffing myself, but then I'm not guaranteed to be diffing in the same way that the site is.
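
If I went that route, the sketch would be something like this (difflib here is just a stand-in; it won't reproduce MediaWiki's own diff rendering):

    import difflib
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_revisions(rev_ids):
        # Pull up to 50 revisions with their content in one call.
        params = {
            "action": "query",
            "prop": "revisions",
            "revids": "|".join(str(r) for r in rev_ids),  # max 50 per call
            "rvprop": "ids|content",
            "format": "json",
        }
        return requests.get(API, params=params, timeout=30).json()

    def local_diff(old_text, new_text):
        # Line-level unified diff; NOT the same algorithm the site uses.
        return "\n".join(difflib.unified_diff(
            old_text.splitlines(), new_text.splitlines(), lineterm=""))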

What techniques would you recommend?


Make a great day,
Max Klein ‽ http://notconfusing.com/


Re: How to track all the diffs in real time?

Toby Negrin
Hi Max -- let me ping the API folks. I don't think we researchers can make the final call on this.

-Toby


Re: How to track all the diffs in real time?

Andrew G. West
Greetings list,

I do something similar for the [[WP:STiki]] anti-vandalism service. I
listen on IRC to determine new edits (en only) and then hit the API with
"action=query" on a non-batch basis. My application is multi-threaded in
order to keep up.
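
The shape of that is roughly as follows (a sketch of the pattern in
Python, not STiki's actual code; the queue would be fed by the IRC
listener, and process() is a placeholder):

    import queue
    import threading
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    HEADERS = {"User-Agent": "my-research-tool/1.0 (contact: you@example.org)"}

    edit_queue = queue.Queue()  # filled by the IRC listener (not shown)

    def process(payload):
        # Placeholder for application logic (e.g. vandalism scoring).
        print(payload)

    def worker():
        # Each worker drains revision IDs and queries the API one at a time.
        while True:
            rev_id = edit_queue.get()
            params = {
                "action": "query",
                "prop": "revisions",
                "revids": rev_id,
                "rvprop": "ids|user|content",
                "format": "json",
            }
            resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
            process(resp.json())
            edit_queue.task_done()

    for _ in range(8):  # enough threads to keep up with the edit rate
        threading.Thread(target=worker, daemon=True).start()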

I asked along these lines in 2009 when it was initially authored. I was
told it wouldn't be a problem, but one should set their user agent so
that it provides a contact email and alludes to the benign impetus of
all these requests. -AW

--
Andrew G. West, PhD
Research Scientist
http://www.andrew-g-west.com



Re: How to track all the diffs in real time?

Scott Hale
In reply to this post by Toby Negrin
So Datasift is doing something like this already: they have a stream of edits to the English edition of Wikipedia that contains content in near real-time [1]. I'm not saying to use them, but it might be instructive if we can figure out (or if the API folks know) what they are doing.


- Scott

--
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
[hidden email]


Re: How to track all the diffs in real time?

Yuvi Panda
If a lot of people are doing this, then perhaps it makes sense to have
an 'augmented real time streaming' interface that is an exact replica
of the streaming interface but with diffs added.


Re: How to track all the diffs in real time?

Yuvi Panda

Or rather, if I were to build such a thing, would people be interested
in using it?

--
Yuvi Panda T
http://yuvi.in/blog


Re: How to track all the diffs in real time?

Scott Hale
Great idea, Yuvi. Speaking as someone who just downloaded diffs for a month of data from the streaming API for a research project, I certainly could see an 'augmented stream' with diffs included being very useful for research and also for bots.



--
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
[hidden email]


Re: How to track all the diffs in real time?

Oliver Keyes-4
Oh dear god, that would be incredible.

The non-streaming API has a wonderful bug: if you request a series of diffs, and there are >1 uncached diffs in that series, only the first uncached diff will be returned. For the rest it returns...an error? No. Some kind of special value? No. It returns an empty string. You know: that thing it also returns if there is no difference >.> So instead you stream edits and compute the diffs yourself and everything goes a bit Pete Tong. Having this service around would be a lifesaver.


--
Oliver Keyes
Research Analyst
Wikimedia Foundation


Re: How to track all the diffs in real time?

Ed Summers
In reply to this post by Scott Hale
+1 Yuvi

About a year ago I put together a little program that identified .uk external links in Wikipedia’s changes for the web archiving folks at the British Library. Because it needed to fetch the diff for each change I never pushed it very far, out of concerns for the API traffic. I never asked though, so good on Max for bringing it up.

Rather than setting up an additional stream endpoint I wonder if it might be feasible to add a query parameter to the existing one? So, something like:

    http://stream.wikimedia.org/rc?diff=true
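
An event on that stream might then look something like this (purely illustrative, sketched as a Python dict; only the "diff" field is the hypothetical addition, the rest mirrors what RCStream already emits):

    event = {
        "type": "edit",
        "wiki": "enwiki",
        "server_name": "en.wikipedia.org",
        "title": "Example article",
        "user": "ExampleUser",
        "comment": "copyedit",
        "revision": {"old": 123456, "new": 123457},
        "diff": "<tr>...</tr>",  # the rendered diff, as action=compare returns it
    }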

//Ed


Re: How to track all the diffs in real time?

Aaron Halfaker-3
Hey folks,

I've been working on building up a revision diffs service that you'd be able to listen to, or from which you could download a dump of revision diffs.

See https://github.com/halfak/Difference-Engine for my progress on the live system and https://github.com/halfak/MediaWiki-Streaming for my progress developing a Hadoop Streaming primer to generate old diffs[1].  See also https://github.com/halfak/Deltas for some experimental diff algorithms developed specifically to track content moves in Wikipedia revisions. 

In the short term, I can share diff datasets.  In the near-term, I'm wondering if you folks would be interested in working on the project with me.  If so, let me know and I'll give you a more complete status update. 

1. It turns out that generating diffs is computationally complex, so generating them in real time is slow and lame.  I'm working to generate all diffs historically using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].

-Aaron


Re: How to track all the diffs in real time?

Oliver Keyes-4
I'd be interested in helping if we could generalise it!

You can probably get a substantial speed improvement in C or C++, and C and C++ are generalisable to Python and R, our primary working languages for analytics. And R lacks any kind of text diffing engine, so I've been looking into how to build one.

So if we switch languages for a performance win and generalisability, I'm in ;).


--
Oliver Keyes
Research Analyst
Wikimedia Foundation


Re: How to track all the diffs in real time?

Thomas Steiner
In reply to this post by Ed Summers
Hi all,

+1 to Ed's point on making it a parameter rather than a new endpoint. I would definitely use it. Currently, I share a Server-Sent Events API connection for my projects (and invite others to use it, too: http://wikipedia-edits.herokuapp.com/sse).
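
Consuming it is about as simple as it gets (a sketch, assuming the Python sseclient package; the payload fields are whatever the service emits):

    import json
    from sseclient import SSEClient

    for event in SSEClient("http://wikipedia-edits.herokuapp.com/sse"):
        if event.data:  # skip keep-alive events with empty data
            change = json.loads(event.data)
            print(change)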

Thanks,
Tom


--
Dr. Thomas Steiner, Employee, Google Inc.
http://blog.tomayac.com, http://twitter.com/tomayac

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom.hTtP5://xKcd.c0m/1181/
-----END PGP SIGNATURE-----


Re: How to track all the diffs in real time?

Mitar
In reply to this post by Maximilian Klein
Hi!

I made a Meteor DDP API for the stream of recent changes on all
WikiMedia wikis. Now you can simply use DDP.connect in your Meteor
application to connect to the stream of changes on Wikipedia. You can
use MongoDB queries to limit it to only those changes you are
interested in. If there is interest, I could also add full diff
support and then you could try to hit this API. We could probably
also eventually host it on Wikimedia Labs.

http://wikimedia.meteor.com/


Mitar


--
http://mitar.tnode.com/
https://twitter.com/mitar_m


Re: How to track all the diffs in real time?

Mitar
Hi!

Now with full diffs as well.


Mitar


--
http://mitar.tnode.com/
https://twitter.com/mitar_m


Re: How to track all the diffs in real time?

Jeremy Baron
In reply to this post by Aaron Halfaker-3

On Dec 13, 2014 12:33 PM, "Aaron Halfaker" <[hidden email]> wrote:
> 1. It turns out that generating diffs is computationally complex, so generating them in real time is slow and lame.  I'm working to generate all diffs historically using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].

IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for all enwiki diffs for all time (I don't remember if this is namespace-limited), but also using an extraordinary amount of RAM, i.e. hundreds of GB.

AIUI, there's no dynamic memory allocation: revisions are loaded into fixed-size buffers larger than the largest revision.

https://github.com/makoshark/wikiq

-Jeremy



Re: How to track all the diffs in real time?

Flöck, Fabian
If anyone is interested in faster processing of revision differences, you could also adapt the strategy we implemented for wikiwho [1], which is keeping track of bigger unchanged text chunks with hashes and just diffing the remaining text (usually a relatively small part of the article). We specifically introduced that technique because diffing all the text was too expensive. And in principle, it can produce the same output, although we currently use it for authorship detection, which is a slightly different task. Anyway, it is on average >100 times faster than pure "traditional" diffing. Maybe that is useful for someone. Code is available on GitHub [2].
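
The core idea, very roughly (a toy Python sketch of the strategy, not the wikiwho code: hash paragraph-level chunks, treat equal hashes as unchanged, and run an expensive word-level diff only on the chunk runs that differ):

    import difflib
    import hashlib

    def _chunk_hash(s):
        return hashlib.md5(s.encode("utf-8")).hexdigest()

    def cheap_diff(old_text, new_text):
        old_parts = old_text.split("\n\n")
        new_parts = new_text.split("\n\n")
        # Align unchanged chunks by hash rather than by full text.
        matcher = difflib.SequenceMatcher(
            a=[_chunk_hash(p) for p in old_parts],
            b=[_chunk_hash(p) for p in new_parts],
            autojunk=False)
        ops = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue  # unchanged chunks are skipped: the big saving
            old_words = "\n\n".join(old_parts[i1:i2]).split()
            new_words = "\n\n".join(new_parts[j1:j2]).split()
            ops.extend(difflib.ndiff(old_words, new_words))
        return ops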

[1] http://f-squared.org/wikiwho
[2] https://github.com/maribelacosta/wikiwho


Cheers, 
Fabian

--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
[hidden email]
 
www.gesis.org
www.facebook.com/gesis.org







Re: How to track all the diffs in real time?

Maximilian Klein
All,
Thanks for the great responses. It seems like Andrew, Ed, DataSift, and Mitar are now all offering overlapping solutions to the real-time diff monitoring problem. The one thing I take away from that is that if the API is robust enough to serve these four clients in real time, then adding another is a drop in the bucket.

However, as others like Yuvi pointed out, and as Aaron has prototyped, we could make this better by serving an augmented RCStream. I wonder how easy it would be to allow community development on that project, since it seems it would require access to the full databases, which only WMF developers have at the moment.

Make a great day,
Max Klein ‽ http://notconfusing.com/


Re: How to track all the diffs in real time?

Mitar
Hi!

What more do you have in mind that could be in an "augmented stream"
beyond the current RCStream data plus diffs as they are provided by
the API?


Mitar


--
http://mitar.tnode.com/
https://twitter.com/mitar_m
