Public Event Streams (AKA RCStream replacement) question

Public Event Streams (AKA RCStream replacement) question

Andrew Otto
Hi all,

We’ve been busy working on building a replacement for RCStream.  This new
service would expose recentchanges as a stream as usual, but also other
types of event streams that we can make public.

But we’re having a bit of an existential crisis!  We had originally chosen
to implement this using an up-to-date socket.io server, as RCStream also
uses socket.io.  We’re mostly finished with this, but now we are taking a
step back and wondering if socket.io/websockets are the best technology to
use to expose stream data these days.

The alternative is to just use ‘streaming’ HTTP chunked transfer encoding.
That is, the client makes an HTTP request for a stream, and the server
declares that it will be sending back data indefinitely in the response
body.  Clients just read (and parse) events out of the HTTP response body.
There is some event tooling built on top of this (namely SSE /
EventSource), but the basic idea is a never ending streamed HTTP response
body.
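
To make that concrete, here's a minimal consumer sketch in Python.  The
URL is made up for illustration, like everything else at this stage:

    import json
    import requests

    # Hypothetical stream URL; the real endpoint doesn't exist yet.
    STREAM_URL = 'https://stream.example.org/v1/stream/recentchange'

    # One long-lived GET; the server keeps appending to the response body.
    response = requests.get(STREAM_URL, stream=True)
    for line in response.iter_lines():
        # With SSE framing, each event arrives as a 'data: ...' line.
        if line.startswith(b'data: '):
            event = json.loads(line[len(b'data: '):])
            print(event)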

So, I’m reaching out to gather some input to help inform a decision.
What will be easier for you, the users of RCStream, in the future?  Would you
prefer to keep using socket.io (newer version), or would you prefer to work
directly with HTTP?  There seem to be good clients for socket.io and for
SSE/EventSource in many languages.

https://phabricator.wikimedia.org/T130651 has more context, but don’t worry
about reading it; it is getting a little long.  Feel free to chime in there
or on this thread.

Thanks!
-Andrew Otto

Re: Public Event Streams (AKA RCStream replacement) question

Brad Jorsch (Anomie)
The few times I've tried to look at the existing rcstream service, I've
quickly been stymied by not finding any documentation of the actual
protocol involved.

Whatever solution is chosen, it would be very nice if there were
easy-to-find documentation that a skilled developer could use to consume
the service, starting from the ability to make an SSL connection to a
server, instead of starting from "use python or nodejs, then require
'socket.io'".

--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: Public Event Streams (AKA RCStream replacement) question

Andrew Otto
So, since most of the dev work for a socket.io implementation is already
done, you can see what the protocol would look like here:
https://github.com/wikimedia/kasocki#socketio-client-set-up

Kasocki is just a library; the actual WMF deployment and documentation
would be more specific about MediaWiki-type events, but the interface would
be the same.  (Likely there would be client libraries to abstract the
actual socket.io interaction.)

For HTTP, instead of an RPC-style protocol where you configure the stream
you want via several socket.emit calls, you’d construct the URI that
specifies the event streams (and partitions and offsets if necessary) and
filters you want, and then request it.  Perhaps something like ‘http://
.../stream/mediawiki.revision-create?database=plwiki;rev_len:gt100’ (I
totally just made this URL up, no idea if it would work this way).
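
To illustrate the difference (with made-up event names and URLs, in the
same spirit as the one above), the two styles might look like this in
Python:

    import requests
    import socketio  # python-socketio client

    # RPC style: configure the stream over several emits.  'subscribe',
    # 'filter' and 'start' are hypothetical event names loosely modeled
    # on Kasocki; the real interface may differ.
    sio = socketio.Client()
    sio.connect('https://stream.example.org')
    sio.emit('subscribe', ['mediawiki.revision-create'])
    sio.emit('filter', {'database': 'plwiki'})
    sio.emit('start')

    # HTTP style: the whole configuration collapses into one URL.
    response = requests.get(
        'https://stream.example.org/stream/mediawiki.revision-create'
        '?database=plwiki', stream=True)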

Re: Public Event Streams (AKA RCStream replacement) question

Brad Jorsch (Anomie)
See, that's the sort of thing I was complaining about. If I'm not using
whatever language happens to have a library already written, there's no
spec so I have to reverse-engineer it from an implementation. And in this
case that seems like socket.io on top of engine.io on top of who knows what
else.


--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: Public Event Streams (AKA RCStream replacement) question

Antoine Musso
socket.io has libraries in several languages. The RCStream documentation
shows examples for JavaScript and Python:
https://wikitech.wikimedia.org/wiki/RCStream#Client

It is true, though, that a library has to be written on top of that to be
aware of the MediaWiki event dialect.
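
From memory, the Python example there looks roughly like the following,
using the socketIO_client package (check the wiki page for the
authoritative version):

    from socketIO_client import SocketIO, BaseNamespace

    class RCNamespace(BaseNamespace):
        def on_connect(self):
            # Subscribe to recent changes for a single wiki.
            self.emit('subscribe', 'commons.wikimedia.org')

        def on_change(self, change):
            print('%(user)s edited %(title)s' % change)

    socketIO = SocketIO('stream.wikimedia.org', 80)
    socketIO.define(RCNamespace, '/rc')
    socketIO.wait()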

--
Antoine "hashar" Musso



Re: Public Event Streams (AKA RCStream replacement) question

Antoine Musso
Hello,

As I understand it, we have a legacy system we want to replace. It uses
an old socket.io version with a set of events A.

Since you "are mostly finished with" a replacement that has the latest
socket.io I would ship that now and drop/replace the legacy system. With
no new events.

From there, survey people about changing the transport layer, which leads
me to a few questions:

- is RCStream actually used?
- how many clients?
- typology of clients (big corps like Yahoo or Google, volunteers, WMF
internal use) ...

Then survey about the change of transport.  The red herring is that if
you get mostly volunteers, it is going to be long and tedious to have
them change to the new system.  AFAIK WMF still maintains an IRC server
to stream events, which RCStream was supposed to make obsolete.  There
are still tools and bots relying on the IRC protocol with no developers
able to do the migration.

You will face the exact same problem by changing to HTTP chunks, and we
would end up with:
- IRC (legacy)
- socket.io (on a legacy / outdated infra)
- HTTP chunked


My recommendations are:
- Upgrade the current socket.io, since that work is apparently already done.
- Find out who the consumers of the IRC feed and RCStream are, run a
survey, and figure out what would fit their needs best.
- Come up with a plan to DROP the old systems.

Hopefully we then end up with a single system that people can build upon
and on which we can introduce new types of events.


My 0.02 €

--
Antoine "hashar" Musso



Re: Public Event Streams (AKA RCStream replacement) question

Merlijn van Deen
Hi Andrew,

On 23 September 2016 at 23:15, Andrew Otto <[hidden email]> wrote:

> We’ve been busy working on building a replacement for RCStream.  This new
> service would expose recentchanges as a stream as usual, but also other
> types of event streams that we can make public.
>

First of all, why does it need to be a replacement, rather than something
that builds on existing infrastructure? Re-using the existing
infrastructure provides a much more convenient path for consumers to
upgrade.


> But we’re having a bit of an existential crisis!  We had originally chosen
> to implement this using an up-to-date socket.io server, as RCStream also
> uses socket.io.  We’re mostly finished with this, but now we are taking a
> step back and wondering if socket.io/websockets are the best technology to
> use to expose stream data these days.
>

For what it's worth, I'm on the fence about socket.io. My biggest argument
for socket.io is the fact that rcstream already uses it, but my experience
with implementing the pywikibot consumer for rcstream is that the Python
libraries are lacking, especially when it comes to stuff like reconnecting.
In addition, debugging issues requires knowledge of both socket.io and the
underlying websockets layer, which are both very different from regular
http.

From the task description, I understand that the goal is to allow easy
resumption by passing information about the last received message. You
could consider not implementing streaming /at all/, and just ask clients to
poll an http endpoint, which is much easier to implement client-side than
anything streaming (especially when it comes to handling disconnects).
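
A complete client-side poll loop is only a few lines; the endpoint and its
continuation parameter in this sketch are hypothetical:

    import time
    import requests

    marker = None  # opaque continuation marker for the last event seen

    while True:
        # Hypothetical endpoint returning a JSON batch of events plus a
        # marker to pass back on the next poll.
        params = {'since': marker} if marker else {}
        batch = requests.get('https://events.example.org/poll',
                             params=params).json()
        for event in batch['events']:
            print(event)
        marker = batch['next']
        time.sleep(5)  # disconnects cost nothing: just poll again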

So: My preference would be extending the existing rcstream framework, but
if that's not possible, my preference would be with not streaming at all.

Merlijn

Re: Public Event Streams (AKA RCStream replacement) question

Joaquin Oltra Hernandez
Why not expose the websockets as a standard websocket server so that it can
be consumed by any language/platform that has a standard websocket
implementation?

https://www.npmjs.com/package/ws

Pinning to socket.io versions or other abstractions leads to what happened
before: you can get stuck on an old version, and the protocol is specific
to the library and to the platforms where that library has been implemented.

By using a standard websocket server, you can provide a minimal standards
compliant service that can be consumed across other languages/clients, and
if there are services that need the socket.io features you can provide a
different service that proxies the original one but puts socket.io on top
of it.
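
That is, with a plain websocket server any off-the-shelf RFC 6455 client
works. For example, in Python with the websockets package, against a
made-up endpoint:

    import asyncio
    import websockets

    async def consume():
        # Any standards-compliant client can connect; there is no
        # socket.io framing to reverse-engineer.
        async with websockets.connect('wss://stream.example.org/rc') as ws:
            async for message in ws:
                print(message)

    asyncio.get_event_loop().run_until_complete(consume())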

---

For the RCStream use case, server-sent events (SSE) are a great fit too
(given you don't need bidirectional communication), so that would make a
lot of sense instead of websockets (it'd probably be easier to scale).

Whatever it is, I'd vote for sticking to standard implementations, either
pure websockets or HTTP server-sent events, and letting shim layers that
provide other features, like socket.io, be implemented in separate proxy
servers.




Re: Public Event Streams (AKA RCStream replacement) question

Andrew Otto
Thanks for the feedback so far, this is great.

> If I’m not using whatever language happens to have a library already
> written, there’s no spec so I have to reverse-engineer it from an
> implementation.

Brad, sorry, that is just an example for the nodejs Kasocki library.  We
will need more non-language-specific docs about how to interact with this
service, no matter what it might be.  In either the socket.io or
SSE/EventSource (HTTP) case, you will need a client library, so there will
be language-specific code and documentation needed.  But the interface
will be documented in a non-language-specific way.  Keep on me about this:
when this thing is ‘done’, if the documentation is not what you are looking
for, yell at me and I will make it better! :)

BTW, given that halfak wrote the original proposal[1] for this project, and
that he maintains a Python abstraction for MediaWiki events[2] based on
recent changes, I wouldn’t be surprised if he (or someone) built a Python
abstraction on top of EventStreams, whatever transport we end up choosing.


Antoine, you are right that we don’t really have a plan for phasing out
the older systems.  RCFeed especially, since there are so many tools built
on it.  RCStream will be similar to the EventStreams / Kasocki stuff, and
we know that people at least want it to use an up-to-date socket.io
version, so that might be easier to phase out.  I don’t even know who
maintains RCFeed.  I’ll reach out and see if I can understand this and
make a phase-out plan as a subtask of the larger Public Event Streams
project.


> First of all, why does it need to be a replacement, rather than something
> that builds on existing infrastructure?

We want a larger feature set than the existing infrastructure provides.
RCStream is built for only the Recent Changes events, and has no historical
addressing.  Clients should be able to reconnect and start the stream from
where they last left off, or even wherever they choose.  In a dream world,
I’d love to see this thing support timestamp-based consumption for any
point in time.  That is, if you wanted to start consuming a stream of edits
starting in March 2013, you could do it.

> You could consider not implementing streaming /at all/, and just ask
> clients to poll an http endpoint, which is much easier to implement
> client-side than anything streaming (especially when it comes to handling
> disconnects).

True, but I think this would change the way people interact with this
data.  But, maybe that is ok?  I’m not sure.  I’m not a browser developer,
so I don’t know a lot about what is easy or hard in browsers (which is why
I started this thread :) ). But, keeping the stream model intact will be
powerful.  A public resumable stream of Wikimedia events would allow folks
outside of WMF networks to build realtime stream processing tooling on top
of our data.  Folks with their own Spark or Flink or Storm clusters (in
Amazon or labs or wherever) could consume this and perform complex stream
processing (e.g. machine learning algorithms (like ORES), windowed trending
aggregations, etc.).

> Why not expose the websockets as a standard websocket server so that it
> can be consumed by any language/platform that has a standard websocket
> implementation?

This is a good question, and not something I had considered.  I started
with socket.io because that was what RCStream used, and it seemed to have a
lot of really nice abstractions and solved problems that I’d have to deal
with myself if I used websockets.  I had assumed that socket.io was
generally preferred to working with websockets, but maybe this is not the
case?



[1]
https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_public_event_datasource
[2] https://github.com/mediawiki-utilities/python-mwevents



Re: Public Event Streams (AKA RCStream replacement) question

Brad Jorsch (Anomie)
On Sun, Sep 25, 2016 at 10:02 AM, Merlijn van Deen (valhallasw) <
[hidden email]> wrote:

> You could consider not implementing streaming /at all/, and just ask
> clients to poll an http endpoint, which is much easier to implement
> client-side than anything streaming (especially when it comes to handling
> disconnects).
>

On the other hand, polling requires repeated TCP handshakes, repeated HTTP
headers sent and received, all that work done even when there aren't any
new events, non-real-time reception of events (i.e. you only get events
when you poll), and a decision about what an acceptable minimum polling
interval is.

And chances are that clients that want to do polling are already doing it
with the action API. ;) Although I don't know what events are planned to be
made available from this new service to be able to say whether they're all
already available via the action API.
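
Concretely, such a poll against the action API is a standard
list=recentchanges query, something like:

    import time
    import requests

    API = 'https://en.wikipedia.org/w/api.php'
    params = {
        'action': 'query',
        'list': 'recentchanges',
        'rcprop': 'title|user|timestamp',
        'rcdir': 'newer',  # list oldest to newest so we can resume
        'rclimit': 50,
        'format': 'json',
    }

    last_seen = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
    while True:
        # rcstart is inclusive, so the boundary change repeats; a real
        # client would deduplicate.
        params['rcstart'] = last_seen
        data = requests.get(API, params=params).json()
        for rc in data['query']['recentchanges']:
            print(rc['timestamp'], rc['user'], rc['title'])
            last_seen = rc['timestamp']
        time.sleep(10)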


--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: Public Event Streams (AKA RCStream replacement) question

Gergo Tisza
On Mon, Sep 26, 2016 at 5:57 AM, Andrew Otto <[hidden email]> wrote:

>  A public resumable stream of Wikimedia events would allow folks
> outside of WMF networks to build realtime stream processing tooling on top
> of our data.  Folks with their own Spark or Flink or Storm clusters (in
> Amazon or labs or wherever) could consume this and perform complex stream
> processing (e.g. machine learning algorithms (like ORES), windowed trending
> aggregations, etc.).
>

I recall WMDE trying something similar a year ago (via PubSubHubbub) and
getting vetoed by ops. If they are not aware yet, it might be worth contacting
them and asking if the new streaming service would cover their use cases
(it was about Wikidata change invalidation on third-party wikis, I think).

Re: Public Event Streams (AKA RCStream replacement) question

Daniel Kinzler
Hey Gergo, thanks for the heads up!

The big question here is: how does it scale? Sending events to 100 clients
may work, but does it work for 100 thousand?

And then there are several more important details to sort out: What's the
granularity of subscription - a wiki? A page? Where does filtering by
namespace etc happen? How big is the latency? How does recovery/re-sync
work after disconnect/downtime?

I have not read the entire conversation, so the answers might already be
there - my apologies if they are, just point me there.

Anyway, if anyone has a good solution for sending wiki-events to a large number
of subscribers, yes, please let us (WMDE/Wikidata) know about it!



--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: Public Event Streams (AKA RCStream replacement) question

Andrew Otto
> The big question here is: how does it scale?
This new service is stateless and is backed by Kafka.  So, theoretically at
least, it should be horizontally scalable. (Add more Kafka brokers, add
more service workers.)


> And then there are several more important details to sort out: What's
> the granularity of subscription?

A topic, which is generically defined and does not need to be tied to
anything MediaWiki-specific.  If you are interested in recentchanges
events, the granularity will be the same as RCStream.

(Well ok, technically the granularity is topic-partition.  But for streams
with low enough volume, topics will only have a single partition, so in
practice the granularity is topic.)


> Where does filtering by namespace etc happen?

Filtering is not yet totally hammered out.  We aren’t sure what kind of
server-side filtering we want to actually support in production.  Ideally
we’d get real fancy and allow complex filtering, but there are likely
performance and security concerns here.  Even so, filtering will be
configured by the client, and at the least you will be able to do glob
filtering on any number of keys, and maybe on an array of possible values.
E.g. if you wanted to filter recentchanges events for plwiki and namespace
== 0, the filters might look like:

{
   "database": "plwiki",
   "page_namespace": 0
}


> How big is the latency?
For MediaWiki origin streams, in normal operation, probably around a few
seconds.  This highly depends on how many Kafka clusters we have to go
through before the event gets to the one from which this service is
backed.  This isn’t productionized yet, so we aren’t totally sure which
Kafka cluster these events will be served from.


> How does recovery/re-sync work after disconnect/downtime?
Events will be given to the client with their offsets in the stream.
During connection, a client can configure the offset that it wants to start
consuming at.  This is kind of like seeking to a particular location in a
file, but instead of a byte offset, you are starting at a certain event
offset in the stream.  In the future (when Kafka supports it), we will
support timestamp-based subscription as well.  E.g. ‘subscribe to
recentchanges events starting at time T.’  This will only work as long as
events at offset N or time T still exist in Kafka.  Kafka is usually used as
a rolling buffer from which old events are removed.  We will at least keep
events for 7 days, but at this time I don’t see a technical reason we
couldn’t keep events for much longer.
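
If we end up on SSE, this maps directly onto the protocol’s built-in
resumption: the server tags each event with an id, and a reconnecting
client sends the last one back in the Last-Event-ID request header.  A
sketch, with a made-up URL and assuming the id carries the Kafka offset:

    import json
    import requests

    last_id = None  # persist this somewhere durable between runs

    while True:
        headers = {'Last-Event-ID': last_id} if last_id else {}
        response = requests.get(
            'https://stream.example.org/v1/stream/recentchange',
            headers=headers, stream=True)
        for line in response.iter_lines():
            if line.startswith(b'id: '):
                # Offset of the event that follows; echoed back on
                # reconnect so the server can resume from there.
                last_id = line[len(b'id: '):].decode('utf-8')
            elif line.startswith(b'data: '):
                print(json.loads(line[len(b'data: '):]))
        # Connection dropped; loop around and resume from last_id.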


> Anyway, if anyone has a good solution for sending wiki-events to a large
> number of subscribers, yes, please let us (WMDE/Wikidata) know about it!

The first use case is not something like this.  The upcoming production
deployment will likely not be large enough to support many thousands of
connections.  BUT!  There is no technical reason we couldn’t.  If all goes
well, and WMF can be convinced to buy enough hardware, this may be
possible! :)

Re: Public Event Streams (AKA RCStream replacement) question

Marko Obrovac
Hello,

Regarding Wikidata, it is important to make the distinction here between
WMF-internal use and public-facing facilities. The underlying
sub-system that the public event streams will be relying on is called
EventBus~[1], which currently comprises:

(i) The producer HTTP proxy service. It allows (internal) users to produce
events using a REST HTTP interface. It also validates events against the
currently-supported set of JSON event schemas~[2].
(ii) The Kafka cluster, which is in charge of queuing the produced events
and delivering them to consumer clients. The event streams are separated
into topics, e.g. a revision-create topic, a page-move topic, etc.
(iii) The Change Propagation service~[3]. It is the main Kafka consumer at
this point. In its most basic form, it executes HTTP requests triggered by
user-defined rules for certain topics. The aim of the service is to be able
to update dependent entities starting from a resource/event. One example is
recreating the needed data for a page when it is edited. When a user edits
a page, ChangeProp receives an event in the revision-create topic and sends
a no-cache request to RESTBase to render it. After RB has completed the
request, another request is sent to the mobile content service to do the
same, because the output of the mobile content service for a given page
relies on the latest RB/Parsoid HTML.

Currently, the biggest producer of events is MediaWiki itself. The aim of
this e-mail thread is to add a fourth component to the system - the public
event stream consumption. However, for the Wikidata case, we think the
Change Propagation service should be used (i.e. we need to keep it
internal). If you recall, Daniel, we did kind of start talking about
putting WD updates onto EventBus in Esino Lario.
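
For a feel of component (i), producing an event is a plain HTTP POST of
schema-conforming JSON. A sketch, with a hypothetical internal endpoint and
a heavily elided event body (the required fields live in the schemas~[2]):

    import requests

    # Hypothetical internal endpoint; EventBus is not publicly reachable.
    EVENTBUS = 'http://eventbus.internal.example:8085/v1/events'

    event = {
        'meta': {
            'topic': 'mediawiki.revision-create',  # stream to produce to
            'domain': 'pl.wikipedia.org',
            # id, dt, uri, request_id etc. are also required by the
            # schema; elided here.
        },
        # ...plus schema-specific fields such as database, page_title, ...
    }

    # The proxy validates against the JSON schema before queueing to
    # Kafka; a 2xx response means the event was accepted.
    requests.post(EVENTBUS, json=[event]).raise_for_status()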

I've in-lined responses to your questions below.

On 27 September 2016 at 14:50, Daniel Kinzler <[hidden email]>
wrote:

> The big question here is: how does it scale? Sending events to 100
> clients may work, but does it work for 100 thousand?

Yes, it does - albeit not instantly. We limit the concurrency of execution
to mitigate huge spikes and overloading the system. For example, Change
Propagation handles template transclusions: when a template is edited, all
of the pages it is transcluded in need to be re-rendered, i.e. their HTML
has to be recreated. For important templates, that might mean re-rendering
millions of pages. The queue is populated with the relevant pages and the
backlog is "slowly" processed. "Slowly" here refers to the fact that at
most X pages are re-rendered at the same time, where X is governed by the
concurrency factor. In the concrete example of important templates, it
usually takes a couple of days to go through the backlog of re-renders.


> And then there are several more important details to sort out: What's the
> granularity of subscription - a wiki? A page? Where does filtering by
> namespace etc happen?


As Andrew noted, the basic granularity is the topic, i.e. the type/schema
of the events that are to be received. Roughly, that means that a consumer
can obtain either all page edits, or page renames (for all WMF wikis)
without performing any kind of filtering. Change Propagation, however,
allows one to filter events out based on any of the fields contained in the
events themselves, which means you are able to receive only events for a
specific wiki, a specific page or namespace. For example, Change
Propagation already handles situations where a Wikidata item is edited: it
re-renders the page summaries for all pages that the given item is
transcluded in, but does so only for the www.wikidata.org domain and
namespace 0~[4].


> How big is the latency?


For MediaWiki events, the observed latency of acting on an event has been
at most a couple of hundred milliseconds on average, but it is usually
below that threshold. There are some events, though, which lag behind by up
to a couple of days, most notably big template updates / transclusions. This
graph~[5] plots Change Propagation's delay in processing the events for
each defined rule. The "backlog per rule" metric measures the delay between
event production and event consumption. Here, event production refers to
the timestamp at which MediaWiki observed the event, while event consumption
refers to the time at which Change Propagation dequeues it from Kafka and
starts executing it.


> How does recovery/re-sync work after disconnect/downtime?

Because relying on EventBus and, specifically, Change Propagation means
consuming events via push HTTP requests, the receiving entity does not have
to worry about this in this context (public event streams are a different
matter, though). EventBus handles offsets internally, so even if Change
Propagation stops working for some time or cannot connect to Kafka, it will
resume processing events from where it left off once the pipeline is
accessible again. If, on the other hand, the service receiving the HTTP
requests is down or unreachable, Change Propagation has a built-in retry
mechanism that is triggered to resend requests whenever an erroneous
response is received from the service.

I hope this helps. I'd be happy to talk about this specific topic some
more.

Cheers,
Marko





[1] https://www.mediawiki.org/wiki/EventBus
[2]
https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema
[3] https://www.mediawiki.org/wiki/Change_propagation
[4]
https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/blob/ea8cdf85e700b74918a3e59ac6058a1a952b3e60/scap/templates/config.yaml.j2#L556
[5]
https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=10&fullscreen

--
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation

Re: Public Event Streams (AKA RCStream replacement) question

Andrew Otto
Thanks for the feedback, everyone!

Due to the simplicity of the HTTP stream model, we are moving forward with
that, instead of websockets/socket.io.  We hope to have an initial version
of this serving existing EventBus events this quarter.  Next we will focus
on more features (filtering), and also work towards deprecating both
RCStream and RCFeed.

You can follow progress of this effort on Phabricator:
https://phabricator.wikimedia.org/T130651



