Peer-to-peer sharing of the content of Wikipedia through WebRTC


Yeongjin Jang

Hi,

I am Yeongjin Jang, a Ph.D. Student at Georgia Tech.

In our lab (SSLab, https://sslab.gtisc.gatech.edu/),
we are working on a project called B2BWiki,
which enables users to share the contents of Wikipedia through WebRTC
(peer-to-peer sharing).

The website is here: http://b2bwiki.cc.gatech.edu/

The project aims to help Wikipedia by donating computing resources
from the community; users can donate their traffic (via P2P communication)
and storage (IndexedDB) to reduce the load on Wikipedia's servers.
Larger organizations, e.g. schools or companies with
many local users, can donate a mirror server,
similar to GNU FTP servers, which can bootstrap peer sharing.


Potential benefits we see are the following:
1) Users can easily donate their resources to the community.
Just visit the website.

2) Users get a performance benefit when a page is loaded from
multiple local peers or a local mirror (faster page load times!).

3) Wikipedia can reduce its server workload, network traffic, etc.

4) Local network operators can reduce transit traffic
(e.g. the cost of delivering traffic outside the local network).


While we work on enhancing the implementation,
we would like to hear the opinions of Wikipedia's actual developers.
For example, we want to know whether our direction is correct
(will it actually reduce the load?), and whether there are other concerns
we missed that could prevent this system from
working as intended. We really want to do meaningful work
that actually helps run Wikipedia!

Please feel free to send us any suggestions, comments, etc.
If you would rather express your opinion privately,
please contact [hidden email].

Thanks,

--- Appendix ---

I added some detailed information about B2BWiki in the following.

# Accessing data
When accessing a page on B2BWiki, the browser queries peers first.
1) If peers exist that hold the content, a peer-to-peer download happens.
2) Otherwise, if there is no peer, the client downloads the content
from the mirror server.
3) If the mirror server does not have the content, it downloads it from
the Wikipedia server (one access for the first download, and for updates).
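The three-step fallback above can be sketched as follows (helper names are illustrative; the real B2BWiki client runs in the browser over WebRTC, not in Python):

```python
# Sketch of the access fallback: peers first, then the mirror, then Wikipedia.
def fetch_page(page_name, lookup_peers, mirror_fetch, wikipedia_fetch):
    """Return the page content from the cheapest available source."""
    for peer in lookup_peers(page_name):      # 1) query the lookup server for peers
        content = peer.download(page_name)    #    P2P download via WebRTC
        if content is not None:
            return content
    content = mirror_fetch(page_name)         # 2) no peer had it: ask the mirror
    if content is not None:
        return content
    return wikipedia_fetch(page_name)         # 3) mirror miss: one hit to Wikipedia
```
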


# Peer lookup
To enable content lookup for peers,
we run a lookup server that holds a page_name-to-peer map.
A client (a user's browser) can query the list of peers that
currently hold the content, and select a peer by freshness
(each entry has the hash/timestamp of the content and
the top 2 octets of the peer's IP address,
used to figure out whether it is a local peer), etc.
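The selection step might look like the sketch below (the record layout and field names are assumptions for illustration; the message only says each entry carries a hash/timestamp and the top two IP octets):

```python
# Sketch of peer selection: prefer peers on the local network (same top-2
# IP octets), and among those prefer the freshest content.
def select_peer(peers, my_ip_prefix):
    """peers: list of dicts like {"ip_prefix": "130.207", "timestamp": 123}."""
    def rank(peer):
        is_local = peer["ip_prefix"] == my_ip_prefix  # locality beats freshness
        return (is_local, peer["timestamp"])          # then newest content wins
    return max(peers, key=rank) if peers else None
```
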


# Update and integrity check
The mirror server updates its content once per day
(configurable, e.g. hourly).
The update check uses the If-Modified-Since header against the Wikipedia server.
On retrieving content from Wikipedia, the mirror server stamps a timestamp
and a SHA-1 checksum, to ensure the freshness of the data and its integrity.
When a client looks up and downloads content from peers,
it compares the SHA-1 checksum of the data
with the checksum from the lookup server.

In this setting, users can get older data
(they can configure how much staleness they tolerate,
e.g. 1 day, 3 days, 1 week old, etc.), and
integrity is guaranteed by the mirror/lookup server.
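The client-side check described above can be sketched like this (names are illustrative, not the actual B2BWiki API; the real check runs in browser JavaScript):

```python
import hashlib
import time

# Sketch: accept peer-supplied bytes only if their SHA-1 matches the checksum
# reported by the lookup server, and the mirror's timestamp is within the
# user's configured staleness tolerance.
def verify_content(data, expected_sha1, stamped_at, max_age_seconds, now=None):
    now = now if now is not None else time.time()
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        return False                      # integrity failure: checksum mismatch
    if now - stamped_at > max_age_seconds:
        return False                      # content staler than tolerated
    return True
```
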


More detailed information can be obtained from the following website.

http://goo.gl/pSNrjR
(URL redirects to SSLab@gatech website)

Please feel free to send us any suggestions, comments, etc.

Thanks,
--
Yeongjin Jang


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Pine W
Thanks for this initiative.

I think that concerns at the moment would be in the domains of privacy,
security, lack of WMF analytics instrumentation, and WMF fundraising
limitations.

That said, looking in the longer term, a number of us in the community are
interested in decreasing our dependencies on the Wikimedia Foundation as
insurance against possible catastrophes and as a backup plan in case of
another significant WMF dispute with the community. It might be worth
exploring the options for setting up Wikipedia on infrastructure outside of
WMF. I would be interested in talking with you to discuss this further;
please let me know if you have time for a Hangout session in early to mid
December.

Thank you for your interest!
Pine
On Nov 27, 2015 10:50 PM, "Yeongjin Jang" <[hidden email]>
wrote:

> [quoted original message trimmed]

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Brian Wolff
In reply to this post by Yeongjin Jang
On 11/28/15, Yeongjin Jang <[hidden email]> wrote:

> [quoted original message trimmed]

Hi,

This is some interesting stuff, and I think research along these lines
(That is, leveraging webrtc to deliver content in a P2P manner over
web browsers) will really change the face of the internet in the years
to come.

As for Wikipedia specifically (This is all just my personal opinion.
Others may disagree with me):

*Wikipedia is a fairly stable/mature site. I think we're past the
point where it's a good idea to experiment with experimental
technologies (although mirrors of Wikipedia are a good place to do
that). We need stability and proven technology.

*Bandwidth makes up a very small portion of WMF's expenses (I'm not
sure how small. Someone once told me that donation processing
takes up more money than raw bandwidth costs. I don't know if that's
true, but bandwidth is certainly not the biggest expense).

Your scheme primarily serves to offload bandwidth of cached content to
other people. But serving cached content (by which I mean, anonymous
users getting results from varnish) is probably the cheapest (in terms
of computational resources) part of our setup. The hard part is things
like parsing wikitext->html, and otherwise generating pages.

*24 hours to page update is generally considered much too slow
(tolerance for anons is probably a bit higher than for logged-in users,
but still). People expect their changes to appear for everyone,
immediately. We want the delay to be in seconds, not days. I think it's
unlikely that any sort of expiry scheme would be acceptable. We need
active cache invalidation upon edit.

*Lack of analysis of scalability (I just briefly skimmed the Google
cache version of the page you linked [your webserver kept
resetting the connection], so it's possible I missed this). I didn't
see any analysis of how your system scales with load. Perhaps that's
because you're still in the development stage, and the design isn't
finalized(?) Anyway, scalability analysis is important when we're
talking about Wikipedia. Does this design still work if you have
100,000 (or even more) peers?

*Privacy concerns - Would a malicious person be able to force
themselves to be someone's preferred peer, and spy on everything they
read, etc.?

*DOS concerns - Would a malicious peer or peers be able to prevent an
honest user from connecting to the website? (I didn't look in detail
at how you select peers and handle peer failure, so I'm not sure if
this applies)

--

Anyway, I think this sort of thing is interesting, but I think your
system would be more suited to people running a small static website
who want to scale to very high numbers, rather than Wikipedia's use
case.

--
-bawolff


Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Yeongjin Jang
In reply to this post by Pine W
Thank you for your attention! We would love to talk with you.

Regarding a Hangout meeting, we are available M-F next week
and in the following weeks. Please note that we are in the
US Eastern time zone, so 10am - 8pm EST works best
for us.

On Sat, Nov 28, 2015 at 3:50 AM, Pine W <[hidden email]> wrote:

> Thanks for this initiative.
>
> I think that concerns at the moment would be in the domains of privacy,
> security, lack of WMF analytics intstrumentation, and WMF fundraising
> limitations.
>
> That said, looking in the longer term, a number of us in the community are
> interested in decreasing our dependencies on the Wikimedia Foundation as
> insurance against possible catastrophes and as a backup plan in case of
> another significant WMF dispute with the community. It might be worth
> exploring the options for setting up Wikipedia on infrastructure outside of
> WMF. I would be interested in talking with you to discuss this further;
> please let me know if you have time for a Hangout session in early to mid
> December.
>
> Thank you for your interest!
> Pine
> On Nov 27, 2015 10:50 PM, "Yeongjin Jang" <[hidden email]>
> wrote:
>
> >
> > Hi,
> >
> > I am Yeongjin Jang, a Ph.D. Student at Georgia Tech.
> >
> > In our lab (SSLab, https://sslab.gtisc.gatech.edu/),
> > we are working on a project called B2BWiki,
> > which enables users to share the contents of Wikipedia through WebRTC
> > (peer-to-peer sharing).
> >
> > Website is at here: http://b2bwiki.cc.gatech.edu/
> >
> > The project aims to help Wikipedia by donating computing resources
> > from the community; users can donate their traffic (by P2P communication)
> > and storage (indexedDB) to reduce the load of Wikipedia servers.
> > For larger organizations, e.g. schools or companies that
> > have many local users, they can donate a mirror server
> > similar to GNU FTP servers, which can bootstrap peer sharing.
> >
> >
> > Potential benefits that we think of are following.
> > 1) Users can easily donate their resources to the community.
> > Just visit the website.
> >
> > 2) Users can get performance benefit if a page is loaded from
> > multiple local peers / local mirror (page load time got faster!).
> >
> > 3) Wikipedia can reduce its server workload, network traffic, etc.
> >
> > 4) Local network operators can reduce network traffic transit
> > (e.g. cost that is caused by delivering the traffic to the outside).
> >
> >
> > While we are working on enhancing the implementation,
> > we would like to ask the opinions from actual developers of Wikipedia.
> > For example, we want to know whether our direction is correct or not
> > (will it actually reduce the load?), or if there are some other concerns
> > that we missed, that can potentially prevent this system from
> > working as intended. We really want to do somewhat meaningful work
> > that actually helps run Wikipedia!
> >
> > Please feel free to give as any suggestions, comments, etc.
> > If you want to express your opinion privately,
> > please contact [hidden email].
> >
> > Thanks,
> >
> > --- Appendix ---
> >
> > I added some detailed information about B2BWiki in the following.
> >
> > # Accessing data
> > When accessing a page on B2BWiki, the browser will query peers first.
> > 1) If there exist peers that hold the contents, peer to peer download
> > happens.
> > 2) otherwise, if there is no peer, client will download the content
> > from the mirror server.
> > 3) If mirror server does not have the content, it downloads from
> > Wikipedia server (1 access per first download, and update).
> >
> >
> > # Peer lookup
> > To enable content lookup for peers,
> > we manage a lookup server that holds a page_name-to-peer map.
> > A client (a user's browser) can query the list of peers that
> > currently hold the content, and select the peer by its freshness
> > (has hash/timestamp of the content,
> > has top 2 octet of IP address
> > (figuring out whether it is local peer or not), etc.
> >
> >
> > # Update, and integrity check
> > Mirror server updates its content per each day
> > (can be configured to update per each hour, etc).
> > Update check is done by using If-Modified-Since header from Wikipedia
> > server.
> > On retrieving the content from Wikipedia, the mirror server stamps a
> > timestamp
> > and sha1 checksum, to ensure the freshness of data and its integrity.
> > When clients lookup and download the content from the peers,
> > client will compare the sha1 checksum of data
> > with the checksum from lookup server.
> >
> > In this settings, users can get older data
> > (they can configure how to tolerate the freshness of data,
> > e.g. 1day older, 3day, 1 week older, etc.), and
> > the integrity is guaranteed by mirror/lookup server.
> >
> >
> > More detailed information can be obtained from the following website.
> >
> > http://goo.gl/pSNrjR
> > (URL redirects to SSLab@gatech website)
> >
> > Please feel free to give as any suggestions, comments, etc.
> >
> > Thanks,
> > --
> > Yeongjin Jang
> >
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > [hidden email]
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> _______________________________________________
> Wikitech-l mailing list
> [hidden email]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
Yeongjin Jang

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Yeongjin Jang
In reply to this post by Brian Wolff
Thank you for your comments!


On Sat, Nov 28, 2015 at 2:33 PM, Brian Wolff <[hidden email]> wrote:

> [quoted original message trimmed]
> Hi,
>
> This is some interesting stuff, and I think research along these lines
> (That is, leveraging webrtc to deliver content in a P2P manner over
> web browsers) will really change the face of the internet in the years
> to come.
>
> As for Wikipedia specifically (This is all just my personal opinion.
> Others may disagree with me):
>
> *Wikipedia is a a fairly stable/mature site. I think we're past the
> point where its a good idea to experiment with experimental
> technologies (Although mirrors of Wikipedia are a good place to do
> that). We need stability and proven technology.
>


That's true. Our current prototype works well in testing,
but we are not yet sure about its robustness in the wild.
We want to make it more stable.


>
> *Bandwidth makes up a very small portion of WMF's expenses (I'm not
> sure how small. Someone once told me that donation processing costs
> takes up more money then raw bandwidth costs. Don't know if that's
> true, but bandwidth is certainly not the biggest expense).
>
> Your scheme primarily serves to offload bandwidth of cached content to
> other people. But serving cached content (by which I mean, anonymous
> users getting results from varnish) is probably the cheapest (in terms
> of computational resources) part of our setup. The hard part is things
> like parsing wikitext->html, and otherwise generating pages.
>

Yes, it can be.
But for network traffic bandwidth, we think it could benefit both
sides (the organization that runs B2BWiki, and Wikipedia), because
it would reduce not only the traffic that hits Wikipedia, but also
the egress traffic from the LAN (or ISP) to Wikipedia.

And we know that this is just a hypothesis, so we want to do an
analysis of the potential reduction in traffic/cost with network data stats.

Is there anywhere I can get stats for bandwidth,
such as daily traffic served by Wikipedia's servers?
(Please let me know if you know of any source.)

I visited several stats pages, such as

https://stats.wikimedia.org/EN/ChartsWikipediaEN.htm
https://en.wikipedia.org/wiki/Wikipedia:Statistics

, and those sites showed me how many page accesses and
edits happened, but not the traffic,
while the Ganglia site gave me stats that were too fine-grained.


And what we got from our local network data at Georgia Tech
shows about 50GB of downloads per day from Wikipedia
(including wikipedia.org and upload.wikimedia.org, based on source
IP address).
From a simple calculation, the cost of 18TB / year would be
around $600, for serving a 30K-person organization.
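As a back-of-the-envelope check of the figures above (the $600/year cost is the poster's own estimate; here we only verify the unit conversion and the per-GB cost it implies):

```python
# 50 GB/day of observed Wikipedia traffic extrapolates to ~18 TB/year.
daily_gb = 50
yearly_gb = daily_gb * 365            # 18,250 GB
yearly_tb = yearly_gb / 1000          # ~18.25 TB per year
cost_per_gb = 600 / yearly_gb         # implied transit cost: ~$0.033/GB
```
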



> *24 hours to page update is generally considered much too slow
> (Tolerance for anons is probably a bit higher than logged in users,
> but still).  People expect their changes to appear for everyone,
> immediately. We want the delay to be in seconds, not days. I think its
> unlikely that any sort of expire scheme would be acceptable. We need
> active cache invalidation upon edit.
>
>
Yes, we have thought about cache expiry/update/invalidation;
for example, whenever a peer detects an update, it propagates the update
to its neighbors to update/invalidate their caches.
We did not concentrate on that part much,
as we had similar thoughts that non-editors are more tolerant of staleness.
The scheme is not in there yet, but it is worth implementing if
editors care about it.
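The neighbor-propagation idea could look like the sketch below (purely illustrative; as noted above, no such scheme exists in B2BWiki yet, and all names are hypothetical):

```python
# Flood an invalidation for (page, version) through the peer graph, using a
# "seen" set so cycles do not cause re-broadcast loops.
def propagate_invalidation(peer, page, version, seen=None):
    seen = seen if seen is not None else set()
    if (peer.name, page, version) in seen:
        return seen                          # this peer already handled it
    seen.add((peer.name, page, version))
    peer.cache.pop(page, None)               # drop the stale local copy
    for neighbor in peer.neighbors:
        propagate_invalidation(neighbor, page, version, seen)
    return seen
```
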


> *Lack of analysis of scalability (I just briefly skimmed the google
> cache version of the page you linked [your webserver had the
> connection keep being reset], so its possible I missed this). I didn't
> see any analysis of how your system scales with load. Perhaps that's
> because you're still in the development stage, and the design isn't
> finalized(?) Anyways, scalability analysis is important when we're
> talking about wikipedia. Does this design still work if you have
> 100,000 (or even more) peers?
>

We think scalability is very important for B2BWiki.
While running micro-benchmarks on the B2BWiki servers,
we realized that the current one cannot support more than 10K concurrent
peers. We have been updating the internal structure since September,
and we are now targeting support for up to 100K peers
(we think this would be enough to
support a metropolitan local area).

An implementation with the new data structure will be available very soon;
we hope to show a good result that demonstrates its scalability.



>
> *Privacy concerns - Would a malicious person be able to force
> themselves to be someone's preferred peer, and spy on everything they
> read, etc.
>
> *DOS concerns - Would a malicious peer or peers be able to prevent an
> honest user from connecting to the website? (I didn't look in detail
> at how you select peers and handle peer failure, so I'm not sure if
> this applies)
>
>
Nice points! For privacy, we want to implement a k-anonymity scheme for
page access. However, it incurs more bandwidth consumption and
potential performance overhead on the system.

Malicious peers can act as if they hold legitimate content
(while actually not), or make null requests to peers.
We are currently thinking about blacklisting such malicious peers,
and live-migrating mirror/peer servers if they fail,
but a more fundamental remedy is required.
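One common way to approximate k-anonymity for lookups, sketched here only as an illustration (the message does not specify B2BWiki's actual scheme), is to hide the real request among k-1 decoys:

```python
import random

# Issue the real page lookup mixed with k-1 decoy lookups in random order,
# so the lookup server cannot tell which of the k pages was actually wanted.
def k_anonymous_lookup(real_page, decoy_pool, k, send_query):
    decoys = random.sample([p for p in decoy_pool if p != real_page], k - 1)
    batch = decoys + [real_page]
    random.shuffle(batch)                     # hide the real page's position
    return {page: send_query(page) for page in batch}
```

The trade-off mentioned above is visible here: every real lookup costs k queries' worth of bandwidth.
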

--

>
> Anyways, I think this sort of think is interesting, but I think your
> system would be more suited to people running a small static website,
> that want to scale to very high numbers, rather than Wikipedia's use
> case.
>
> --
> -bawolff
>



--
Yeongjin Jang

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Brian Wolff
On Saturday, November 28, 2015, Yeongjin Jang <[hidden email]>
wrote:

> Thank you for your comments!
>
>
> On Sat, Nov 28, 2015 at 2:33 PM, Brian Wolff <[hidden email]> wrote:
>
>> On 11/28/15, Yeongjin Jang <[hidden email]> wrote:
>> >
>> > Hi,
>> >
>> > I am Yeongjin Jang, a Ph.D. Student at Georgia Tech.
>> >
>> > In our lab (SSLab, https://sslab.gtisc.gatech.edu/),
>> > we are working on a project called B2BWiki,
>> > which enables users to share the contents of Wikipedia through WebRTC
>> > (peer-to-peer sharing).
>> >
>> > Website is at here: http://b2bwiki.cc.gatech.edu/
>> >
>> > The project aims to help Wikipedia by donating computing resources
>> > from the community; users can donate their traffic (by P2P communication)
>> > and storage (indexedDB) to reduce the load of Wikipedia servers.
>> > For larger organizations, e.g. schools or companies that
>> > have many local users, they can donate a mirror server
>> > similar to GNU FTP servers, which can bootstrap peer sharing.
>> >
>> >
>> > Potential benefits that we think of are following.
>> > 1) Users can easily donate their resources to the community.
>> > Just visit the website.
>> >
>> > 2) Users can get performance benefit if a page is loaded from
>> > multiple local peers / local mirror (page load time got faster!).
>> >
>> > 3) Wikipedia can reduce its server workload, network traffic, etc.
>> >
>> > 4) Local network operators can reduce network traffic transit
>> > (e.g. cost that is caused by delivering the traffic to the outside).
>> >
>> >
>> > While we are working on enhancing the implementation,
>> > we would like to ask the opinions from actual developers of Wikipedia.
>> > For example, we want to know whether our direction is correct or not
>> > (will it actually reduce the load?), or if there are some other concerns
>> > that we missed, that can potentially prevent this system from
>> > working as intended. We really want to do somewhat meaningful work
>> > that actually helps run Wikipedia!
>> >
>> > Please feel free to give as any suggestions, comments, etc.
>> > If you want to express your opinion privately,
>> > please contact [hidden email].
>> >
>> > Thanks,
>> >
>> > --- Appendix ---
>> >
>> > I added some detailed information about B2BWiki in the following.
>> >
>> > # Accessing data
>> > When accessing a page on B2BWiki, the browser will query peers first.
>> > 1) If there exist peers that hold the contents, peer to peer download
>> > happens.
>> > 2) otherwise, if there is no peer, client will download the content
>> > from the mirror server.
>> > 3) If mirror server does not have the content, it downloads from
>> > Wikipedia server (1 access per first download, and update).
>> >
>> >
>> > # Peer lookup
>> > To enable content lookup for peers,
>> > we manage a lookup server that holds a page_name-to-peer map.
>> > A client (a user's browser) can query the list of peers that
>> > currently hold the content, and select the peer by its freshness
>> > (has hash/timestamp of the content,
>> > has top 2 octet of IP address
>> > (figuring out whether it is local peer or not), etc.
>> >
>> >
>> > # Update, and integrity check
>> > Mirror server updates its content per each day
>> > (can be configured to update per each hour, etc).
>> > Update check is done by using If-Modified-Since header from Wikipedia
>> > server.
>> > On retrieving the content from Wikipedia, the mirror server stamps a
>> > timestamp
>> > and sha1 checksum, to ensure the freshness of data and its integrity.
>> > When clients lookup and download the content from the peers,
>> > client will compare the sha1 checksum of data
>> > with the checksum from lookup server.
>> >
>> > In this settings, users can get older data
>> > (they can configure how to tolerate the freshness of data,
>> > e.g. 1day older, 3day, 1 week older, etc.), and
>> > the integrity is guaranteed by mirror/lookup server.
>> >
>> >
>> > More detailed information can be obtained from the following website.
>> >
>> > http://goo.gl/pSNrjR
>> > (URL redirects to SSLab@gatech website)
>> >
>> > Please feel free to give as any suggestions, comments, etc.
>> >
>> > Thanks,
>> > --
>> > Yeongjin Jang
>> >
>> >
>>
>> Hi,
>>
>> This is some interesting stuff, and I think research along these lines
>> (That is, leveraging webrtc to deliver content in a P2P manner over
>> web browsers) will really change the face of the internet in the years
>> to come.
>>
>> As for Wikipedia specifically (This is all just my personal opinion.
>> Others may disagree with me):
>>
>> *Wikipedia is a fairly stable/mature site. I think we're past the
>> point where it's a good idea to experiment with experimental
>> technologies (Although mirrors of Wikipedia are a good place to do
>> that). We need stability and proven technology.
>>
>
>
> That's true. Our current prototype works well in testing,
> but we are not sure about its robustness in the wild yet.
> We want to develop this to be more stable.
>
>
>>
>> *Bandwidth makes up a very small portion of WMF's expenses (I'm not
>> sure how small. Someone once told me that donation processing costs
>> take up more money than raw bandwidth costs. Don't know if that's
>> true, but bandwidth is certainly not the biggest expense).
>>
>> Your scheme primarily serves to offload bandwidth of cached content to
>> other people. But serving cached content (by which I mean, anonymous
>> users getting results from varnish) is probably the cheapest (in terms
>> of computational resources) part of our setup. The hard part is things
>> like parsing wikitext->html, and otherwise generating pages.
>>
>
> Yes, it can be.
> But for network traffic bandwidth, we think it could benefit both
> sides (an organization that runs B2BWiki, and Wikipedia), because
> it would reduce not only the traffic that hits Wikipedia, but also
> egress traffic from the LAN (or ISP) to Wikipedia.
>
> We know that this is just a hypothesis, so we want to do an
> analysis of the potential reduction in traffic/cost with network data stats.
>
> Is there anywhere I can get stats for the bandwidth,
> such as daily traffic served by Wikipedia servers?
> (Please let me know if you know of any source.)
>

Maybe the (third column of) projectcounts files at
http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-11/

Financial cost of bandwidth would probably be in one of the annual plan
documents somewhere.
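If you want to turn those files into a bandwidth estimate, something like
the following rough sketch should work (this assumes each projectcounts
line is whitespace-separated `project - pageviews bytes`; double-check the
format notes next to the dumps, since I may be misremembering the columns):

```python
def daily_bytes(lines, project_prefix="en"):
    """Sum the bytes column of projectcounts-style lines for one project.

    Assumes the layout "<project> - <pageviews> <bytes>" per line.
    """
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        project, _, _views, nbytes = fields[:4]
        if project == project_prefix or project.startswith(project_prefix + "."):
            total += int(nbytes)
    return total

# made-up sample lines, just to show the shape of the data
sample = [
    "en - 1000 5000000",
    "en.m - 400 2000000",
    "de - 300 1500000",
]
print(daily_bytes(sample))  # → 7000000 (en + en.m in this sample)
```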

> I visited several stats pages, such as
>
> https://stats.wikimedia.org/EN/ChartsWikipediaEN.htm
> https://en.wikipedia.org/wiki/Wikipedia:Statistics
>
> , and those sites told me how many page accesses and
> edits happened, but not the traffic,
> while the ganglia site gave me stats that were too fine-grained.
>
>
> From our local network data at Georgia Tech,
> we see about 50GB of downloads per day
> from Wikipedia
> (including wikipedia.org and upload.wikimedia.org, based on source
> IP address).
> By a simple calculation, the cost of 18TB/year would be
> around $600 for serving a 30K-person organization.
>
>
>
>> *24 hours to page update is generally considered much too slow
>> (Tolerance for anons is probably a bit higher than for logged-in users,
>> but still).  People expect their changes to appear for everyone,
>> immediately. We want the delay to be in seconds, not days. I think it's
>> unlikely that any sort of expiry scheme would be acceptable. We need
>> active cache invalidation upon edit.
>>
>>
> Yes, we have thought about cache expiry/update/invalidation;
> for example, whenever a peer detects an update, it propagates the update
> to its neighbors to update/invalidate their caches.
> We did not concentrate on that part much,
> as we had a similar thought that non-editors can be more tolerant of staleness.
> The scheme is not in there yet, but it is worth implementing if
> editors care about it.
>
>
>> *Lack of analysis of scalability (I just briefly skimmed the google
>> cache version of the page you linked [your webserver kept resetting the
>> connection], so it's possible I missed this). I didn't
>> see any analysis of how your system scales with load. Perhaps that's
>> because you're still in the development stage, and the design isn't
>> finalized(?) Anyways, scalability analysis is important when we're
>> talking about wikipedia. Does this design still work if you have
>> 100,000 (or even more) peers?
>>
>
> We think scalability is very important for B2BWiki.
> While micro-benchmarking the servers of B2BWiki,
> we realized that the current one cannot support more than 10K concurrent
> peers. We've been updating the internal structure since September,
> and now we are targeting support for up to 100K peers
> (we think this would be enough for
> supporting a metropolitan local area).
>
> An implementation with the new data structure will be available very soon;
> I hope we can show a good result that demonstrates its scalability.
>
The other scalability concern would be for obscure articles. I haven't
really looked at your code, so maybe you cover it - but wikipedia has over
5 million articles (and a lot more when you count non-content pages). The
group of peers is presumably going to have high churn (since they go away
when you browse somewhere else). I'd worry that the overhead of keeping
track of which peer knows what, especially given how fast the peers
change, would be a lot. I also expect that for lots of articles, only a
very small number of peers will know them.



Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Federico Leva (Nemo)
On the matter of who has what etc., Brewster Kahle has opinions on
implementation:
http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-call-for-a-distributed-web-2/
«Fortunately, the needed technologies are now available in JavaScript,
Bitcoin, IPFS/Bittorrent, Namecoin, and others.»

Nemo


Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Yeongjin Jang
In reply to this post by Brian Wolff
On Sun, Nov 29, 2015 at 1:59 PM, Brian Wolff <[hidden email]> wrote:

> Maybe the (third column of) projectcounts files at
> http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-11/
>
> Financial cost of bandwidth would probably be in one of the annual plan
> documents somewhere.
>


Thank you for the pointer!

I presume that it only counts HTML data,
since the counted number is far smaller than the mediacounts stat result.
From projectcounts, traffic on 11/01/2015 is
around 4.4TB for en.wikipedia (including mobile),
and around 7TB across all languages.

And, from the mediacounts stats at
http://dumps.wikimedia.org/other/mediacounts/
there were 110TB of downloads referred from within WMF sites on that day.

I recall seeing a WMF financial statement that says around $2.3M
was spent on Internet Hosting. I am not sure whether it includes
management costs for computing resources
(server clusters such as eqiad) or not.

I am not sure the following simple calculation works:
117TB per day, for 365 days, at $0.05 per GB, comes to around $2.2M.
Maybe it would be more accurate if I contacted the analytics team directly.
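Spelling that calculation out explicitly (again, the $0.05/GB transit
price is only my assumption):

```python
daily_tb = 117        # HTML + media traffic per day, from the stats above
gb_per_tb = 1000      # decimal units
price_per_gb = 0.05   # assumed transit price in USD

yearly_cost = daily_tb * gb_per_tb * 365 * price_per_gb
print(round(yearly_cost))  # → 2135250, i.e. roughly $2.1M per year
```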



> The other scalability concern would be for obscure articles. I haven't
> really looked at your code, so maybe you cover it - but wikipedia has over
> 5 million articles (and a lot more when you count non-content pages). The
> group of peers is presumably going to have high churn (since they go away
> when you browse somewhere else). I'd worry that the overhead of keeping
> track of which peer knows what, especially given how fast the peers
> change, would be a lot. I also expect that for lots of articles, only a
> very small number of peers will know them.
>
>
That's true. Dynamically registering/un-registering entries in the lookup
table imposes high overhead on the servers (in both computation and memory
usage). Distributed solutions like a DHT exist, but we think there is a
trade-off in lookup time between the centralized approach (managing the
lookup table on the server) and a fully distributed architecture (DHT).

Our prior "naive" implementation's costs were such that, if each user has
5K pages cached (with around 50K images), then with 10K concurrent users
it consumes around 35GB of memory, and each registration incurs 500KB of
network traffic.

We thought that was not practical, so now we are trying to come up with a
more lightweight implementation. We hope to have practically meaningful
micro-benchmark results for the new implementation.
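For a sense of where the memory goes, the figures above imply a per-entry
cost in the lookup table of roughly 64 bytes (a back-of-envelope sketch
using only the numbers quoted in this thread):

```python
users = 10_000
pages_per_user = 5_000
images_per_user = 50_000

# one (resource, peer) entry per cached item per connected user
entries = users * (pages_per_user + images_per_user)

measured_memory = 35 * 10**9  # ~35GB observed in the naive implementation

bytes_per_entry = measured_memory / entries
print(entries, round(bytes_per_entry, 1))  # → 550000000 63.6
```

Shrinking that per-entry footprint (e.g. hashing page names to fixed-width
IDs) or sharding the table across servers are the obvious levers.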






--
Yeongjin Jang

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Chris Steipp
In reply to this post by Yeongjin Jang
On Sat, Nov 28, 2015 at 1:36 PM, Yeongjin Jang <[hidden email]>
wrote:

> Nice points! For privacy, we want to implement a k-anonymity scheme for
> page accesses. However, it incurs more bandwidth consumption and
> potential performance overhead on the system.
>
> Malicious peers can claim to hold legitimate content
> (while actually not holding it), or send null requests to other peers.
> We are currently considering blacklisting such malicious peers,
> and live migration of mirror/peer servers if they fail,
> but a more fundamental remedy is required.



Those are interesting ideas, although I'm skeptical you're going to be able
to successfully keep malicious peers from tracking users' reading habits,
in the same way that law enforcement tracks bittorrent downloads. But it
would be great to hear the proposals you come up with.

I haven't looked at the code, but are you also preventing malicious peers
from modifying the content?

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Ryan Lane-2
In reply to this post by Yeongjin Jang
On Sun, Nov 29, 2015 at 8:18 PM, Yeongjin Jang <[hidden email]>
wrote:

>
> I recall seeing a WMF financial statement that says around $2.3M
> was spent on Internet Hosting. I am not sure whether it includes
> management costs for computing resources
> (server clusters such as eqiad) or not.
>
>
That's the cost for datacenters, hardware, bandwidth, etc.


> I am not sure the following simple calculation works:
> 117TB per day, for 365 days, at $0.05 per GB, comes to around $2.2M.
> Maybe it would be more accurate if I contacted the analytics team directly.
>
>
That calculation doesn't work because it doesn't take into account peering
agreements, or donated (or heavily discounted) transit contracts. Bandwidth
is one of the cheaper overall costs.

Something your design doesn't take into account for bandwidth costs is that
the world is trending to mobile and mobile bandwidth costs are generally
very high. It's likely this p2p approach will be many orders of magnitude
more expensive than the current approach.

A decentralized approach doesn't benefit from economies of scale.
Instead of being able to negotiate transit pricing and eliminating cost
through peering, you're externalizing the cost at the consumer rate, which
is the highest possible rate.


> That's true. Dynamically registering/un-registering entries in the lookup
> table imposes high overhead on the servers (in both computation and memory
> usage). Distributed solutions like a DHT exist, but we think there is a
> trade-off in lookup time between the centralized approach (managing the
> lookup table on the server) and a fully distributed architecture (DHT).
>
> Our prior "naive" implementation's costs were such that, if each user has
> 5K pages cached (with around 50K images), then with 10K concurrent users
> it consumes around 35GB of memory, and each registration incurs 500KB of
> network traffic.
>
> We thought that was not practical, so now we are trying to come up with a
> more lightweight implementation. We hope to have practically meaningful
> micro-benchmark results for the new implementation.
>
>
Just the metadata for articles, images and revisions is going to be
massive. That data itself will need to be distributed too. The network
costs associated with just lookups are going to be quite high for peers.

It seems your project assumes that bandwidth is unlimited and unmetered,
which for many parts of the world isn't true.

I don't mean to dissuade you. The idea of a p2p Wikipedia is an interesting
project, and at some point in the future, if bandwidth is free and unmetered
everywhere, this may be a reasonable way to provide a method of access in
case of a major disaster affecting Wikipedia itself. This idea has been
brought up numerous times in the past, though, and in general the potential
gains never outweigh the latency, cost, and complexity associated with it.

- Ryan

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

C. Scott Ananian
In reply to this post by Yeongjin Jang
This is something we really wanted to have at One Laptop Per Child, and I'm
glad you're looking into it!

In our case, we wanted to be able to provide schools a complete copy of
wikipedia, but could only afford to dedicate about 100MB/student to the
cause (at the time, I think our entire machine had 4GB of storage).  We
ended up using a highly-compressed wikipedia "slice" combining the most
popular articles with a set of articles spidered from a list of educational
topics (so you'd be sure to get articles on all the elements, all the
planets, etc).  But what we really *wanted* to do was to split up the
content among the kids' machines, as you've done, so that between them all
they could have access to a much broader slice of wikipedia.

Nowadays, with lower storage costs, it's possible we could give students
a much broader slice of the *text* of wikipedia articles, and still use the
peer-to-peer approach to serve the *media* associated with the articles
(which is much larger than the text content).

The other side of this coin is supporting editability.  We were always
dissatisfied with our read-only slices of the Wiki -- true empowerment
means being able to add and edit content, not just passively consume it.
Of course, collaboratively editing a wikipedia in a peer-to-peer fashion is
a very interesting research project.  I wonder if you think this sort of
thing is in scope for your work.
 --scott


On Sat, Nov 28, 2015 at 1:45 AM, Yeongjin Jang <[hidden email]>
wrote:

> [...]
> --- Appendix ---
>
> I added some detailed information about B2BWiki in the following.
>
> # Accessing data
> When accessing a page on B2BWiki, the browser will query peers first.
> 1) If there exist peers that hold the contents, peer to peer download
> happens.
> 2) otherwise, if there is no peer, client will download the content
> from the mirror server.
> 3) If mirror server does not have the content, it downloads from
> Wikipedia server (1 access per first download, and update).
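The three-step fallback described above can be sketched as follows; the fetcher interfaces are hypothetical stand-ins for illustration, not B2BWiki's actual API:

```python
# Sketch of the access fallback chain: peers first, then the mirror,
# then the Wikipedia origin server. Each fetcher is a hypothetical
# callable returning the page content, or None if it has no copy.

def fetch_page(title, peer_fetch, mirror_fetch, origin_fetch):
    """Try each source in order and report where the page came from."""
    for source, fetch in (("peer", peer_fetch),
                          ("mirror", mirror_fetch),
                          ("origin", origin_fetch)):
        content = fetch(title)
        if content is not None:
            return source, content
    raise LookupError("page not found anywhere: " + title)
```

For example, if no peer holds the page, the mirror is consulted next, and the origin server is touched only when both fail.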
>
>
> # Peer lookup
> To enable content lookup for peers,
> we manage a lookup server that holds a page_name-to-peer map.
> A client (a user's browser) can query the list of peers that
> currently hold the content, and select a peer by its freshness
> (each entry has a hash/timestamp of the content) and locality
> (the top two octets of its IP address indicate whether it is
> a local peer), etc.
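The selection heuristic described above (prefer local peers, judged by the top two IP octets, then the freshest copy) might look like this; the field names and tie-breaking order are illustrative assumptions, not the project's actual schema:

```python
# Sketch of peer selection: prefer peers on the same /16 network
# (matching top two IP octets), then the newest timestamp.

def top_two_octets(ip):
    """Return the first two dotted-quad octets, e.g. '143.215'."""
    return ".".join(ip.split(".")[:2])

def pick_peer(peers, client_ip):
    """peers: list of dicts with 'ip' and 'timestamp' keys."""
    if not peers:
        return None
    local = top_two_octets(client_ip)
    # Tuple ordering: locality match first (True > False), then freshness.
    return max(peers,
               key=lambda p: (top_two_octets(p["ip"]) == local,
                              p["timestamp"]))
```

Note that a /16 prefix match is only a rough locality signal; a real deployment would likely combine it with measured round-trip times.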
>
>
> # Update, and integrity check
> The mirror server updates its content once per day
> (configurable, e.g. hourly).
> Update check is done by using If-Modified-Since header from Wikipedia
> server.
> On retrieving the content from Wikipedia, the mirror server stamps a
> timestamp
> and sha1 checksum, to ensure the freshness of data and its integrity.
> When clients lookup and download the content from the peers,
> client will compare the sha1 checksum of data
> with the checksum from lookup server.
>
> In this setting, users may get older data
> (they can configure how much staleness to tolerate,
> e.g. 1 day, 3 days, or 1 week old), and
> the integrity is guaranteed by the mirror/lookup server.
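The client-side check described above can be sketched as follows, assuming the lookup server hands the client an expected SHA-1 checksum and a timestamp for each copy (the function and parameter names are hypothetical):

```python
# Sketch of the client-side acceptance check: verify the SHA-1 of the
# peer-supplied content against the checksum from the lookup server,
# and reject copies older than the user's staleness tolerance.
import hashlib
import time

def acceptable(content, expected_sha1, stamped_at,
               max_age_seconds, now=None):
    """content: bytes; stamped_at/now: Unix timestamps."""
    now = time.time() if now is None else now
    fresh = (now - stamped_at) <= max_age_seconds
    intact = hashlib.sha1(content).hexdigest() == expected_sha1
    return fresh and intact
```

This catches tampered content, but note (as the thread implies) that integrity still depends on trusting the mirror/lookup server that issued the checksum.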
>
>
> More detailed information can be obtained from the following website.
>
> http://goo.gl/pSNrjR
> (URL redirects to SSLab@gatech website)
>
> Please feel free to give us any suggestions, comments, etc.
>
> Thanks,
> --
> Yeongjin Jang
>
>




--
(http://cscott.net)

Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Purodha Blissenbach
In reply to this post by Ryan Lane-2
On 30.11.2015 20:47, Ryan Lane wrote:

> On Sun, Nov 29, 2015 at 8:18 PM, Yeongjin Jang
> <[hidden email]>
> wrote:
>
>>
>> I recall that I saw financial statement of WMF that states around
>> $2.3M
>> was spent for Internet Hosting. I am not sure whether it includes
>> management cost for computing resources
>> (server clusters such as eqiad) or not.
>>
>>
> That's the cost for datacenters, hardware, bandwidth, etc..
>
>
>> Not sure following simple calculation works;
>> 117 TB per day, for 365 days, if $0.05 per GB, then it is around
>> $2.2M.
>> Maybe it would be more accurate if I contact analytics team
>> directly.
>>
>>
> That calculation doesn't work because it doesn't take into account
> peering
> agreements, or donated (or heavily discounted) transit contracts.
> Bandwidth
> is one of the cheaper overall costs.
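For reference, the back-of-envelope figure quoted above does work out arithmetically (assuming binary terabytes; with decimal TB it comes to about $2.14M), which underlines Ryan's point that the issue is the assumed retail rate, not the arithmetic:

```python
# Reproduce the quoted estimate: 117 TB/day for a year at a retail
# transit price of $0.05/GB (binary units: 1 TB = 1024 GB).
TB_PER_DAY = 117
PRICE_PER_GB = 0.05
annual_cost = TB_PER_DAY * 1024 * PRICE_PER_GB * 365
# roughly $2.19M, matching the "around $2.2M" figure quoted above
```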
>
> Something your design doesn't take into account for bandwidth costs
> is that
> the world is trending to mobile and mobile bandwidth costs are
> generally
> very high. It's likely this p2p approach will be many orders of
> magnitude
> more expensive than the current approach.
>
> A decentralized approach doesn't benefit from the economies of scale.
> Instead of being able to negotiate transit pricing and eliminating
> cost
> through peering, you're externalizing the cost at the consumer rate,
> which
> is the highest possible rate.

While that is often true, there are notable exceptions, growing both in
scale and number.

a) We have campus situations where a large university, company, or public
agency with tens or hundreds of thousands of peers runs a network that
they pay for anyway, that is needed for peers to connect to Wiki* anyway,
and that is available to peers at no (additional) cost. While external
traffic costs are of relatively little concern, quick response times often
are, especially in classroom situations where up to several hundred
students may look at the same articles virtually at once.

b) There is a fast-growing international movement for open wireless radio
networks (Freifunk) of comparatively low bandwidth in neighborhoods. Those
can benefit a lot from local peering, imho.

Purodha

>> The other scalability concern would be for obscure articles. I
>> haven't
>> really looked at your code, so maybe you cover it - but wikipedia
>> has over
>> 5 million articles (and a lot more when you count non-content
>> pages). The
>> group of peers is presumably going to have high churn (since they go
>> away
>> when you browse somewhere else). Id worry the overhead of keeping
>> track of
>> which peer knows what, especially given how fast the peers change to
>> be a
>> lot. I also expect that for lots of articles, only a very small
>> number of
>> peers will know them.
>>
>>
>
>> That's true. Dynamically registering / un-registering lookup table
>> gives
>> high
>> overhead on the servers (in both computation & memory usage).
>> Distributed solutions like DHT is there, but we think there could be
>> a
>> trade-off
>> on lookup time for using de-centralized (managing lookup table on
>> the
>> server)
>> versus fully distributed architecture (DHT).
>>
>> Our prior "naive" implementation costs like if each user has
>> 5K pages cached (with around 50K images),
>> when 10K concurrent user presents it consumes around 35GB
>> of memory, and each registering incurs 500K bytes of network
>> traffic.
>>
>> We thought it is not that useful, so now we are trying to come up
>> with
>> more lightweight implementation. We hope to have a practically
>> meaningful
>> micro-benchmark result on the new implementation.
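The numbers quoted above imply a per-entry cost that can be checked quickly (a rough sanity check, assuming the 35 GB is spent entirely on the page_name-to-peer map):

```python
# Rough per-entry cost implied by the quoted numbers: 10K concurrent
# peers, each registering 5K pages + 50K images, totalling ~35 GB of
# lookup state on the server.
PEERS = 10_000
ENTRIES_PER_PEER = 5_000 + 50_000   # pages + images
TOTAL_BYTES = 35 * 1024**3          # ~35 GB
bytes_per_entry = TOTAL_BYTES / (PEERS * ENTRIES_PER_PEER)
# roughly 68 bytes per map entry -- modest per entry, but the sheer
# entry count (550 million) is what makes the naive table so large
```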
>>
>>
> Just the metadata for articles, images and revisions is going to be
> massive. That data itself will need to be distributed too. The
> network
> costs associated with just lookups is going to be quite expensive for
> peers.
>
> It seems your project assumes that bandwidth is unlimited and
> unrated,
> which for many parts of the world isn't true.
>
> I don't mean to dissuade you. The idea of a p2p Wikipedia is an
> interesting
> project, and at some point in the future if bandwidth is free and
> unrated
> everywhere this may be a reasonable way to provide a method of access
> in
> case of major disaster of Wikipedia itself. This idea has been
> brought up
> numerous times in the past, though, and in general the potential
> gains are
> never better than the latency, cost, and complexity associated with
> it.
>
> - Ryan



Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Brian Wolff
On 11/30/15, Purodha Blissenbach <[hidden email]> wrote:

> On 30.11.2015 20:47, Ryan Lane wrote:
>> [...]
>
> While that is often true, there are notable exception, growing both in
> scale
> and number.
>
> a) We have campus situations where a large university, company, or
> public
> agency with tens or hundreds of thousands of peers run a network that
> they
> pay for anyways, that is needed for peers to connect to Wiki* anyways,
> and
> that is available to peers at no (additional) cost. While external
> traffic cost
> are of relatively little concern, quick response times often are,
> especially
> in classroom situations where up to several hundred students may look
> at
> the same articles virtually at once.


If we wanted to address such a situation, it sounds like it would be
less complex to just set up a Varnish box (with access to the HTCP
cache-clear packets) on that campus.

--
-bawolff


Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Bryan Davis
On Mon, Nov 30, 2015 at 4:03 PM, Brian Wolff <[hidden email]> wrote:
>
> If we wanted to address such a situation, it sounds like it would be
> less complex to just setup a varnish box (With access to the HTCP
> cache clear packets), on that campus.

This is an idea I've casually thought about but never put any real
work towards. It would be pretty neat to have something similar to the
Netflix Open Connect appliance [0] available for Wikimedia projects.

[0]: https://openconnect.netflix.com/

Bryan
--
Bryan Davis              Wikimedia Foundation    <[hidden email]>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855


Re: Peer-to-peer sharing of the content of Wikipedia through WebRTC

Gabriel Wicke-3
On Mon, Nov 30, 2015 at 4:02 PM, Bryan Davis <[hidden email]> wrote:
> On Mon, Nov 30, 2015 at 4:03 PM, Brian Wolff <[hidden email]> wrote:
>>
>> If we wanted to address such a situation, it sounds like it would be
>> less complex to just setup a varnish box (With access to the HTCP
>> cache clear packets), on that campus.
>
> This is an idea I've casually thought about but never put any real
> work towards. It would be pretty neat to have something similar to the
> Netflix Open Connect appliance [0] available for Wikimedia projects.

This has a very strong "back to the future" ring to it. We started our
caching layers back in 2004 with an eye towards bandwidth / housing
donations from universities and ISPs [1], and indeed benefited from
such donations in Amsterdam (Kennisnet), Paris and Seoul (Yahoo). Of
these, only the Amsterdam PoP has survived, and is now in our own
management.

Gabriel

[1]: http://web.archive.org/web/20040710213535/http://www.aulinx.de/oss/code/wikipedia/
