SMW on large sites [Was: Roadmaps and getting and keeping devs]

SMW on large sites [Was: Roadmaps and getting and keeping devs]

Markus Krötzsch-2
[Making this into a new thread]

Hi Krzysztof,

I was already wondering when I would hear from Wikia ...

As you have noticed, running SMW and extensions on large sites (large in
terms of content, or in terms of users) has special requirements.
Typically, we suggest using more conservative settings for querying, so
that long and difficult queries do not occur. Similarly, some SMW
extensions have not been developed for large sites, and can be
problematic in their own right. But your users obviously want to keep
the features that they already have, so we need to find better ways of
addressing your problem.

But first we need to separate concerns a little bit. You mention the
following distinct problems:

(1) Too many DB writes (about 60% in total)

(2) Too many slow queries (about 90% from SMW)

Moreover, your problem is not caused by SMW alone but by a number of
SMW-related extensions. So there will be multiple issues that need
addressing to fix this, and maybe even in multiple extensions.

Let us first see how big the impact of the extensions you mention could
be. Semantic Forms mainly leads to some additional reads (apparently no
problem for you); the total number could possibly be reduced. It may
also have some effect on query activity if certain autocompletion
features are used. But otherwise I think it is unlikely to be the root
of the problem. Semantic Drilldown might be more of a problem regarding
complex queries. But it uses its own SQL queries, so it should be
possible to find out how much of (2) comes from this extension. Semantic
Drilldown should not contribute to (1).
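
One way to find out how much of (2) comes from each extension is to group slow-log entries by the caller comment that MediaWiki prepends to the SQL it sends. The following is only a minimal sketch: it assumes the log has already been reduced to one SQL statement per string and that callers appear as `/* Class::method ... */`. The prefix-to-extension map and the function name `attribute` are hypothetical and would need adjusting to the class names that actually show up in the logs.

```python
import re
from collections import Counter

# Hypothetical mapping from caller-name prefixes to extensions; adjust it
# to the class names that actually appear in your slow query log.
PREFIXES = {
    "SMW": "Semantic MediaWiki",
    "SD": "Semantic Drilldown",
    "SF": "Semantic Forms",
}

# Captures the leading letters of the caller name inside the SQL comment.
CALLER = re.compile(r"/\*\s*([A-Za-z_]+)")


def attribute(statements):
    """Count statements per extension, based on the leading caller comment."""
    counts = Counter()
    for sql in statements:
        match = CALLER.search(sql)
        caller = match.group(1) if match else ""
        # Try longer prefixes first so a short prefix cannot shadow a long one.
        for prefix in sorted(PREFIXES, key=len, reverse=True):
            if caller.startswith(prefix):
                counts[PREFIXES[prefix]] += 1
                break
        else:
            # No known prefix matched (or no caller comment at all).
            counts["other"] += 1
    return counts
```

Dividing each count by the total number of statements gives a rough share per extension; statements without a caller comment end up under "other".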

Are there any other extensions that use SMW on your site?

Regarding SMW, I have some concrete ideas on what could be done for (1)
and (2) but this will need more careful consideration first. I would be
grateful if you could help to track down the cause of the problem, but I
am afraid that the changes in SMW core will still need to be done or at
least reviewed carefully by myself -- which makes me kind of a
bottleneck for the SMW part of your problem. I need to think about the
required work a little further before I can promise anything.

Regards,

Markus


On 22/02/2011 22:38, Krzysztof Krzyżaniak wrote:
 > I think this would be the right place to jump in.
 >
 > Hello, my name is Krzysztof Krzyżaniak a.k.a. eloy and I work for Wikia
 > Inc as backend team leader. We are probably (correct me if I am wrong)
 > one of the biggest users of the Semantic MediaWiki suite. We currently have
 > it enabled on about 100 wikis, for example on familypedia.wikia.com,
 > yugioh.wikia.com or www.wowwiki.com (but also on wikis which you
 > probably wouldn't suspect of SMW interest, like glee.wikia.com or
 > madmen.wikia.com). We would like to expand the use of SMW on Wikia
 > (for example lyrics would love it) but currently we cannot afford it
 > for performance reasons. For example, our first cluster contains
 > about 30,000 wikis, mostly the biggest ones. About 60% of writes in
 > databases came from SMW extensions (SemanticMediawiki,
 > SemanticDrilldown, SemanticForms), and about 90% of queries in the slow
 > logs are from SMW.
 >
 > I am here to find a way of scaling SMW on our wikis. But I also think
 > that it will benefit every SMW user because we want to help
 > improve SMW.
 >
 > What you can expect:
 > - "real world" cases, actually a lot of them :)
 > - bugs :) (filed in Bugzilla of course)
 > - bug fixes and patches (either as diff or direct svn commits if you
 > prefer that way)
 > - questions
 >
 > We can offer engineering hours and testbeds.
 >
 > To start with, I have a question about the roadmap: SMW Light - how
 > complete is it? What's missing? When do you expect it to be ready? How
 > can we help?
 >
 >     eloy
 >

_______________________________________________
Semediawiki-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/semediawiki-devel

Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Yaron Koren
Hi,

I agree that it's great to hear from Wikia, and also great to know that Wikia is willing to put in some development time and effort to help with SMW. A few thoughts:

- Wikia has already contributed somewhat to improving performance - I've been talking for a while to Tim Quievryn (who was at the Boston SMWCon last year), and his feedback helped lead to the faster handling of red links in Semantic Forms that was added in version 2.0.8.

- Semantic Drilldown might actually be contributing to DB writes - it creates a temporary database table on every hit to Special:BrowseData. (I don't know if temporary tables get counted.)

- This might not be the right place to discuss the specifics of the "SMW light" initiative, but it's my personal belief that the best approach to it is to do the triple-store integration, [1] so that SMW can use an RDF triple-store directly to store its data, rather than trying to improve or limit SMW's queries. It would theoretically speed up queries, but, more importantly, even if it didn't, it would basically eliminate SMW's impact on the wiki's database. That's just my personal opinion, though - I'm not involved in either of those projects.


-Yaron


--
WikiWorks · MediaWiki Consulting · http://wikiworks.com


Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Markus Krötzsch-2
On 23/02/2011 18:17, Yaron Koren wrote:

> Hi,
>
> I agree that it's great to hear from Wikia, and also great to know that
> Wikia is willing to put in some development time and effort to help with
> SMW. A few thoughts:
>
> - Wikia has already contributed somewhat to improving performance - I've
> been talking for a while to Tim Quievryn (who was at the Boston SMWCon
> last year), and his feedback helped lead to the faster handling of red
> links in Semantic Forms that was added in version 2.0.8.
>
> - Semantic Drilldown might actually be contributing to DB writes - it
> creates a temporary database table on every hit to Special:BrowseData.
> (I don't know if temporary tables get counted.)

Oh. This should be checked.

>
> - This might not be the right place to discuss the specifics of the "SMW
> light" initiative, but it's my personal belief that the best approach to
> it is to do the triple-store integration, [1] so that SMW can use an RDF
> triple-store directly to store its data, rather than trying to improve
> or limit SMW's queries. It would theoretically speed up queries, but,
> more importantly, even if it didn't, it would basically eliminate SMW's
> impact on the wiki's database. That's just my personal opinion, though -
> I'm not involved in either of those projects.

Yes, this is also what I had in mind. I have been planning the RDF store
integration for some time now but have not yet managed to make the
necessary adjustments. This effort is still related to the SMW Light
initiative since the SMW Light SQL backend would be used (it is much
simpler than the SMW backend which has to do all the queries). I think
in this combination a significant performance gain will be possible,
since one can also streamline the DB writing in this simple store. But
you are right that SMW Light is really not about improving performance
but about reducing code size (and quite a lot of the functionality, too).

I need to do quite a bit more integration work to make sure that all
current features of SMW are fully supported via an RDF store backend. I
have a good concept of what I need to do, but I have not yet found the
time to actually do it. (But explaining it to someone else in sufficient
detail would probably take just as long, and I would still need to
review the code.)

Another interesting option with RDF store integration would be to use an
RDF-based faceted browser instead of Semantic Drilldown (if it is
actually found to cause problems on such large/high-traffic wikis).

- Markus



Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Jeroen De Dauw-2
Hey,

Since this discussion is about performance and changes to the SMW storage layer, I'd like to bring up an issue I've been having as Semantic Maps dev.

To efficiently do spatial operations on geographical data stored by SMW, support for the spatial extensions in MySQL and for PostGIS in PostgreSQL is needed. Currently there is no way to make use of these database extensions, as the SMW storage layer does not allow for:
* specifying what type of index to place on a field (it only allows specifying a field should be indexed)
* putting SQL functions in insert and select statements (which is needed to insert or select geographical entities)
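
To make the first point concrete, here is a rough sketch of a field description that carries an index type, together with the DDL a storage layer could derive from it. The `Field` class and the `create_table_sql` helper are hypothetical and are not SMW's API; the only thing taken as given is MySQL's `SPATIAL INDEX (column)` syntax, which requires the column to be declared NOT NULL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """Hypothetical field description used only for this illustration."""
    name: str
    sql_type: str
    index: Optional[str] = None  # None, "INDEX" or "SPATIAL INDEX"


def create_table_sql(table, fields):
    """Render a MySQL-style CREATE TABLE statement from field descriptions."""
    # Column definitions first, then one index clause per indexed field.
    parts = [f"{f.name} {f.sql_type}" for f in fields]
    parts += [f"{f.index} ({f.name})" for f in fields if f.index]
    return f"CREATE TABLE {table} ({', '.join(parts)})"
```

The point of the sketch is only that the index type travels with the field description instead of being a yes/no flag, so the storage layer can decide which kind of index to create.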

I'm not sure to what extent supporting this is possible, but it would make a huge difference for working with geographical data in SMW. So it'd be nice if this was taken into consideration when modifications to the storage layer are made.

In any case, the current distance query in Semantic Maps already performs way better than the one in older versions of SMW (which was really really ... really bad). More incentive for Wikia to update :)

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--


Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Thomas Fellows
Hey all,

I would also like to voice my support for Jeroen's requests -- these features are indeed essential to geographic queries (being able to sort by distance).  I believe it can be done in the same manner in which 'where' statements can already be modified by extensions (I wrote a beta expansion of SM and SMW to do this a few months ago, using the haversine formula rather than MySQL's built-in GIS features).
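
For reference, the haversine formula mentioned above gives the great-circle distance between two latitude/longitude pairs. The following is a small sketch, assuming a spherical Earth with a mean radius of 6371 km; the function name is illustrative and this is not the code from that beta work.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Sorting by distance then amounts to evaluating this expression for every candidate row, which is why narrowing the candidates first matters so much on large tables.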

Right now I have an unadulterated SMW installation with about 30 million pages, about 23 million of which have geographic coordinates.  Distance queries across them take about 240 seconds when loaded for the first time (lots of time is spent building temp tables). Running dual i7, 12 GB RAM (8 GB InnoDB buffer), 4x 7200 rpm 1 TB drives in RAID 10.

We are beginning to look into external stores - but with so much data and other work the testing and transfer would take some time.

We've also had to remove Semantic Drilldown and turn off auto-completion in Semantic Forms (be careful with Type:Page, which automatically uses auto-completion).

-Tom

p.s. Apologies for the really really bad distance query in the original SMW; it was my first time working with SMW, and it was impossible to do a bounding box with lat/lon being stored in the same field in MySQL ;)
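
With latitude and longitude in separate, indexed columns, a bounding box can narrow the candidates with plain range comparisons before any exact distance is computed. A sketch under a spherical-Earth assumption follows; the function is hypothetical, and it deliberately ignores the poles and the 180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation (assumption)


def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) in degrees around a point.

    Not valid close to the poles or across the 180° meridian.
    """
    ang = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(ang)
    # Longitude lines converge towards the poles, so the box gets wider
    # in degrees of longitude as the latitude increases.
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon
```

The four bounds map directly onto two `BETWEEN` conditions, one per column; an exact distance check is then only needed for the rows that survive the box.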


Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Thomas Fellows
Sorry for the second email - I also forgot to mention that we have very few users right now (internal company wiki), and that we had a major problem with sub-properties slowing everything down to an unusable state, but not the time to figure out why (something like having 3+ sub-properties broke everything).

-tom


Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Markus Krötzsch-2
On 23/02/2011 19:09, Jeroen De Dauw wrote:

> Hey,
>
> Since this discussion is about performance and changes to the SMW
> storage layer, I'd like to bring up an issue I've been having as
> Semantic Maps dev.
>
> To efficiently do spatial operations on geographical data stored by SMW,
> support for spatial extensions in MySQL and PostGIS in PostgreSQL is
> needed. Currently there is no way to make use of these database
> extensions, as the SMW storage layer does not allow for:
> * specifying what type of index to place on a field (it only allows
> specifying a field should be indexed)
> * putting SQL functions in insert and select statements (which is needed
> to insert or select geographical entities)

If this is indeed so special, and we see a big need for changing this,
then we may need to consider tying geo support more closely into the SMW
architecture. I would be interested to find out if this type of distance
query is an issue in some Wikia wiki.

We have a general architectural problem of implementation independence
vs. performance here. We could probably make SMW run faster on MySQL if
we committed to supporting only (My)SQL backends. At the same time,
MySQL is largely unsuitable for most other types of queries that we want
to answer, leading to the slow queries that some wikis see (MySQL simply
dies completely when queries reach a certain complexity -- AFAICT it
loses most of its optimization capabilities for queries that involve a
single table many times; in particular it seems to ignore query
structure in favour of table- or column-based selectivity measures that
don't help at all if the same table is used many times). But to be fair,
our queries are quite unusual for classical RDBMS applications.
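
To illustrate what "a single table many times" means in practice: if every property value lives in one generic table, a conjunction of n conditions turns into n joins of that table with itself. The sketch below generates such SQL. The table and column names (`props`, `subject_id`, `property`, `value`) are invented for illustration and are not SMW's actual schema; `%s` is the placeholder style used by common MySQL drivers for Python.

```python
def conjunctive_query_sql(conditions, table="props"):
    """Build a SELECT that self-joins `table` once per (property, value) pair.

    Returns the SQL text and the parameter list for a DB-API cursor.
    """
    if not conditions:
        raise ValueError("at least one condition is required")
    joins, wheres, params = [], [], []
    for i, (prop, value) in enumerate(conditions):
        alias = f"t{i}"
        if i == 0:
            joins.append(f"FROM {table} {alias}")
        else:
            # Every further condition adds another copy of the same table.
            joins.append(
                f"JOIN {table} {alias} ON {alias}.subject_id = t0.subject_id"
            )
        wheres.append(f"{alias}.property = %s AND {alias}.value = %s")
        params += [prop, value]
    sql = (
        f"SELECT DISTINCT t0.subject_id {' '.join(joins)} "
        f"WHERE {' AND '.join(wheres)}"
    )
    return sql, params
```

Since all aliases refer to the same table, per-table statistics are identical for every one of them, which is consistent with the observation above that table- or column-based selectivity measures give the optimizer little to go on.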

We have reasons to hope that RDF stores would provide much better
performance on such queries, but these systems have a completely
different data model and different capabilities. It should be
appreciated that the current architecture allows such paradigm shifts in
the backend to happen without code changes in most parts of SMW and in
most of its extensions. Since it seems unavoidable to move to RDF stores
for higher query performance, it might not be very useful to try and
exploit additional MySQL optimizations now (this would only help sites
which have mostly simple queries but many distance computations which
are currently too slow).

If one looks for a more general solution, the question then is how
specific the MySQL coordinate format actually is. To keep the current
flexibility of architecture, it might be necessary to make the SQLStore
implementation aware of geo coordinates (this could also be done with
hooks). But we must avoid making the higher levels of the API (e.g.
datavalue implementations) specific to MySQL. I think there are
solutions that would meet these requirements; they just need to be
designed and implemented. The main point is that higher levels should
exchange data in standard formats (e.g. floating point numbers for
latitude and longitude) and MySQL-specific syntax (e.g. some other
syntactic format for geo coords) should only be created in the
storage layer.
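
A minimal sketch of this layering, under stated assumptions: both class names
below are hypothetical, and the WKT text POINT(...) wrapped in MySQL's
GeomFromText() function (ST_GeomFromText() in newer MySQL versions) merely
stands in for whatever backend-specific coordinate syntax is actually needed.

```php
<?php
// Hypothetical higher-level value: it only knows plain floating point numbers.
class GeoCoordinate {
    public $latitude;
    public $longitude;

    public function __construct( $latitude, $longitude ) {
        $this->latitude = (float)$latitude;
        $this->longitude = (float)$longitude;
    }
}

// Hypothetical storage-layer helper: the only place that knows MySQL syntax.
class MySQLGeoFormatter {
    // Returns an SQL expression for use in INSERT or SELECT statements.
    public static function toSqlExpression( GeoCoordinate $coordinate ) {
        // %F is locale-independent, so the decimal separator is always ".".
        // WKT points are written as "x y", i.e. longitude first.
        $wkt = sprintf( 'POINT(%F %F)', $coordinate->longitude, $coordinate->latitude );
        return "GeomFromText('" . $wkt . "')";
    }
}

echo MySQLGeoFormatter::toSqlExpression( new GeoCoordinate( 52.52, 13.405 ) ), "\n";
```

A different backend (for example an RDF store) would get its own formatter,
while GeoCoordinate and everything above it stays unchanged.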

- Markus


>
> I'm not sure to what extent supporting this is possible, but it would
> make a huge difference for working with geographical data in SMW. So it'd
> be nice if this was taken into consideration when modifications to the
> storage layer are made.
>
> In any case, the current distance query in Semantic Maps already
> performs way better than the one in older versions of SMW (which was
> really really ... really bad). More incentive for Wikia to update :)
>
> Cheers
>
> --
> Jeroen De Dauw
> http://www.bn2vs.com
> Don't panic. Don't be evil.
> --
>
>
>
> ------------------------------------------------------------------------------
> Free Software Download: Index, Search&  Analyze Logs and other IT data in
> Real-Time with Splunk. Collect, index and harness all the fast moving IT data
> generated by your applications, servers and devices whether physical, virtual
> or in the cloud. Deliver compliance at lower cost and gain new business
> insights. http://p.sf.net/sfu/splunk-dev2dev
>
>
>
> _______________________________________________
> Semediawiki-devel mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/semediawiki-devel



Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Krzysztof Krzyżaniak-3
In reply to this post by Yaron Koren
On 23.02.2011 19:17, Yaron Koren wrote:

> Hi,
>
> I agree that it's great to hear from Wikia, and also great to know that
> Wikia is willing to put in some development time and effort to help with
> SMW. A few thoughts:
>
> - Wikia has already contributed somewhat to improving performance - I've
> been talking for a while to Tim Quievryn (who was at the Boston SMWCon
> last year), and his feedback helped lead to the faster handling of red
> links in Semantic Forms that was added in version 2.0.8.
>
> - Semantic Drilldown might actually be contributing to DB writes - it
> creates a temporary database table on every hit to Special:BrowseData.
> (I don't know if temporary tables get counted.)

Yes, creating tables counts as a write in our case.


> - This might not be the right place to discuss the specifics of the "SMW
> light" initiative, but it's my personal belief that the best approach to
> it is to do the triple-store integration, [1] so that SMW can use an RDF
> triple-store directly to store its data, rather than trying to improve
> or limit SMW's queries. It would theoretically speed up queries, but,
> more importantly, even if it didn't, it would basically eliminate SMW's
> impact on the wiki's database. That's just my personal opinion, though -
> I'm not involved in either of those projects.
>
> [1] http://semantic-mediawiki.org/wiki/SPARQL_and_RDF_stores_for_SMW
>
> -Yaron
>
>
> --
> WikiWorks · MediaWiki Consulting · http://wikiworks.com



Re: SMW on large sites [Was: Roadmaps and getting and keeping devs]

Krzysztof Krzyżaniak-3
In reply to this post by Markus Krötzsch-2
On 23.02.2011 15:16, Markus Krötzsch wrote:

> [Making this into a new thread]
>
> Hi Krzysztof,
>
> I was already wondering when I would hear from Wikia ...
>
> As you have noticed, running SMW and extensions on large sites (large in
> terms of content, or in terms of users) has special requirements.
> Typically, we suggest to use more conservative settings for querying, so
> that long and difficult queries do not occur. Similarly, some SMW
> extensions have not been developed for large sites, and can be
> problematic in their own right. But your users obviously want to keep
> the features that they already have, so we need to find better ways of
> addressing your problem.
>
> But first we need to separate concerns a little bit. You mention the
> following distinct problems:
>
> (1) Too many DB writes (about 60% in total)
>
> (2) Too many slow queries (about 90% from SMW)
>
> Moreover, your problem is not caused by SMW alone but by a number of
> SMW-related extensions. So there will be multiple issues that need
> addressing to fix this, and maybe even in multiple extensions.
>
> Let us first see how big the impact of the extensions you mention could
> be. Semantic Forms mainly leads to some additional reads (apparently no
> problem for you); the total number could possibly be reduced. It may
> also have some effect on query activity if certain autocompletion
> features are used. But otherwise I think it is unlikely to be the root
> of the problem. Semantic Drilldown might be more of a problem regarding
> complex queries. But it uses its own SQL queries, so it should be
> possible to find out how much of (2) comes from this extension. Semantic
> Drilldown should not contribute to (1).
>
> Are there any other extensions that use SMW on your site?
We use:

- SemanticMediawiki
- SemanticDrilldown
- SemanticForms
- SemanticGallery
- SemanticResultFormats
- SemanticMaps

Not all extensions are enabled on all wikis with SMW; the most common
configuration is SemanticMediawiki + SemanticForms + SemanticDrilldown.

> Regarding SMW, I have some concrete ideas on what could be done for (1)
> and (2) but this will need more careful consideration first. I am
> grateful if you can help to track down the cause of the problem, but I
> am afraid that the changes in SMW core will still need to be done or at
> least reviewed carefully by myself -- which makes me kind of a
> bottleneck for the SMW part of your problem. I need to think about the
> required work a little further before I can promise anything.

Of course.

My short-term solution is to separate the SMW tables from the 'regular' wiki
database. There is only one condition for this to work: SMW tables must not be
joined with regular tables. So far I have not found any such joins in the
current sources. This of course uses an additional database connection, but
that is not a problem for us. It requires some changes, though I am not sure
whether they are applicable for a wider audience. To separate the databases I
use wfGetDB( DB_MASTER|DB_SLAVE, 'smw' ) for semantic tables and
wfGetDB( DB_MASTER|DB_SLAVE ) for "local" tables (like "page" or "category"),
where DB_MASTER|DB_SLAVE stands for whichever of the two constants fits the
operation. We then have our own implementation of LBFactory_Multi, which
switches connections based on the groups parameter of wfGetDB. It would be nice
if the SMWSQLStore2 class had two static methods (or one parametrized method);
in the stock version they would be something like

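// Connection for the semantic (SMW) tables, selected via the 'smw' group.
public static function getSMWDB( $type ) {
        return wfGetDB( $type, 'smw' );
}

// Connection for "local" wiki tables such as "page" or "category".
public static function getLocalDB( $type ) {
        return wfGetDB( $type );
}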

Then it would be easier for us to merge our changes with upstream changes.
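
To make the idea of switching connections by group concrete, here is a
self-contained sketch. GroupConnectionChooser is a hypothetical stand-in and
does not use LBFactory_Multi or any other MediaWiki class; the server names
are made up for the example.

```php
<?php
// Hypothetical stand-in for a load balancer factory that picks a database
// by query group; it does not use any MediaWiki classes.
class GroupConnectionChooser {
    private $servers;

    // $servers maps a group name to a server name; '' is the default group.
    public function __construct( array $servers ) {
        if ( !isset( $servers[''] ) ) {
            throw new InvalidArgumentException( 'A default server is required.' );
        }
        $this->servers = $servers;
    }

    public function getServer( $group = '' ) {
        // Unknown groups fall back to the regular wiki database.
        return isset( $this->servers[$group] ) ? $this->servers[$group] : $this->servers[''];
    }
}

$chooser = new GroupConnectionChooser( array( '' => 'wiki-db', 'smw' => 'smw-db' ) );
echo $chooser->getServer( 'smw' ), "\n"; // used for semantic tables
echo $chooser->getServer(), "\n";        // used for "page", "category", ...
```

In a stock installation both groups would simply resolve to the same server,
so callers would not notice the difference.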


Regards,

   eloy

> Regards,
>
> Markus
>
>
> On 22/02/2011 22:38, Krzysztof Krzyżaniak wrote:
>   >  I think this would be the right place to jump in.
>   >
>   >  Hello, my name is Krzysztof Krzyżaniak a.k.a. eloy and I work for Wikia
>   >  Inc as backend team leader. We are probably (correct me if I am wrong)
>   >  one of the biggest users of the Semantic MediaWiki suite. We currently
>   >  have it enabled on about 100 wikis, for example on familypedia.wikia.com,
>   >  yugioh.wikia.com or www.wowwiki.com (but also on wikis which you
>   >  probably would not suspect of SMW interest, like glee.wikia.com or
>   >  madmen.wikia.com). We would like to expand the presence of SMW on Wikia
>   >  (for example lyrics would love it) but currently we cannot afford it
>   >  for performance reasons. For example, our first cluster contains
>   >  about 30,000 wikis, mostly the biggest ones. About 60% of writes in
>   >  databases come from SMW extensions (SemanticMediawiki,
>   >  SemanticDrilldown, SemanticForms), and about 90% of queries in the slow
>   >  logs are from SMW.
>   >
>   >  I am here to find a way of scaling SMW on our wikis. But I also think
>   >  that it will benefit every SMW user because we want to help
>   >  improve SMW.
>   >
>   >  What you can expect:
>   >  - "real world" cases, actually a lot of them :)
>   >  - bugs :) (filed in bugzilla of course)
>   >  - bug fixes and patches (either as diff or direct svn commits if you
>   >  prefer that way)
>   >  - questions
>   >
>   >  We can offer engineering hours and testbeds.
>   >
>   >  For a start I have a question about the Roadmap: SMW light - how complete
>   >  is it? What's missing? When do you expect it to be ready? How can we help?
>   >
>   >      eloy
>   >

