edition performance

edition performance

wp1080397-lsrs

Dear friends, 

We have been working for some months on a Wikidata project and have run into an issue with edit performance. I began with the Wikidata Java API, but when I tried to increase the edit speed, the Java library throttled the edits and inserted delays, which also reduced the edit throughput.

I then tried editing with Pywikibot, but in my experience that reduced the edit rate even further.

In the end we used the procedure described here:

https://www.mediawiki.org/wiki/API:Edit#Example

combined with multithreading, and we reached a maximum of 10.6 edits per second.
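
For illustration, here is a minimal sketch of this kind of multithreaded client (not our exact code: it assumes a bot-password login and the token/edit flow from API:Edit, and the endpoint, credentials, worker count and page payloads below are placeholders - on a Wikibase instance the POST would typically be action=wbeditentity rather than action=edit):

import threading
from concurrent.futures import ThreadPoolExecutor

import requests

API = "https://example.org/w/api.php"            # placeholder endpoint
USER, PASSWORD = "BotUser@job", "bot-password"   # placeholder bot-password credentials

def make_session():
    """Log in and fetch a CSRF token, following the API:Edit flow."""
    s = requests.Session()
    s.headers["User-Agent"] = "bulk-loader/0.1 (contact@example.org)"
    r = s.get(API, params={"action": "query", "meta": "tokens",
                           "type": "login", "format": "json"})
    login_token = r.json()["query"]["tokens"]["logintoken"]
    s.post(API, data={"action": "login", "lgname": USER, "lgpassword": PASSWORD,
                      "lgtoken": login_token, "format": "json"})
    r = s.get(API, params={"action": "query", "meta": "tokens", "format": "json"})
    return s, r.json()["query"]["tokens"]["csrftoken"]

thread_state = threading.local()

def edit(item):
    """Worker: one action=edit POST per item, reusing a per-thread session."""
    if not hasattr(thread_state, "session"):
        thread_state.session, thread_state.csrf = make_session()
    title, text = item
    r = thread_state.session.post(API, data={
        "action": "edit", "title": title, "text": text,
        "token": thread_state.csrf, "bot": "1", "format": "json"})
    return r.json()

items = [("Page %d" % i, "example content") for i in range(1000)]  # placeholder payload
with ThreadPoolExecutor(max_workers=8) as pool:
    for reply in pool.map(edit, items):
        if "error" in reply:
            print(reply["error"])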

My question is whether anyone has experience achieving a higher edit rate.

Currently we need to write 1,500,000 items, and at this rate such a task would require about 5 working days.

Best regards

Luis Ramos

Senior Java Developer

(Semantic Web Developer)

PST.AG

Jena, Germany. 




Re: edition performance

Valerio Bozzolan
Please note that - AFAIK - parallel requests are not well received.

https://www.mediawiki.org/wiki/API:Etiquette

(You may have a bigger problem now :^)
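
In short, the etiquette page asks for serial requests, a descriptive User-Agent, and the maxlag parameter so the servers can tell your client to back off. A minimal sketch of that pattern (the URL, the 5-second lag threshold and the retry count are just illustrative values):

import time

import requests

API = "https://www.wikidata.org/w/api.php"
session = requests.Session()
session.headers["User-Agent"] = "my-bot/0.1 (operator@example.org)"  # identify the operator

def polite_post(data, max_retries=5):
    """One request at a time; honour maxlag and Retry-After instead of hammering."""
    data = dict(data, format="json", maxlag=5)   # ask the API to refuse work when replag > 5 s
    for _ in range(max_retries):
        r = session.post(API, data=data)
        reply = r.json()
        if reply.get("error", {}).get("code") == "maxlag":
            time.sleep(int(r.headers.get("Retry-After", 5)))  # wait as the server asks
            continue
        return reply
    raise RuntimeError("giving up after repeated maxlag errors")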


Re: edition performance

wp1080397-lsrs
Dear Valerio,

Thanks for the quick answer. If I understood you correctly, our parallel approach to editing is considered inappropriate.
In our case we are aiming to have the data available as soon as possible; once it is loaded we can switch to another approach.

My question is really about the need to load large data sets: in the case of our private instances, we need to load 20,000,000 items for private use, and at a rate of 10 items per second the approach we are following will take about 25 days, with a script writing 24 hours a day. In big-data terms, 20 M is a small data set.

So, I leave an open question:

has anyone managed to achieve a higher edit rate?

Best regards



Luis Ramos
Senior Java Developer
(Semantic Web Developer)
PST.AG
Jena, Germany.


Re: edition performance

Valerio Bozzolan
In order to help you further, may I ask for your Wikidata bot approval discussion?

https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot


Re: edition performance

wp1080397-lsrs
If I understand your request correctly, I cannot provide such a discussion, because I did not take part in any bot-approval discussion; our administrator configured the bots on our private instance.


I hope you can provide some additional support; if you require further information, please let me know and I will answer as soon as possible.


Best regards


Luis Ramos



Luis Ramos
Senior Java Developer
(Semantic Web Developer)
PST.AG
Jena, Germany.


Re: edition performance

Valerio Bozzolan
Thank you for the clarification,

First of all, let me clarify that on your private Wikibase instance - on your own hardware - you can of course do whatever you want and flood your own APIs without asking anyone's permission. So if you have reached a practical edits-per-second limit, you probably want to look for hardware bottlenecks with the help of a sysadmin.
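
For example (just an illustrative sketch, not something I have tested on your setup; the URL and payloads are placeholders): time a batch of sequential edits on the client and compare that with what your sysadmin sees on the server (slow query log, CPU, I/O wait), to tell whether the time goes to MediaWiki/PHP, to MySQL, or to the client itself.

import statistics
import time

import requests

API = "https://wikibase.example.org/w/api.php"   # placeholder: your private instance
session = requests.Session()

def timed_edit(edit_params):
    """Wall-clock seconds for one edit round trip."""
    t0 = time.perf_counter()
    session.post(API, data=edit_params)
    return time.perf_counter() - t0

# edits = [...]                                  # the same payloads your bot already sends
# durations = [timed_edit(e) for e in edits[:200]]
# print("median %.3f s, p95 %.3f s" % (statistics.median(durations),
#                                      sorted(durations)[int(len(durations) * 0.95)]))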

As a note, "in case of fire" you can simply restore your database backup instead of re-running your bot from scratch. (You do have a backup, don't you? :)

Warm wishes


--
E-mail sent from the "K-9 mail" app from F-Droid, installed in my LineageOS device without proprietary Google apps. I'm delivering through my Postfix mailserver installed in a Debian GNU/Linux.

Have fun with software freedom!

[[User:Valerio Bozzolan]]


Re: edition performance

wp1080397-lsrs
Well Valerio, my expertise is in ontologies and knowledge representation, so I hope I am not giving you wrong information.

I do not have a backup, because I have not finished the first load of data, which is about 1.5 M items, around 60 M triples.

At the beginning I realized that adding threads increased the edit rate; with threads we got up to 4 items/second. Then we improved the hardware and, together with the sysadmin, made the following adjustments to our database:

For the database:

We have tested both MariaDB and MySQL. Both have been set up on tmpfs (temporary file storage), with the following settings in mysqld.cnf:

tmpdir                          = /var/lib/mysql/mysqltmp   # temporary tables/files on the tmpfs mount
query_cache_limit               = 0                         # query cache disabled
query_cache_size                = 0
innodb_buffer_pool_size         = 8G                        # keep the working set in memory
innodb_flush_log_at_trx_commit  = 2                         # flush the redo log ~once per second instead of at every commit (faster, less durable)

With these improvements we reached our maximum of 10 items/second.


However, we do not know how to improve further, and we still have to load 30 M items, which looks like a very long task.

I hope this information sheds some light on our use case, and perhaps helps us to improve our work.


Luis






Luis Ramos
Senior Java Developer
(Semantic Web Developer)
PST.AG
Jena, Germany.
