Wikipedia Suggest

Wikipedia Suggest

Julien Lemoine
Hello,

I have written a "Google Suggest"-like service for Wikipedia under the GPL
licence [1].
I am not sure whether this project will interest you, but I am open
to all comments from your community.

I have set up a simple web site for English/French at
http://suggest.speedblue.org; all the software is available in the download
section.

Best Regards.
Julien Lemoine

[1] If you want me to change the licence to something other than GPL,
please let me know.

_______________________________________________
Wikitech-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Re: Wikipedia Suggest

Steve Bennett-4
On 8/2/06, Julien Lemoine <[hidden email]> wrote:
> Hello,
>
> I have written a "Google Suggest"-like service for Wikipedia under the GPL
> licence [1].
> I am not sure whether this project will interest you, but I am open
> to all comments from your community.

This is absolutely fantastic, and would greatly benefit Wikipedia if
cleaned up a little and integrated. What data is it running on?

Also, what is the number on the right side, the number of incoming
links to the page?

Again, brilliant - would make the search feature far, far more useful.

Steve

Re: Wikipedia Suggest

Julien Lemoine
Hello,

Steve Bennett wrote:
> This is absolutely fantastic, and would greatly benefit Wikipedia if
> cleaned up a little and integrated. What data is it running on?
>  
First of all, I grabbed the whole content of the English/French Wikipedia in
June 2006. Then everything is compiled into a finite-state machine (about
200 MB for the English Wikipedia).
> Also, what is the number on the right side, the number of incoming
> links to the page?
>
>  
Yes, you guessed right :)
> Again, brilliant - would make the search feature far, far more useful.
>  
Thank you.
Best Regards.
Julien Lemoine


Re: Wikipedia Suggest

Steve Bennett-4
On 8/2/06, Julien Lemoine <[hidden email]> wrote:
> First of all, I grabbed the whole content of the English/French Wikipedia
> in June 2006. Then everything is compiled into a finite-state machine
> (about 200 MB for the English Wikipedia).

How long does the compiling take? How close to a real-time update
could it be? This seems to be a recurring problem with search etc at
Wikipedia - the fact that it's always at best a few days out of date,
and sometimes months. Then with the problems with toolserver...

Steve

Re: Wikipedia Suggest

Julien Lemoine
Hello Steve,

Steve Bennett wrote:

> On 8/2/06, Julien Lemoine <[hidden email]> wrote:
>  
>> First of all, I grabbed the whole content of the English/French Wikipedia
>> in June 2006. Then everything is compiled into a finite-state machine
>> (about 200 MB for the English Wikipedia).
>>    
>
> How long does the compiling take? How close to a real-time update
> could it be? This seems to be a recurring problem with search etc at
> Wikipedia - the fact that it's always at best a few days out of date,
> and sometimes months. Then with the problems with toolserver...
>
>  
On my computer (Pentium D930, SATA HD, 1 GB RAM), it took about half an
hour for the 1.2M English articles (I can give you the exact
processing time if you want). This time includes reading all the
files from disk.

Best Regards.
Julien Lemoine


Re: Wikipedia Suggest

Steve Bennett-4
On 8/2/06, Julien Lemoine <[hidden email]> wrote:
> On my computer (Pentium D930, SATA HD, 1 GB RAM), it took about half an
> hour for the 1.2M English articles (I can give you the exact
> processing time if you want). This time includes reading all the
> files from disk.

I would be interested to hear from the server admins on the
feasibility of integrating this tool in Wikipedia - particularly that
compilation time.

Steve

Re: Wikipedia Suggest

Simetrical
On 8/2/06, Steve Bennett <[hidden email]> wrote:
> I would be interested to hear from the server admins on the
> feasibility of integrating this tool in Wikipedia - particularly that
> compilation time.

I strongly suspect that the number of runtime cycles this could take
is a much greater concern.  Half an hour to compile the list isn't far
from the length some special-page indices take to build, IIRC.

Re: Wikipedia Suggest

Arne 'Timwi' Heizmann
In reply to this post by Julien Lemoine
Julien Lemoine wrote:

> Hello,
>
> Steve Bennett wrote:
>
>>This is absolutely fantastic, and would greatly benefit Wikipedia if
>>cleaned up a little and integrated. What data is it running on?
>
> First of all, I grabbed the whole content of the English/French Wikipedia
> in June 2006. Then everything is compiled into a finite-state machine
> (about 200 MB for the English Wikipedia).

Can you explain in greater detail what your finite-state machine looks
like? Is it like a directed graph?

Timwi


Re: Wikipedia Suggest

Julien Lemoine
In reply to this post by Simetrical
Hello,

Simetrical wrote:

> On 8/2/06, Steve Bennett <[hidden email]> wrote:
>  
>> I would be interested to hear from the server admins on the
>> feasibility of integrating this tool in Wikipedia - particularly that
>> compilation time.
>>    
>
> I strongly suspect that the number of runtime cycles this could take
> is a much greater concern.  Half an hour to compile the list isn't far
> from the length some special-page indices take to build, IIRC.
>  
Here is the time output of the compiler for the analysis of the English
Wikipedia articles:
622.65s user 53.69s system 70% cpu 15:58.91 total
95% of this time (the total was about 16 minutes) was spent reading the
files from disk (after a fresh reboot, so nothing was in the cache).

The algorithm uses very little CPU power. If, for example, all the needed
data were already in memory, the compiled automaton could be obtained in
about 1 minute.

Best Regards.
Julien Lemoine


Re: Wikipedia Suggest

Julien Lemoine
In reply to this post by Arne 'Timwi' Heizmann
Hello,

Timwi wrote:
> Can you explain in greater detail what your finite-state machine looks
> like? Is it like a directed graph?
>
>  
Yes, it is a directed graph; each transition from one node to another is
labeled with one UTF-8 character.
The graph is deterministic: a node cannot have two identical outgoing
transitions.
For each article I added its name to the automaton, with the final node
containing a pointer to the article's details (URL, name, number of
links).
After all the article names were added to the automaton, I computed, for
each node, a table of the best completions: pointers to the articles with
the highest number of links.

When I say pointers, I mean indices into a vector storing all the details
about the articles.

You can find a small example (figure) in my FAQ:
http://suggest.speedblue.org/FAQ.php
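The structure described above can be sketched roughly as follows. This is only an illustration of the idea (the actual service is written in C++; the `Article`, `Node`, `build`, and `suggest` names are invented for this example): a deterministic trie whose transitions are single characters, whose final nodes index into a vector of article details, and where every node carries a precomputed list of the best completions below it, ranked by incoming-link count.

```python
# Illustrative sketch only -- not the service's actual code.
from dataclasses import dataclass, field

@dataclass
class Article:
    name: str
    url: str
    nb_links: int      # number of incoming links, used for ranking

@dataclass
class Node:
    children: dict = field(default_factory=dict)  # char -> Node (deterministic)
    article: int = -1   # index into the article vector; -1 if not a final node
    best: list = field(default_factory=list)      # article indices, best first

def build(articles, k=5):
    root = Node()
    # Insert every article name; the final node points at the article.
    for i, art in enumerate(articles):
        node = root
        for ch in art.name:
            node = node.children.setdefault(ch, Node())
        node.article = i
    # Second pass: propagate a top-k completion table up to every node.
    def fill(node):
        cands = [node.article] if node.article >= 0 else []
        for child in node.children.values():
            cands += fill(child)
        node.best = sorted(cands, key=lambda i: -articles[i].nb_links)[:k]
        return node.best
    fill(root)
    return root

def suggest(root, articles, prefix):
    node = root
    for ch in prefix:   # walk the deterministic transitions
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [articles[i].name for i in node.best]
```

Because the top-k table is precomputed per node, answering a query is just one walk down the prefix plus a constant-size lookup, which matches the "very little CPU at query time" behaviour described in the thread.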

I can give you more details about the number of nodes/transitions in the
automaton if you want.
I hope this gives you a more precise idea of how it works.

Please don't hesitate to ask for more details.
Best Regards.
Julien Lemoine


Re: Wikipedia Suggest

Nick Jenkins
In reply to this post by Julien Lemoine
Hi Julien,

> I have written a "Google Suggest"-like service for Wikipedia under the GPL
> licence [1].
> I am not sure whether this project will interest you, but I am open
> to all comments from your community.

Yes!! Good stuff!

== UI stuff ==

The only two constructive suggestions for the User-Interface that I
would make are:

1) If the user presses 'Enter' in the search textbox while typing out a
query, automatically choose/open/redirect to the first item in the list.
That way I can type out what I want and press Enter to open the first
link once I've typed enough to get it to the top of the list, all
without using the mouse.

2) Allow the user to press the down/up arrows to select/highlight a
specified entry on the list (including but not limited to the first
item), and press enter to open it. That way again the user can be lazy
and can select a link without using the mouse, and without typing out
the full title.

== More Technical stuff ==

1) How do you handle pages with the same title, but different
capitalization? They're rare, but they do occur. My suspicion from
scanning Analyzer.cpp is that you just take the most popular. However,
if it's for search, it would be best to include everything (I think).

2) It doesn't seem to include redirects. For example, when I search
for "Formula weight", it's not listed, but on the EN Wikipedia
"Formula weight" is a redirect to "Atomic mass". It would definitely
be better to include redirects (in my personal opinion).

However the downside of including these two things is that the amount
of data that you need to store goes up. I've actually had a go at a
very similar problem (storing a memory index of all article names, but
meeting the two conditions specified above, plus for redirects I would
also store the name of the article it redirected to [something which
could potentially maybe also be useful for your suggest service if you
wanted to show this information too]). However, this was in PHP, and it was
a complete memory hog (think > 1 GB for the memory index). My solution
(since it was only for me) was to just "buy more RAM", however I like
your approach of getting more efficient. By the way, the reason I was
doing this was for suggesting links that could be made in wiki text -
just so you know I'm not in competition with you, but that the
problems we face are similar in some ways, and could maybe benefit
from a common solution.

How big would a memory index be that had these properties (i.e.
including all NS:0 articles/redirects, and maybe including the targets
for redirects)?

All the best,
Nick.

Re: Wikipedia Suggest

Julien Lemoine
Hi Nick,

* Nick Jenkins <[hidden email]> [2006-08-03 16:16:23 +1000]:

> == UI stuff ==
>
> The only two constructive suggestions for the User-Interface that I
> would make are:
>
> 1) If the user presses 'Enter' in the search textbox while typing out a
> query, automatically choose/open/redirect to the first item in the list.
> That way I can type out what I want and press Enter to open the first
> link once I've typed enough to get it to the top of the list, all
> without using the mouse.
>
> 2) Allow the user to press the down/up arrows to select/highlight a
> specified entry on the list (including but not limited to the first
> item), and press enter to open it. That way again the user can be lazy
> and can select a link without using the mouse, and without typing out
> the full title.

Your ideas are good; I have added them to my TODO list :)

> == More Technical stuff ==
>
> 1) How do you handle pages with the same title, but different
> capitalization? They're rare, but they do occur. My suspicion from
> scanning Analyzer.cpp is that you just take the most popular. However,
> if it's for search, it would be best to include everything (I think).

Yes, you are right: the most popular one is kept. I wanted to have a
case-insensitive search, and for the moment I have only one article per
final node. But it would be better to include everything; I have added
this to my TODO list too.

> 2) It doesn't seem to include redirects. For example, when I search
> for "Formula weight", it's not listed, but on the EN Wikipedia
> "Formula weight" is a redirect to "Atomic mass". It would definitely
> be better to include redirects (in my personal opinion).

I will generate a new index with redirects and give you the size
before/after this evening.

> However the downside of including these two things is that the amount
> of data that you need to store goes up. I've actually had a go at a
> very similar problem (storing a memory index of all article names, but
> meeting the two conditions specified above, plus for redirects I would
> also store the name of the article it redirected to [something which
> could potentially maybe also be useful for your suggest service if you
> wanted to show this information too]). However, this was in PHP, and it was
> a complete memory hog (think > 1 GB for the memory index). My solution
> (since it was only for me) was to just "buy more RAM", however I like
> your approach of getting more efficient. By the way, the reason I was
> doing this was for suggesting links that could be made in wiki text -
> just so you know I'm not in competition with you, but that the
> problems we face are similar in some ways, and could maybe benefit
> from a common solution.
> How big would a memory index be that had these properties (i.e.
> including all NS:0 articles/redirects, and maybe including the targets
> for redirects)?

The current automaton (index) of all Wikipedia articles (without
redirects) needs 127 MB; I don't think adding redirects will increase its
size much. I will give you the exact size with redirects this evening :)
(Paris time).

Best Regards.
Julien Lemoine


Re: Wikipedia Suggest

Evan Martin-2
In reply to this post by Arne 'Timwi' Heizmann
On 8/3/06, Timwi <[hidden email]> wrote:
> Julien Lemoine wrote:
> > First of all, I grabbed the whole content of the English/French Wikipedia
> > in June 2006. Then everything is compiled into a finite-state machine
> > (about 200 MB for the English Wikipedia).
>
> Can you explain in greater detail what your finite-state machine looks
> like? Is it like a directed graph?

The right data structure for this sort of thing is called a "trie",
and Googling it will turn up lots of information.
Though I'm not sure if the original poster knows this term.

Re: Wikipedia Suggest

Evan Martin-2
On 8/3/06, Evan Martin <[hidden email]> wrote:

> On 8/3/06, Timwi <[hidden email]> wrote:
> > Julien Lemoine wrote:
> > > First of all, I grabbed the whole content of the English/French Wikipedia
> > > in June 2006. Then everything is compiled into a finite-state machine
> > > (about 200 MB for the English Wikipedia).
> >
> > Can you explain in greater detail what your finite-state machine looks
> > like? Is it like a directed graph?
>
> The right data structure for this sort of thing is called a "trie",
> and Googling it will turn up lots of information.
> Though I'm not sure if the original poster knows this term.

*grumble reply-to*

Sorry, was meant to be an off-list reply.

Re: Wikipedia Suggest

Steve Bennett-4
In reply to this post by Julien Lemoine
On 8/3/06, Julien Lemoine <[hidden email]> wrote:
> > 1) How do you handle pages with the same title, but different
> > capitalization? They're rare, but they do occur. My suspicion from
> > scanning Analyzer.cpp is that you just take the most popular. However,
> > if it's for search, it would be best to include everything (I think).
>
> Yes, you are right: the most popular one is kept. I wanted to have a
> case-insensitive search, and for the moment I have only one article per
> final node. But it would be better to include everything; I have added
> this to my TODO list too.

I didn't notice this - yes, a case-insensitive search is critical.
Problems with casing are one of the most likely reasons a user will
fail to find a given page. MediaWiki automatically finds the three
most common casing options - all lower, ALL CAPS, Title Caps - but
misses the likely "Caps of Important Words".
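One common way to handle this casing problem (a hypothetical sketch in Python, not anything from Julien's C++ code) is to key the index on a case-folded form of each title while keeping every cased variant at that key, so no title is lost and any casing of the query matches:

```python
# Hypothetical illustration of a case-insensitive title index:
# key on the case-folded title, keep all cased variants as values.
from collections import defaultdict

def build_folded_index(titles):
    index = defaultdict(list)   # folded title -> all cased variants
    for t in titles:
        index[t.casefold()].append(t)
    return index

def lookup(index, query):
    # .get() so an unknown query does not create an empty entry
    return index.get(query.casefold(), [])
```

This keeps both "keep every variant" and "match any casing" without storing each title more than once per variant; only the folded key is extra.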

> I will generate a new index with redirects and give you the size
> before/after this evening.

Redirects should be indicated with what they redirect to, like:

...
Top-rope climbing -> top roping
Top roping
...

Or something - but maybe with the proviso that it doesn't show a
redirect if the actual target is currently being displayed.

Steve

Re: Wikipedia Suggest

Steve Bennett-4
In reply to this post by Julien Lemoine
Just a thought - if you want to *really* impress everyone, incorporate
special handling for disambiguation pages. This could be tricky, but
one of these two methods might work, for a given search string
"Friends":

First, check if "Friends (disambiguation)" exists

1) If it does, simply display it in a panel to the right and let users
click on links in it

2) If it does, load it and attempt to parse it, showing the links in
the list, differentiated somehow.

On second thoughts, 1) is probably better, as you really need the line
of disambiguating text to know whether you want the 1971 film or the
2001 film, or even "Friend (film)". What to do about entries like the
last two (no real link, just synonyms) is tricky...
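Method 1 amounts to a plain existence check on a derived title. A hypothetical sketch (in Python; `page_exists` stands in for whatever title-lookup the service already has, and is not a real API):

```python
# Hypothetical: given an exact-match query, find the disambiguation page
# to show in a side panel, if one exists.
def disambiguation_page(query, page_exists):
    candidate = f"{query} (disambiguation)"
    return candidate if page_exists(candidate) else None
```

Note this only covers titles following the "X (disambiguation)" convention; pages like "Friends" that are themselves disambiguation pages would need a different signal (e.g. a category check).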

Steve

Re: Wikipedia Suggest

Arne 'Timwi' Heizmann
In reply to this post by Evan Martin-2
Evan Martin wrote:

> On 8/3/06, Timwi <[hidden email]> wrote:
>
>>Julien Lemoine wrote:
>>
>>>First of all, I grabbed the whole content of the English/French Wikipedia
>>>in June 2006. Then everything is compiled into a finite-state machine
>>>(about 200 MB for the English Wikipedia).
>>
>>Can you explain in greater detail what your finite-state machine looks
>>like? Is it like a directed graph?
>
> The right data structure for this sort of thing is called a "trie",
> and Googling it will turn up lots of information.
> Though I'm not sure if the original poster knows this term.

I know, and his structure is indeed a trie. But that word is rarely used.
Look at his work: he has implemented a trie perfectly, with a lot of
thought, but without knowing the word "trie" :-p

Timwi


Re: Wikipedia Suggest

Arne 'Timwi' Heizmann
In reply to this post by Julien Lemoine
Julien Lemoine wrote:
> Hello,
>
> Timwi wrote:
>
>>Can you explain in greater detail what your finite-state machine looks
>>like? Is it like a directed graph?
>
> Yes this is a directed graph [... etc. ...]

I'm very impressed! You've clearly thought a lot about this, and
implemented it the right way.

I don't have any further questions, my curiosity is satisfied :)


Re: Wikipedia Suggest

Simetrical
In reply to this post by Julien Lemoine
On 8/3/06, Julien Lemoine <[hidden email]> wrote:
> The algorithm uses very little CPU power. If, for example, all the needed
> data were already in memory, the compiled automaton could be obtained in
> about 1 minute.

Okay, that's good.  But what I meant was, how much CPU power (and
other resources) will the suggestion feature eat on a day-to-day
basis?  I start typing "Metasyntactic variable" into the search box,
it helpfully provides some suggestions based on its index — how much
resources did that just use?  When thousands of people are doing it
every minute, how much will that slow down the servers compared to the
normal load of page views and edits and so on?

Re: Wikipedia Suggest

Steve Bennett-4
In reply to this post by Arne 'Timwi' Heizmann
On 8/3/06, Timwi <[hidden email]> wrote:
> I know, and his is indeed a trie. But that word is rarely used. Look at
> his work: He has perfectly implemented a trie with a lot of thought, but
> without knowing the word "trie" :-p

Oh, you're bringing back memories of computer science... I half expect
someone to mention BSTs or radix trees or something next...

(but please don't :))

Steve