Library to filter HTML

Library to filter HTML

Felipe Ortega
Hi all.

I'm adding some tweaks to the WikiXRay parser for meta-history dumps. I
now extract internal links, external links, and so on, but I'd also like
to extract the plain text (without HTML code and, possibly, also
filtering out wiki tags).

Does anyone know a good Python library to do that? I believe there
should be something out there, as there are bots and crawlers automating
data extraction from one wiki to another.

Thanks in advance for your comments.

Felipe.

Re: Library to filter HTML

Kurt Luther
Hi Felipe,

I've found Beautiful Soup to be a useful Python-based HTML parser.

http://www.crummy.com/software/BeautifulSoup/
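
A minimal sketch of that kind of use, assuming the modern bs4 package
name (early releases imported it as "from BeautifulSoup import
BeautifulSoup"); the html_to_text helper and the sample markup are
illustrative, not from the thread:

    from bs4 import BeautifulSoup

    def html_to_text(html):
        # Parse the rendered page with the stdlib parser.
        soup = BeautifulSoup(html, "html.parser")
        # Drop script/style blocks, whose contents are not article text.
        for tag in soup(["script", "style"]):
            tag.decompose()
        # Collapse the remaining text nodes into one string.
        return soup.get_text(separator=" ", strip=True)

    print(html_to_text("<p>Hello <b>wiki</b> world</p>"))  # Hello wiki world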

Kurt


Re: Library to filter HTML

Brian J Mingus
I've used BeautifulSoup to get plain text out of rendered HTML dumps. It's slow and doesn't work that well. What you really want, to do it right, is an actual MediaWiki parser to strip the syntax out for you.

Try this one: http://code.pediapress.com/wiki/wiki
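
As a rough illustration of what stripping the syntax involves (this is
not the pediapress parser, just a hand-rolled regex sketch; a real
parser copes with nesting, tables and templates far better):

    import re

    def strip_wikitext(text):
        # Hand-rolled illustration only; nested templates and tables
        # need a real parser.
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)         # simple {{templates}}
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # keep link labels
        text = re.sub(r"\[https?://\S+ ([^\]]+)\]", r"\1", text)       # external link labels
        text = re.sub(r"'{2,}", "", text)                  # bold/italic markup
        text = re.sub(r"<[^>]+>", "", text)                # leftover HTML tags
        return text

    print(strip_wikitext("'''Bold''' [[target|label]] and {{cite}} text"))
    # -> Bold label and  text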


Re: Library to filter HTML

Brian J Mingus
s/right/write/. pre-morning coffee still :)


Re: Library to filter HTML

Felipe Ortega
Thanks a lot. Performance is an important issue in this case (think about parsing the entire enwiki).

I'll give it a try and post my comments.

Thanks for the feedback.

Felipe.
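
For enwiki-scale input, the usual trick is to stream the dump rather
than load it whole; a minimal sketch with the standard library (the
file name and the export-0.3 namespace are illustrative, so match them
to the actual dump):

    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.3/}"

    # iterparse yields elements as their closing tags are read, so the
    # whole multi-gigabyte dump never has to fit in memory at once.
    for event, elem in ET.iterparse("enwiki-pages-meta-history.xml"):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            # ... extract text and links per revision here ...
            elem.clear()  # release pages that are already processed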
