Readability examples


==Reposting with better title to get included in proper thread==

> ps: Does anyone know of a script that can strip out wiki syntax? This
> is pertinent. It will also be necessary to leave only paragraphs of
> text in the articles; the data below is noticeably skewed in some (but
> not all) of the measures.

Brian, here is an initial response:

Below is some Perl code from the WikiCounts job that strips much of the markup.
It is used to get cleaner text for the word count and the article size in chars.
It is not 100% accurate, and not all markup is removed, but these regexps
already slow down the whole job considerably.
The result is at least far closer to a decent word count than running wc on
the raw data would be.

        $article =~ s/\'\'+//go ; # strip bold/italic formatting
        $article =~ s/\<[^\>]+\>//go ; # strip <...> html

  #     these are valid UTF-8 chars, but it takes way too long to process,
  #     I combine those in one set
  #     $article =~  s/[\xc0-\xdf][\x80-\xbf]|
  #                     [\xe0-\xef][\x80-\xbf]{2}|
  #                     [\xf0-\xf7][\x80-\xbf]{3}/x/gxo ;

  #     this one set selects UTF-8 faster (with 99.9% accuracy I would say)
        $article =~  s/[\xc0-\xf7][\x80-\xbf]+/x/gxo ; # count unicode chars as one char

        $article =~ s/\&\w+\;/x/go ;   # count html chars as one char
        $article =~ s/\&\#\d+\;/x/go ; # count html chars as one char

        # strip image/category/interwiki links
        # (a few internal links with a colon in the title will get lost too)
        $article =~ s/\[\[ [^\:\]]+ \: [^\]]* \]\]//gxoi ;
        $article =~ s/http \: [\w\.\/]+//gxoi ; # strip external links

        $article =~ s/\=\=+ [^\=]* \=\=+//gxo ; # strip headers
        # strip linebreaks + unordered list tags (other lists are relatively scarce)
        $article =~ s/\n\**//go ;
        $article =~ s/\s+/ /go ; # remove extra spaces
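
To give a rough idea of the effect, here is a small self-contained sketch that wraps a subset of the substitutions above in a sub and compares a naive whitespace word count before and after stripping. The sub name strip_markup and the sample text are made up for illustration, and the UTF-8, numeric entity, external link and linebreak steps are left out to keep it short; this is not the WikiCounts code itself.

```perl
use strict;
use warnings;

# Hypothetical wrapper around a subset of the regexps shown above.
sub strip_markup {
    my ($article) = @_;
    $article =~ s/\'\'+//g;                           # bold/italic formatting
    $article =~ s/\<[^\>]+\>//g;                      # <...> html tags
    $article =~ s/\&\w+\;/x/g;                        # named entity counts as one char
    $article =~ s/\[\[ [^\:\]]+ \: [^\]]* \]\]//gxi;  # image/category/interwiki links
    $article =~ s/\=\=+ [^\=]* \=\=+//gx;             # headers
    $article =~ s/\s+/ /g;                            # collapse whitespace
    return $article;
}

# Made-up sample article.
my $raw = "== History ==\n'''Perl''' is a <b>language</b>.\n[[Category:Languages]]\n";
my $clean = strip_markup($raw);

# split ' ' behaves like wc -w: split on runs of whitespace.
my @raw_words   = split ' ', $raw;
my @clean_words = split ' ', $clean;

printf "raw: %d words, stripped: %d words\n",
    scalar @raw_words, scalar @clean_words;
# prints "raw: 8 words, stripped: 4 words"
```

On this sample the raw count is double the stripped one, because the header markers, the category link and the header word itself all count as words in the raw text.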

The actual code is a bit more complicated, as it
tries to find a decent solution for ja/zh/ko.
Also, numbers are counted as one word (including embedded points and commas).

         if ($language eq "ja")
         { $words = int ($unicodes * 0.37) ; }
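
A minimal sketch of how these two counting rules might fit together, under stated assumptions: the sub name count_words and the token regexp are mine, and $unicodes is simply passed in as the number of multi-byte characters found in the article. Only the 0.37 factor for ja is taken from the snippet above; zh and ko are not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper, not the WikiCounts implementation.
#   $text     : article text after markup stripping
#   $language : wiki language code, e.g. "en" or "ja"
#   $unicodes : number of multi-byte characters counted in the article
sub count_words {
    my ($text, $language, $unicodes) = @_;

    # Japanese text has no spaces between words, so the word count is
    # estimated from the character count (factor from the snippet above).
    return int($unicodes * 0.37) if $language eq "ja";

    # First alternative: a number with embedded points and commas,
    # such as "1,234.5", is matched as a single token.
    # Second alternative: any run of non-space, non-punctuation chars.
    my @tokens = $text =~ /\d+(?:[.,]\d+)*|[^\s.,;:!?()"]+/g;
    return scalar @tokens;
}

print count_words("It has 1,234.5 km of road.", "en", 0), "\n";   # prints 6
print count_words("", "ja", 1000), "\n";                          # prints 370
```

Putting the number pattern first in the alternation is what keeps "1,234.5" together; otherwise the comma and the point would split it into three tokens.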

> pss: I recall from the Wikimania meeting that someone had a script to
> convert a dump to tab-delimited data. That would be useful to me...
> could someone provide a link?

> Erik: The largest of articles takes approx. 1/10 of a second running
> the binary produced by this C code. Using Inline::C in perl, I could
> fairly easily embed the code (style.c from GNU Diction) into your
> script. It would take and return strings. "Simple!" =) Otherwise I can
> just produce the data in csv etc.. and provide it to you.

Questions and caveats:
1/10 sec x 2 million articles (early 2007) is about 55 hours. Plus German is 80
hours. Of course, you say 1/10 sec is for the largest articles only.
Still, it adds up quickly when all months are processed, and running
WikiCounts incrementally, only adding data for the last month, has its
drawbacks, as explained in our meeting at Wikimania. Is it 1/10 sec for all
tests combined? Could we limit ourselves to the better-researched tests, or to
the tests that are supported in more languages or deemed more sensible anyway?
I would prefer tests that work in all alphabet-based languages. When wiki
syntax is introduced that is not stripped by the regexps above or by some other
tool, it would produce artificial drift in the results over the months.
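
The worst-case estimate above is simple arithmetic; a few lines make it easy to redo with other figures. The sub name hours_needed is made up, and the inputs are the rounded numbers from the text (2 million articles, 0.1 sec each).

```perl
use strict;
use warnings;

# Hours needed to process $articles at $secs_each seconds per article.
sub hours_needed {
    my ($articles, $secs_each) = @_;
    return $articles * $secs_each / 3600;   # 3600 seconds per hour
}

# Worst case from the text: 0.1 sec for each of 2 million articles.
printf "%.1f hours\n", hours_needed(2_000_000, 0.1);   # prints "55.6 hours"
```

This is for a single pass over one month of data; processing every month separately multiplies the cost accordingly.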

> This data is very easy to reproduce. I provide a unix command for each
> that assumes you have installed the lynx text browser, which has a
> dump command to strip out html and leave text, and the GNU Diction
> package, which provides style. Style supports English/German.

Stripping html is already done; see above.

I could imagine we run these tests on a yet-to-be-determined sample of all
articles to save processing costs.
Tracking 10,000 or 50,000 articles from month to month, if chosen properly
(randomly?), should give decent results.
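
One possible way to pick such a sample, sketched here as an assumption rather than a decided approach: derive the choice from a hash of the article title, so that the same articles are selected every month without storing a list. The sub name in_sample and the sampling rate are made up; Digest::MD5 is a core Perl module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: decide from the title alone whether an article
# belongs to the sample. $rate is the fraction of articles to keep,
# e.g. 0.005 for roughly 10,000 out of 2 million.
sub in_sample {
    my ($title, $rate) = @_;
    my $bytes = $title;
    utf8::encode($bytes);    # MD5 works on bytes, titles may be non-ASCII

    # Map the first 7 hex digits (28 bits) of the digest to [0, 1).
    my $bucket = hex(substr(md5_hex($bytes), 0, 7)) / 16**7;
    return $bucket < $rate;
}

# Made-up titles, just to show the mechanics.
my @titles  = map { "Article $_" } 1 .. 20_000;
my @sampled = grep { in_sample($_, 0.005) } @titles;
printf "%d of %d titles sampled\n", scalar @sampled, scalar @titles;
```

Because the decision depends only on the title, an article that is in the sample one month is in it the next month too, new articles enter at the same rate, and deleted or renamed articles drop out.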

Cheers, Erik Zachte

Wiki-research-l mailing list