Readability examples

Brian
Here are a few readability measure examples. Just a side-by-side
comparison of the text from the GWB article from en.wp and simple.wp,
and de.wp. I plan on parsing en, de and simple in full and exploring
how these measures might be correlated with quality.

PS: Does anyone know of a script that can strip out wiki syntax? This
is pertinent. It will also be necessary to leave only paragraphs of
text in the articles; the data below is noticeably skewed in some (but
not all) of the measures.
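
To make the request concrete, here is a minimal Python sketch of the kind of stripping I have in mind. It is a rough regex pass, not a real wikitext parser, and the function names (strip_wiki_markup, prose_paragraphs) are made up for illustration. It will mishandle tables, nested links in image captions, and anything else unusual.

```python
import re

def strip_wiki_markup(text: str) -> str:
    """Very rough wikitext-to-plain-text pass; not a full parser."""
    # Drop HTML comments and <ref>...</ref> footnotes.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>/]*>.*?</ref>", "", text, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<ref[^>]*/>", "", text, flags=re.IGNORECASE)
    # Remove innermost {{templates}} repeatedly so nested ones disappear too.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # [http://example.org label] -> label; a bare [http://example.org] is dropped.
    text = re.sub(r"\[https?://[^\s\]]+\s+([^\]]+)\]", r"\1", text)
    text = re.sub(r"\[https?://[^\s\]]+\]", "", text)
    # Bold/italic quote runs ('' and ''').
    text = re.sub(r"'{2,}", "", text)
    # Any remaining HTML-style tags (crude: also eats "<...>" in prose).
    text = re.sub(r"<[^>]+>", "", text)
    return text

def prose_paragraphs(text: str) -> list:
    """Keep only blank-line-separated blocks that look like running prose."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", strip_wiki_markup(text)):
        block = block.strip()
        # Skip headings, list items, indented lines, and table rows.
        if not block or block[0] in "=*#:;{|!":
            continue
        paragraphs.append(" ".join(block.split()))
    return paragraphs
```

Feeding only the output of prose_paragraphs to style should remove most of the skew from headings and lists, though I have not measured how much.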

PPS: I recall from the Wikimania meeting that someone had a script to
convert a dump to tab-delimited data. That would be useful to me.
Could someone provide a link?
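
In case no such script turns up, this is roughly what I would expect it to do, as a Python sketch. It assumes the usual XML export layout (page elements containing title and text), and dump_to_tsv is a hypothetical name, not the script from the meeting.

```python
import xml.etree.ElementTree as ET

def local(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def dump_to_tsv(xml_source, tsv_out) -> int:
    """Write one 'title<TAB>text' line per <page>; return the number of pages."""
    rows = 0
    # iterparse streams the file, so the whole dump never sits in memory.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                # With several revisions per page, the last one wins.
                text = child.text or ""
        # Escape backslashes, tabs, and newlines so each page stays on one line.
        flat = (
            text.replace("\\", "\\\\")
            .replace("\t", "\\t")
            .replace("\r", "\\r")
            .replace("\n", "\\n")
        )
        tsv_out.write(f"{title}\t{flat}\n")
        rows += 1
        elem.clear()  # release the page's children once written
    return rows
```

xml_source can be a file name or an open binary file; tsv_out is any text file object.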

Erik: The largest articles take approx. 1/10 of a second to run
through the binary produced by this C code. Using Inline::C in Perl, I
could fairly easily embed the code (style.c from GNU Diction) into your
script. It would take and return strings. "Simple!" =) Otherwise I can
just produce the data as CSV or similar and provide it to you.
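
For the CSV route, a sketch in Python rather than Perl of how the output could be collected. It only reads the "readability grades:" block that style prints (indented "Name: value" lines). parse_grades and grades_to_csv are illustrative names, and the column layout is my assumption, not an agreed format.

```python
import csv
import io
import re

# An indented "Name: number" line, e.g. "        Lix: 51.3 = school year 10".
GRADE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z -]*?):\s*(-?\d+(?:\.\d+)?)")

def parse_grades(style_output: str) -> dict:
    """Collect the 'Name: number' lines under 'readability grades:'."""
    grades = {}
    in_grades = False
    for line in style_output.splitlines():
        if line.strip() == "readability grades:":
            in_grades = True
            continue
        if in_grades:
            match = GRADE_LINE.match(line)
            if not match or not line[:1].isspace():
                break  # the block ends at the next unindented header
            grades[match.group(1)] = float(match.group(2))
    return grades

def grades_to_csv(rows: dict) -> str:
    """rows maps a label (e.g. 'en.wp') to a parse_grades() result."""
    names = sorted({name for grades in rows.values() for name in grades})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source"] + names)
    for label, grades in rows.items():
        writer.writerow([label] + [grades.get(name, "") for name in names])
    return buffer.getvalue()
```

The text after the number on the Lix line ("= school year 10") is ignored here; it could be kept as an extra column if it turns out to be useful.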

See [[Readability]] and Google to get an idea of what these
readability grades mean. Briefly:
All of these are explained quite simply at: http://www.readability.info/info.shtml
Kincaid: http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test#Flesch-Kincaid_Grade_Level
ARI: http://en.wikipedia.org/wiki/Automated_Readability_Index
Coleman-Liau: http://en.wikipedia.org/wiki/Coleman-Liau_Index
Flesch Index: http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test#Flesch_Reading_Ease
Fog Index: http://en.wikipedia.org/wiki/Gunning-Fog_Index
Lix: http://www.readability.info/info.shtml
SMOG-Grading: http://en.wikipedia.org/wiki/SMOG_Index
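
To give a feel for what goes into two of these, here are the standard published formulas for the Flesch Reading Ease score and the Flesch-Kincaid grade level as a small Python sketch. Whether style computes them in exactly this way is an assumption on my part, and the counts in the usage lines are invented for illustration.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher is easier; driven by sentence length and syllables per word."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Same two inputs, rescaled to a US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Made-up text: 100 words in 5 sentences with 150 syllables.
ease = flesch_reading_ease(100, 5, 150)    # about 59.6
grade = flesch_kincaid_grade(100, 5, 150)  # about 9.9
```

Both depend only on average sentence length and average syllables per word, which is why stray headings and list fragments counted as "sentences" can move them so much.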

This data is very easy to reproduce. For each article I give a Unix
command that assumes you have installed the lynx text browser, whose
-dump option strips out the HTML and leaves text, and the GNU Diction
package, which provides the style command. style supports English and
German.
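
If you would rather drive the same pipeline from a script, a minimal Python sketch follows. It assumes lynx and style are on the PATH and uses only the options that appear in the commands in this mail (-dump, and -L for the language). The helper names are hypothetical, and writing the title with underscores instead of spaces is my choice; the wikis treat the two forms as the same page.

```python
import subprocess
from urllib.parse import quote

def article_url(wiki: str, title: str) -> str:
    """Build the article URL, with spaces in the title written as underscores."""
    return f"http://{wiki}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"

def style_command(language=None) -> list:
    """Plain 'style' as in the English commands, or 'style -L de' for German."""
    return ["style"] + (["-L", language] if language else [])

def run_style(wiki: str, title: str, language=None) -> str:
    """Equivalent of: lynx -dump URL | style [-L language]."""
    page = subprocess.run(
        ["lynx", "-dump", article_url(wiki, title)],
        capture_output=True, text=True, check=True,
    ).stdout
    return subprocess.run(
        style_command(language),
        input=page, capture_output=True, text=True, check=True,
    ).stdout
```

For example, run_style("de", "George W. Bush", language="de") would return the same kind of report as the de.wp command, as one string.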

----------------------------------------------------------------
[[George W. Bush]] on en.wp:
lynx -dump http://en.wikipedia.org/wiki/"George W. Bush" | style
YMMV: I removed all the hyperlinks in this article before running style
----------------------------------------------------------------
readability grades:
        Kincaid: 11.7
        ARI: 13.5
        Coleman-Liau: 12.8
        Flesch Index: 54.0
        Fog Index: 15.3
        Lix: 51.3 = school year 10
        SMOG-Grading: 13.1
sentence info:
        60081 characters
        12376 words, average length 4.85 characters = 1.52 syllables
        513 sentences, average length 24.1 words
        58% (299) short sentences (at most 19 words)
        18% (97) long sentences (at least 34 words)
        65 paragraphs, average length 7.9 sentences
        0% (3) questions
        22% (114) passive sentences
        longest sent 294 wds at sent 507; shortest sent 1 wds at sent 5
word usage:
        verb types:
        to be (155) auxiliary (49)
        types as % of total:
        conjunctions 4% (544) pronouns 3% (336) prepositions 11% (1311)
        nominalizations 3% (311)
sentence beginnings:
        pronoun (47) interrogative pronoun (3) article (40)
        subordinating conjunction (23) conjunction (5) preposition (40)

----------------------------------------------------------------
[[George W. Bush]] on simple.wp:
lynx -dump http://simple.wikipedia.org/wiki/"George W. Bush" | style
----------------------------------------------------------------
readability grades:
        Kincaid: 3.3
        ARI: 0.7
        Coleman-Liau: 6.0
        Flesch Index: 88.6
        Fog Index: 6.5
        Lix: 23.6 = below school year 5
        SMOG-Grading: 7.4
sentence info:
        8659 characters
        2344 words, average length 3.69 characters = 1.28 syllables
        248 sentences, average length 9.5 words
        65% (163) short sentences (at most 4 words)
        10% (26) long sentences (at least 19 words)
        14 paragraphs, average length 17.7 sentences
        0% (0) questions
        10% (27) passive sentences
        longest sent 253 wds at sent 39; shortest sent 1 wds at sent 4
word usage:
        verb types:
        to be (40) auxiliary (1)
        types as % of total:
        conjunctions 1% (24) pronouns 1% (33) prepositions 4% (95)
        nominalizations 1% (24)
sentence beginnings:
        pronoun (10) interrogative pronoun (0) article (3)
        subordinating conjunction (3) conjunction (1) preposition (2)
----------------------------------------------------------------
[[George W. Bush]] on de.wp:
lynx -dump http://de.wikipedia.org/wiki/"George W. Bush" | style -L de
----------------------------------------------------------------
readability grades:
        Kincaid: 8.0
        ARI: 6.7
        Coleman-Liau: 12.3
        Flesch Index: 57.7
        Fog Index: 10.8
        Lix: 34.4 = school year 5
        SMOG-Grading: 5.3
sentence info:
        37740 characters
        7909 words, average length 4.77 characters = 1.63 syllables
        694 sentences, average length 11.4 words
        63% (441) short sentences (at most 6 words)
        16% (116) long sentences (at least 21 words)
        56 paragraphs, average length 12.4 sentences
        0% (2) questions
        6% (44) passive sentences
        longest sent 274 wds at sent 256; shortest sent 1 wds at sent 191
sentence beginnings:
        pronoun (14) interrogative pronoun (3) article (37)

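As a sanity check on the numbers above, the ARI grade can be recomputed from the character, word, and sentence counts that style reports. This Python sketch uses the standard ARI formula; that style uses the same constants is an assumption, although the results agree with the three ARI values listed.

```python
def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """ARI from raw counts: characters per word and words per sentence."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# (characters, words, sentences) taken from the "sentence info" blocks above.
COUNTS = {
    "en.wp": (60081, 12376, 513),
    "simple.wp": (8659, 2344, 248),
    "de.wp": (37740, 7909, 694),
}

ari = {
    name: round(automated_readability_index(*counts), 1)
    for name, counts in COUNTS.items()
}
print(ari)  # {'en.wp': 13.5, 'simple.wp': 0.7, 'de.wp': 6.7}
```

The other grades need syllable or long-word counts, which style reports only as rounded averages or not at all, so they cannot be reproduced this exactly from the output alone.
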
----------------------------------------------------------------
Cheers,
Brian Mingus
_______________________________________________
Wiki-research-l mailing list
[hidden email]
http://mail.wikipedia.org/mailman/listinfo/wiki-research-l

Re: Readability examples

Brian
My 'a' key is sticky, sorry for the lack of readability of my e-mail =)

On 8/18/06, Brian <[hidden email]> wrote:

> Here are a few readability measure examples. Just a side-by-side
> comparison of the text from the GWB article from en.wp and simple.wp,
> and de.wp. I plan on parsing en, de and simple in full and exploring
> how these measures might be correlated with quality.