Finding and fixing cyclic dependencies

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Finding and fixing cyclic dependencies

Daniel Kinzler-3
Hi all!

I invite you to try out "Project Ruprecht"[1][2], a tool that measures the
"tangledness" of PHP code, and provides you with a "naughty list" of things to fix.

For now, you will have to install this locally. I hope however to soon have this
run automatically against core on a regular basis, perhaps by integrating it
with SonarQube. Maybe some day we can also integrate it with CI, to generate a
warning when a new cyclic dependency is about to be introduced.

So that was the tl;dr. Now for some context, history, and shout-outs. And some
actual real world science, too!

For a while now, I have been talking about the how much of a problem cyclic
dependencies are in MediaWiki core: When two components (classes, namespaces,
libraries, whatever) depend on each other, directly or indirectly, this means
that one cannot be used without the other, nor can it be tested, understood, or
modified without also considering the other. So, in effect, they behave as *one*
component, not two. Applied to MW core, this means that roughly half of our 1600
classes effectively behave like a single giant class. This makes the code rather
hard to deal with.

To fix this, I have been looking for tools that let me identify "tangles" of
classes that depend on each other, and metrics' to measure the progress of
"untangling" the code. However, the classic code quality metrics focus on
"local" properties of the code, so they can't tell us much about the progress of
untangling. And the tools I found that would detect cyclic dependencies in PHP
code would all choke on MediaWiki core: they would try to list all detected
cycles - which, by the super-exponential nature of possible paths through a
graph, would be millions and millions. So, the tools would choke and die. That
approach isn't practical for us.

Two discoveries allowed me to come up with a working solution:
First, I decided to leave the PHP world and turned towards graph analysis tools
built for large data sets. Python's graph-tool did the trick. It's build on top
of boost and numpy, and it's *fast*. It crunched through the 7500 or so class
dependencies in MW core in a split second, and told me that we have 14 "tangles"
(non-trivial strongly connected components), and that 43% of our classes are in
these tangles, with 40% being part of one big tangle that is essentially our
monolith manifest. So now I had a metric to work with: the number of classes in
tangles.

That was great, but still didn't tell me where to start. Graph-tool was still
not fast enough to deal with millions of cycles, and even if it had been, that
data wouldn't be very useful. I needed some smart heuristics. Luckily, I
(totally unintentionally, promise!) nerd sniped[5] Amir Sarabadani one evening
at the WMDE office by telling him about this problem. The next day, he told me
that he had been digging into the problem all night, and he had found a paper
that sounded relevant, and it also came with working code: "Breaking Cycles in
Noisy Hierarchies"[3] by J. Sun, D. Ajwani, P.K. Nicholson, A. Sala, and S.
Parthasarathy. I played with the code a bit, and yes! It spat out a list of 290
or so dependencies[4] that it thought were bad - and I agree for a good number
of them. It's not a clean working list, but it gives a very good idea of where
to start looking.

I find it quite fascinating that this works so well for cleaning up a codebase.
After all, the heuristic wasn't design for this - it was designed for fixing
messy ontologies. Indeed, one of their test data sets was (English language)
Wikipedia's category system! I'd love to see what it does with Wikidata's
subclass hierarchy :)

But I suppose it makes sense - dependencies in software are conceptually a lot
like an ontology, and the same strategies of stratification and abstraction
apply. And the same difficulties, too - it's easy enough to spot a problematic
cycle, but often hard to say where it should be cut. And how to cut it - often,
the solution is not to just remove the dependency, but to introduce a new
abstraction that allows the relationship to exist without a cycle. I'd love to
see the research continue in that direction!

So, a big shout out to the researchers, and to Amir who found the paper!

I hope my ramblings have made you curious to play with Ruprecht, and see what it
has to say about other code bases. There's also another feature to play with
which I haven't discussed here: detection of risky classes using the Page Rank
algorithm. Fun!

Cheers,
Daniel

[1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/
[2]
https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-analysis
[3] https://github.com/zhenv5/breaking_cycles_in_noisy_hierarchies
[4] https://phabricator.wikimedia.org/P8513
[5] https://xkcd.com/356/

--
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Finding and fixing cyclic dependencies

Amir Sarabadani-2
I wonder how many points you get for nerd snipping a software engineer :)

Wonderful work Daniel, kudos!
In the given list of bad dependencies, I find User class depending on Skin
class one of the most problematic ones. Also, ApiQuery class depending on
all of its subclasses is one of the biggest issue making all of API query
modules act as giant big monolith. That can (hopefully) easily addressed
and untangle significant portion of the 40%, a rather low hanging fruit.

Best

On Fri, May 10, 2019 at 6:39 PM Daniel Kinzler <[hidden email]>
wrote:

> Hi all!
>
> I invite you to try out "Project Ruprecht"[1][2], a tool that measures the
> "tangledness" of PHP code, and provides you with a "naughty list" of
> things to fix.
>
> For now, you will have to install this locally. I hope however to soon
> have this
> run automatically against core on a regular basis, perhaps by integrating
> it
> with SonarQube. Maybe some day we can also integrate it with CI, to
> generate a
> warning when a new cyclic dependency is about to be introduced.
>
> So that was the tl;dr. Now for some context, history, and shout-outs. And
> some
> actual real world science, too!
>
> For a while now, I have been talking about the how much of a problem cyclic
> dependencies are in MediaWiki core: When two components (classes,
> namespaces,
> libraries, whatever) depend on each other, directly or indirectly, this
> means
> that one cannot be used without the other, nor can it be tested,
> understood, or
> modified without also considering the other. So, in effect, they behave as
> *one*
> component, not two. Applied to MW core, this means that roughly half of
> our 1600
> classes effectively behave like a single giant class. This makes the code
> rather
> hard to deal with.
>
> To fix this, I have been looking for tools that let me identify "tangles"
> of
> classes that depend on each other, and metrics' to measure the progress of
> "untangling" the code. However, the classic code quality metrics focus on
> "local" properties of the code, so they can't tell us much about the
> progress of
> untangling. And the tools I found that would detect cyclic dependencies in
> PHP
> code would all choke on MediaWiki core: they would try to list all detected
> cycles - which, by the super-exponential nature of possible paths through a
> graph, would be millions and millions. So, the tools would choke and die.
> That
> approach isn't practical for us.
>
> Two discoveries allowed me to come up with a working solution:
> First, I decided to leave the PHP world and turned towards graph analysis
> tools
> built for large data sets. Python's graph-tool did the trick. It's build
> on top
> of boost and numpy, and it's *fast*. It crunched through the 7500 or so
> class
> dependencies in MW core in a split second, and told me that we have 14
> "tangles"
> (non-trivial strongly connected components), and that 43% of our classes
> are in
> these tangles, with 40% being part of one big tangle that is essentially
> our
> monolith manifest. So now I had a metric to work with: the number of
> classes in
> tangles.
>
> That was great, but still didn't tell me where to start. Graph-tool was
> still
> not fast enough to deal with millions of cycles, and even if it had been,
> that
> data wouldn't be very useful. I needed some smart heuristics. Luckily, I
> (totally unintentionally, promise!) nerd sniped[5] Amir Sarabadani one
> evening
> at the WMDE office by telling him about this problem. The next day, he
> told me
> that he had been digging into the problem all night, and he had found a
> paper
> that sounded relevant, and it also came with working code: "Breaking
> Cycles in
> Noisy Hierarchies"[3] by J. Sun, D. Ajwani, P.K. Nicholson, A. Sala, and S.
> Parthasarathy. I played with the code a bit, and yes! It spat out a list
> of 290
> or so dependencies[4] that it thought were bad - and I agree for a good
> number
> of them. It's not a clean working list, but it gives a very good idea of
> where
> to start looking.
>
> I find it quite fascinating that this works so well for cleaning up a
> codebase.
> After all, the heuristic wasn't design for this - it was designed for
> fixing
> messy ontologies. Indeed, one of their test data sets was (English
> language)
> Wikipedia's category system! I'd love to see what it does with Wikidata's
> subclass hierarchy :)
>
> But I suppose it makes sense - dependencies in software are conceptually a
> lot
> like an ontology, and the same strategies of stratification and abstraction
> apply. And the same difficulties, too - it's easy enough to spot a
> problematic
> cycle, but often hard to say where it should be cut. And how to cut it -
> often,
> the solution is not to just remove the dependency, but to introduce a new
> abstraction that allows the relationship to exist without a cycle. I'd
> love to
> see the research continue in that direction!
>
> So, a big shout out to the researchers, and to Amir who found the paper!
>
> I hope my ramblings have made you curious to play with Ruprecht, and see
> what it
> has to say about other code bases. There's also another feature to play
> with
> which I haven't discussed here: detection of risky classes using the Page
> Rank
> algorithm. Fun!
>
> Cheers,
> Daniel
>
> [1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/
> [2]
>
> https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-analysis
> [3] https://github.com/zhenv5/breaking_cycles_in_noisy_hierarchies
> [4] https://phabricator.wikimedia.org/P8513
> [5] https://xkcd.com/356/
>
> --
> Daniel Kinzler
> Principal Software Engineer, Core Platform
> Wikimedia Foundation
>


--
Amir (he/him)
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Finding and fixing cyclic dependencies

Daniel Kinzler-3
Am 13.05.19 um 12:48 schrieb Amir Sarabadani:
> I wonder how many points you get for nerd snipping a software engineer :)

One, as it's rather easy ;)

> Wonderful work Daniel, kudos!

Thank you for your help!

> In the given list of bad dependencies, I find User class depending on Skin class
> one of the most problematic ones.

It's nasty on the level of the dependency graph, but it's a trivial dependency,
and easily resolved using a trait or utility method:
  Skin::normalizeKey( $wgDefaultSkin );

Want to give it a go?

Thinking this one step further, all logic related to user options should be
factored out of the User class anyway. Filed as
<https://phabricator.wikimedia.org/T223099Y>

> Also, ApiQuery class depending on all of its
> subclasses is one of the biggest issue making all of API query modules act as
> giant big monolith. That can (hopefully) easily addressed and untangle
> significant portion of the 40%, a rather low hanging fruit.

Yea... these dependencies a<re however not "real" - they are all dependencies on
a class name literal (ApiQueryFooBar::class), not the class itself. I have filed
this as a feature request with upstream
<https://github.com/mihaeu/dephpend/issues/47>.

Cheers,
Daniel

--
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Finding and fixing cyclic dependencies

Antoine Musso-3
In reply to this post by Daniel Kinzler-3
On 10/05/2019 18:39, Daniel Kinzler wrote:
> Hi all!
>
> I invite you to try out "Project Ruprecht"[1][2], a tool that measures the
> "tangledness" of PHP code, and provides you with a "naughty list" of things to fix.
>
<snip>
> [1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/
> [2]
> https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-analysis

Hello,

A few months ago, the code health group enquired about the Php metrics
static analysis tool [1] and it has been rather straightforward to run
it automatically and publish its report:

https://doc.wikimedia.org/mediawiki-core/master/phpmetrics/


The report has a list of potential coupling and also show relation
between objects.  That might be a complement.


As for your tool, I am pretty sure we can easily run it automatically
and publish it next to the PHP Metrics report?   The addition to CI has
been rather straightforward:

A container:
https://gerrit.wikimedia.org/r/#/c/integration/config/+/469689/

The Jenkins job:
https://gerrit.wikimedia.org/r/#/c/integration/config/+/469690/


[1] https://www.phpmetrics.org/

cheers,

--
Antoine "hashar" Musso


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Finding and fixing cyclic dependencies

Daniel Kinzler-3
Hey Antoine!

Am 13.05.19 um 20:59 schrieb Antoine Musso:

> Hello,
>
> A few months ago, the code health group enquired about the Php metrics static
> analysis tool [1] and it has been rather straightforward to run it automatically
> and publish its report:
>
> https://doc.wikimedia.org/mediawiki-core/master/phpmetrics/
>
> The report has a list of potential coupling and also show relation between
> objects.  That might be a complement.

Yea, that's what I used for my initial analysis for the session on decoupling at
TechConf in Portland :)

phpmetrics and similar tools are useful, but they do not analyze the transitive
dependencies (at least not sufficiently, for my use case). That is, they can't
tell me whether two classes are coupled, but they can't tell me which classes to
decouple to resolve clusters of tightly coupled code. And they can't measure how
"tangled" the codebase is overall, just how "good" or "bad" a given class is (or
all classes are, on average).

> As for your tool, I am pretty sure we can easily run it automatically and
> publish it next to the PHP Metrics report?   The addition to CI has been rather
> straightforward:
>
> A container:
> https://gerrit.wikimedia.org/r/#/c/integration/config/+/469689/
>
> The Jenkins job:
> https://gerrit.wikimedia.org/r/#/c/integration/config/+/469690/

Ruprecht comes with a few dependencies that may make containerization less
straight forward. Not terribly hard, but somewhat annoying. E.g. it needs python
1 *and* 3, it needs graph-tool which has to be installed as a debian package
from a non-standard repo, etc.

If you feel like looking into that, I'd of course be happy, of course ;)

--
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Reply | Threaded
Open this post in threaded view
|

Re: Finding and fixing cyclic dependencies

Antoine Musso-3
Guten Tag,

Time flow, and I forgot to reply to that email :-\

On 13/05/2019 23:11, Daniel Kinzler wrote:

> Hey Antoine!
>
> Am 13.05.19 um 20:59 schrieb Antoine Musso:
>> Hello,
>>
>> A few months ago, the code health group enquired about the Php metrics static
>> analysis tool [1] and it has been rather straightforward to run it automatically
>> and publish its report:
>>
>> https://doc.wikimedia.org/mediawiki-core/master/phpmetrics/
>>
>> The report has a list of potential coupling and also show relation between
>> objects.  That might be a complement.
>
> Yea, that's what I used for my initial analysis for the session on decoupling at
> TechConf in Portland :)
>
> phpmetrics and similar tools are useful, but they do not analyze the transitive
> dependencies (at least not sufficiently, for my use case). That is, they can't
> tell me whether two classes are coupled, but they can't tell me which classes to
> decouple to resolve clusters of tightly coupled code. And they can't measure how
> "tangled" the codebase is overall, just how "good" or "bad" a given class is (or
> all classes are, on average).

Oh so yeah that is going way deeper in the graph analysis. I would guess
someone with a background in graph theory would be able to help on that
front.  I love the topic, but not enough to actually learn it :-\


>> As for your tool, I am pretty sure we can easily run it automatically and
>> publish it next to the PHP Metrics report?   The addition to CI has been rather
>> straightforward:
>>
>> A container:
>> https://gerrit.wikimedia.org/r/#/c/integration/config/+/469689/
>>
>> The Jenkins job:
>> https://gerrit.wikimedia.org/r/#/c/integration/config/+/469690/
>
> Ruprecht comes with a few dependencies that may make containerization less
> straight forward. Not terribly hard, but somewhat annoying. E.g. it needs python
> 1 *and* 3, it needs graph-tool which has to be installed as a debian package
> from a non-standard repo, etc.
>
> If you feel like looking into that, I'd of course be happy, of course ;)

There is no python 1?  I am assuming you are referring to python2.7.
They can be coinstalled in the same Debian based container, they simply
are different binaries: /usr/bin/python2 /usr/bin/python3. Should be easy.

Depends on what a non standard repo is? The CI images are build with the
non-standard repository apt.wikimedia.org, we have some PHP packages
from sury.org (the person who also package PHP for Debian) and others.
It is super easy, has long as the packages are signed with a gpg key.


If you could please fill a task against #continuous-integration-config
in Phabricator with a rough breakdown of how to setup the stack. I am
sure it can be done rather quickly.

--
Antoine "hashar" Musso


_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l