Analytics clients (stat/notebook hosts) and backups of home directories

Luca Toscano
Hi everybody,

as part of https://phabricator.wikimedia.org/T201165 the Analytics team
decided to reach out to everybody to make it clear that all the home
directories on the stat/notebook nodes are not backed up periodically. They
run on a software RAID configuration spanning multiple disks, of course, so
we are resilient to a disk failure, but, even if unlikely, it might happen
that a host loses all its data. Please keep this in mind when working
on important projects and/or handling important data that you care about.

I just added a warning to
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
If you have really important data that is too big to back up, keep in mind
that you can use your home directory (/user/your-username) on HDFS (which
replicates data three times across multiple nodes).
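
For readers who have not copied files into HDFS before, here is a minimal Python sketch of the idea above. It only wraps the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` shell commands. It assumes the `hdfs` client is on your PATH, that your shell username matches your HDFS username, and that you are already authenticated. The function names and the "important-data" subdirectory are hypothetical, not an Analytics convention.

```python
import getpass
import posixpath
import subprocess


def hdfs_home(username=None):
    """Return the HDFS home directory path, /user/<username>."""
    return posixpath.join("/user", username or getpass.getuser())


def hdfs_put_command(local_path, username=None, subdir="important-data"):
    """Build the `hdfs dfs -put` command that copies local_path into HDFS."""
    # "important-data" is a hypothetical subdirectory name, not a convention.
    target_dir = posixpath.join(hdfs_home(username), subdir)
    # -f overwrites an existing copy with the same name.
    return ["hdfs", "dfs", "-put", "-f", local_path, target_dir + "/"]


def copy_to_hdfs(local_path, username=None, subdir="important-data"):
    """Create the target directory if needed, then copy the file into it."""
    target_dir = posixpath.join(hdfs_home(username), subdir)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target_dir], check=True)
    subprocess.run(hdfs_put_command(local_path, username, subdir), check=True)
```

Calling copy_to_hdfs("results.csv") from a stat host would, under those assumptions, place the file under /user/your-username/important-data/ on HDFS.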

Please let us know if you have comments, suggestions, etc. in the
aforementioned task.

Thanks in advance!

Luca (on behalf of the Analytics team)
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Analytics clients (stat/notebook hosts) and backups of home directories

Leila Zia
Hi Luca,

Thanks for the heads up. Isaac is coordinating a response from the
Research side.

I have one question for you: As you allow/encourage more copies of
the files to exist, what mechanism would you like to put in place
to reduce the chances of PII being copied into new folders that will
then be even harder (for your team) to keep track of? Having an
explicit process/understanding about this will be very helpful.

Thanks,
Leila


On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <[hidden email]> wrote:

> [...]

Re: [Analytics] Analytics clients (stat/notebook hosts) and backups of home directories

Luca Toscano
Hi Leila and Kate,

adding a few words after Nuria's email to clarify my original intentions.
My point was that any important and vital file that needs to be preserved
can be stored in HDFS rather than on the stat/notebook hosts, due to the
absence of backups of the home directories. My concern was that people had
a different understanding about backups, and I wanted to clarify.

We (as the Analytics team) don't have any good way at the moment to
periodically scan HDFS and home directories across hosts to find PII data
that is retained for longer than the allowed period of time. The main reason
is that we'd need to find a way to check a huge number of files, with
different names and formats, and figure out whether the data contained in
them is PII and has been retained for more than X days. It is not an
impossible task, but it is not easy or trivial either; we'd need a lot more
staff, in my opinion, to create and maintain something like that :)

We recently started cleaning up old home directories (i.e. those belonging
to users who are no longer active), and we established a process with SRE
to get pinged when a user is offboarded, to verify what data should be kept
and what not (I know that both of you are aware of this since you have been
working with us on several tasks; I am writing it to give other people the
context :). This is only a starting point; I really hope to have something
more robust and complete in the future.

In the meantime, I'd say that every user is responsible for the data that
they handle on the Analytics infrastructure, periodically reviewing it and
deleting it when necessary. I don't have a specific guideline/process to
suggest, but we can definitely have a chat together and decide on something
shared among our teams!
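
To make the periodic review mentioned above a bit more concrete, here is a minimal Python sketch that lists files under a directory that have not been modified for more than a given number of days. It is not an Analytics-provided tool: the function name files_older_than and any threshold you pass to it are hypothetical, and it only looks at modification time on a local filesystem. It cannot tell whether a file contains PII, so it is a starting point for a manual review, nothing more.

```python
import os
import time


def files_older_than(root, max_age_days, now=None):
    """Yield (path, age_in_days) for files under root whose last
    modification is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so broken links are fine.
                mtime = os.lstat(path).st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if mtime < cutoff:
                yield path, (now - mtime) / 86400
```

For example, iterating over files_older_than(os.path.expanduser("~"), 90) on a stat host would give the candidates to look at first; deciding what to delete remains a human decision.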

Let me know if this makes sense or not :)

Thanks,

Luca

On Wed, Jul 10, 2019 at 23:15 Nuria Ruiz <[hidden email]> wrote:

> >I have one question for you: As you allow/encourage more copies of
> >the files to exist
> To be extra clear, we do not encourage keeping data on the notebook
> hosts at all; they have no capacity to either process or host
> large amounts of data. Data that you are working with is best placed in
> your /user/your-username database in Hadoop, so far from encouraging
> multiple copies, we are rather encouraging you to keep the data off the
> notebook machines.
>
> Thanks,
>
> Nuria
>
> On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <[hidden email]>
> wrote:
>
>> I second Leila's question. The issue of how we flag PII data and ensure
>> it's appropriately scrubbed came up in our team meeting yesterday. We're
>> discussing team practices for data/project backups tomorrow and plan to
>> come out with some proposals, at least for the short term.
>>
>> Are there any existing processes or guidelines I should be aware of?
>>
>> Thanks!
>> Kate
>>
>> --
>>
>> Kate Zimmerman (she/they)
>> Head of Product Analytics
>> Wikimedia Foundation
>>
>>
>> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <[hidden email]> wrote:
>>
>>> [...]

Re: [Analytics] Analytics clients (stat/notebook hosts) and backups of home directories

Leila Zia
All clear, Luca and Nuria. Thanks!


On Thu, Jul 11, 2019 at 2:55 AM Luca Toscano <[hidden email]> wrote:

> [...]