UPDATE: This is clearly going to be a major hassle, so we’ll spend the extra time coding a program that will sanitize the data before it goes into Splunk.
Last week Google’s Summer of Code program started, and my student Dániel Bali is ready to get busy combing through our massive logs to see what sorts of information he can mine from them. We only have one minor problem: our logs contain the IP addresses of our users, and some requests contain the user name of the person making the request. Removing this private information from the logs before Dániel sees them is quite a pain to do well.

I would like to propose that we:

1. Consider Dániel part of our core team for the summer and allow him to see IP addresses and all the requests in full.
2. Have Dániel sign a short statement stating that he will not divulge any private information.
3. Fail him in his GSoC project if he does divulge any private information.

If this is not acceptable to you, please speak up soon. I would like to make this happen early next week so Dániel can continue his GSoC work.

UPDATE: The final output of Dániel’s work will not contain any private information. If we end up using any private data as input, we will sanitize it and remove private information before we publish the output.
11 thoughts on “Summer of Code log analysis project: May we share our data with our GSoC student?”
+1 from me.
I assume it would go without saying, but this presumably also covers keeping private those reports that contain private information, after said reports exist.
I’m indifferent, but let my indifference move towards support (mainly because I don’t believe IP addresses are or should qualify as personally identifiable information).
Also, if the community moves in favor of not allowing him access to IP address, may I suggest uniquely hashing (SHAxxx or MD5 or some form of UUID) each address before turning them over. This can be quickly and easily done with some clever scripting and allows there to still be a unique identifier in the logs for each IP address and still provide anonymity.
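As a rough illustration of the hashing idea, here is a minimal Python sketch. It is not anything MusicBrainz actually ran. It uses a keyed hash (HMAC-SHA256) rather than plain MD5 or SHA, because the IPv4 address space is small enough that unkeyed hashes can be reversed by simply hashing every possible address. The key value, the `ip-` token prefix, the function names, and the sample log line are all assumptions made up for the example, and the pattern only covers dotted-quad IPv4 addresses.

```python
import hashlib
import hmac
import re

# Hypothetical secret key: would be generated once and kept away from
# anyone who receives the sanitized logs.
SECRET_KEY = b"replace-with-a-long-random-secret"

# Matches dotted-quad IPv4 addresses only (octets are not range-checked,
# and IPv6 is not handled).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(ip, key=SECRET_KEY):
    """Map an IP address to a stable token using HMAC-SHA256."""
    digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
    # Truncated to 16 hex characters to keep log lines readable.
    return "ip-" + digest[:16]


def sanitize_line(line, key=SECRET_KEY):
    """Replace every IPv4 address in a log line with its token."""
    return IPV4_RE.sub(lambda m: pseudonymize_ip(m.group(0), key), line)


if __name__ == "__main__":
    # Made-up log line in a common access-log style.
    print(sanitize_line('192.0.2.1 - - "GET /some/path HTTP/1.1" 200'))
```

The same address always maps to the same token, so per-visitor statistics still work on the sanitized logs, while recovering the address requires the key.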
I take back what I said about Splunk; I misunderstood how that process works. Splunk has no contact with the data whatsoever.
Permission granted! (Not that I see what interesting things one could find in the logs…)
It would sit better with me if you made a blanket declaration that any GSoC student working with MB was considered part of MB, but would be required to sign an NDA as you mention above, if they needed access to private data in their work.
I like Hawke’s last proposal. It seems the cleanest option.
I kind of agree with Hawke. I’m not against my own info being used, but others might be. While I don’t see an IP address alone as being really “personal information”, it is “private information”, as it is known only by MB, and not made public.
I don’t know exactly what types of data he might have access to, but I’d assume it could include POSTs sent during editing… and it would be relatively easy to cross-reference their datestamps against the (public info) data dumps, which do include edit data. Now you have IP + editor nick, and it’s no longer anonymous.
For Dániel’s analysis to be fully public and confirmable, such that his results can be checked by others, presumably the dataset will also eventually need to be made public? (Else the results are not checkable.) If so, then doesn’t the data need to be sanitized anyhow? In which case, the pain is there at some point; it seems to me a lot better to have it sanitized by someone who is a full-time core team member, rather than someone who is only temporarily a “core team” member due to a technicality in how he is defined (no offense to Dániel intended).
(It also seems odd that a GSoC student – essentially an “apprentice” in any other field – could be considered a “core team” member, even if only for a short term).
If the “IP addresses and some requests contain the user names of the person making the request” are not important for the log analysis project, why don’t you blank/delete/randomize these parts? I’m sure the members of the core team all have the skills to use ‘sed’ or ‘grep’ for such a task?
Or am I being simplistic?
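For what it is worth, the blanking approach could look roughly like the Python sketch below, which is the equivalent of a pair of ‘sed’ substitutions. The actual log format is not shown anywhere in this post, so the `username=` query parameter, the sample line, and the function name are purely assumptions for illustration. A real sanitizer would need a pattern for every place a user name can appear, and that is where most of the difficulty lies.

```python
import re

# Matches dotted-quad IPv4 addresses only (no IPv6, no octet range check).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical: assumes user names show up as a `username=` query parameter.
USERNAME_RE = re.compile(r"\b(username=)[^&\s\"]+")


def redact_line(line):
    """Blank out IPv4 addresses and `username=` values in one log line."""
    line = IPV4_RE.sub("0.0.0.0", line)
    return USERNAME_RE.sub(r"\g<1>REDACTED", line)


if __name__ == "__main__":
    # Made-up log line in a common access-log style.
    sample = '203.0.113.7 - - "GET /search?username=alice&type=artist HTTP/1.1" 200'
    print(redact_line(sample))
```

Unlike hashing, blanking every address to the same value removes the ability to count distinct visitors, so the choice depends on what the analysis needs.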
I’m sorry to say I’ll echo everyone else on this. I personally couldn’t give a damn whether anyone sees *my* IP/edits/whatever.
And MusicBrainz is better than that.