UPDATE: This is clearly going to be a major hassle, so we’ll spend the extra time coding a program that will sanitize the data before it goes into Splunk.
Google’s Summer of Code program started last week, and my student Dániel Bali is ready to get busy combing through our massive logs to see what sorts of information he can mine from them.
We have only one minor problem: our logs contain the IP addresses of our users, and some requests contain the user name of the person making the request. Removing this private information from the logs before Dániel sees them is quite a pain to do well.
I would like to propose that we:
- Consider Dániel part of our core team for the summer and allow him to see IP addresses and all the requests in full.
- Have Dániel sign a short statement stating that he will not divulge any private information.
- Fail him in his GSoC project if he does divulge any private information.
If this is not acceptable to you, please speak up soon. I would like to make this happen early next week so Dániel can continue his GSoC work.
UPDATE: The final output of Dániel’s work will not contain any private information. If we end up using any private data as input, we will sanitize it and remove private information before we publish the output.
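For anyone curious what that sanitizing step might look like, here is a rough sketch in Python. It is not the program we will actually run; the log format, the `user=`/`username=` query parameter names, the salt value, and the function names `pseudonym` and `sanitize_line` are all assumptions made up for illustration. The idea is to replace each IPv4 address and user name with a salted hash, so the same user still maps to the same token and the statistics stay meaningful.

```python
import hashlib
import re

# Assumption: IPv4 addresses appear in plain dotted form somewhere in the line.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumption: user names show up as a "user=" or "username=" query parameter.
USER_RE = re.compile(r"(\buser(?:name)?=)([^&\s]+)")


def pseudonym(value, salt):
    """Return a short, stable token for a private value (salted SHA-256)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def sanitize_line(line, salt="change-me"):
    """Replace IPv4 addresses and user names in one log line with tokens."""
    line = IPV4_RE.sub(lambda m: "ip-" + pseudonym(m.group(0), salt), line)
    line = USER_RE.sub(
        lambda m: m.group(1) + "user-" + pseudonym(m.group(2), salt), line
    )
    return line


if __name__ == "__main__":
    sample = '192.168.1.10 - - "GET /ws/1/user?username=alice&x=1"'
    print(sanitize_line(sample))
```

The salt would need to stay secret and be discarded after the run, since a short hash of an IPv4 address is otherwise easy to reverse by brute force. IPv6 addresses and user names embedded elsewhere in a request are not handled here, which is part of why doing this well is a hassle.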