A04: Enron emails
This assignment helps you practice with “joins” using an email dataset. Only “inner joins” will be needed.
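As a reminder of what an inner join does, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The tables `message` and `recipient`, their columns, and the sample rows are hypothetical and exist only for illustration; the course database will have its own schema. An inner join keeps only the rows that have a match on both sides.

```python
import sqlite3

# Hypothetical toy tables, not the course schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (message_id INTEGER PRIMARY KEY, subject TEXT);
    CREATE TABLE recipient (message_id INTEGER, email TEXT);
    INSERT INTO message VALUES (1, 'Lunch'), (2, 'Draft never sent');
    INSERT INTO recipient VALUES
        (1, 'a@example.com'),
        (1, 'b@example.com'),
        (3, 'c@example.com');
""")

# INNER JOIN pairs each message with its recipients and drops unmatched rows:
# message 2 has no recipient, and the recipient row for message 3 has no message.
rows = conn.execute("""
    SELECT m.subject, r.email
    FROM message AS m
    INNER JOIN recipient AS r ON r.message_id = m.message_id
    ORDER BY r.email
""").fetchall()

print(rows)
```

Only the two rows for message 1 come back, because it is the only message id present in both tables.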
Background and disclaimer
Enron Corporation, once an energy, commodities, and services company, is infamous for having engaged in corporate fraud and corruption that resulted in its bankruptcy in late 2001. Enron employed 20,000 people and claimed revenues of $111 billion at the end of 2000 (source). The scandal was so disastrous that it prompted a federal law, the Sarbanes-Oxley Act of 2002, which sets accounting standards for U.S. public companies.
During the investigation of Enron’s accounting practices by the Federal Energy Regulatory Commission (FERC), over 600,000 emails generated by 158 employees were collected. When the investigation ended, FERC deemed the emails to be in the public domain, so they may be used for historical research and academic purposes. The email database has been available on the web for more than a decade from Carnegie Mellon University. Over the years it has been reviewed, and various emails have been removed in response to privacy concerns.
Carnegie Mellon’s site has an important disclaimer:
I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of “real” email that is public. The reason other datasets are not public is because of privacy concerns. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation.)
Why are we using this dataset?
I believe coursework should be as realistic as possible in order to model the environment that students will face outside of school. In our case, that means using real datasets and answering meaningful questions about those data. Sometimes, the data that we are asked to analyze in the real world are confidential and possibly even embarrassing to certain individuals. The Enron email dataset includes confidential emails that were never intended to be seen by the general public. During the course of this assignment, I expect all students to exhibit maturity and respect for the individuals represented in these emails.
- Show the message date and subject for all emails sent by one of Enron’s CEOs, Jeffrey Skilling (email firstname.lastname@example.org); order by date (earliest first).
- Show the email address and recipient type (0/1/2 for to/cc/bcc respectively) for recipients of the message with subject ‘FERC Issues’ (message id 187528); do not apply any particular sort to the results.
- Show the date, subject, and sender email of the 10 most recent emails in Kenneth Lay’s folder ‘lay-k/inbox’ (Lay was the founder of Enron); sort by date (recent first). Use just one query. Do not look up the folder id ahead of time; use the folder name in the query. Name the sender email column “sender” in the results.
- Show the folder, date, subject, and sender email of all emails that have a subject containing the word “bankruptcy”. Limit results to messages sent between November 1, 2001 and the end of the year (include Nov 1), and sent from addresses that end with ‘@enron.com’. Order by date, newest last.
- Show all distinct email addresses of people who received a message with the subject “FW: UPDATE - Reporting to Work Next Week”; these users may be To or CC or BCC recipients. Sort by email address (a first, z last).
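The tasks above combine a few recurring pieces: an inner join, a pattern match with LIKE, a date range, a column alias, a sort, and a row limit. The sketch below shows those pieces together on a toy in-memory SQLite database. Every table name, column name, folder name, and row here is hypothetical, the search word is deliberately unrelated to the tasks, and dates are assumed to be stored as ISO 8601 text; check the course schema for the real names and types.

```python
import sqlite3

# Hypothetical toy schema and data, not the course database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE folder (folder_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE message (
        message_id INTEGER PRIMARY KEY,
        folder_id INTEGER,
        sent_at TEXT,        -- ISO 8601 text, so string order matches date order
        sender_email TEXT,
        subject TEXT
    );
    INSERT INTO folder VALUES (1, 'doe-j/inbox'), (2, 'doe-j/sent');
    INSERT INTO message VALUES
        (10, 1, '2001-10-31 09:00:00', 'a@enron.com',     'Budget review'),
        (11, 1, '2001-11-01 00:00:00', 'b@enron.com',     'Budget update'),
        (12, 1, '2001-12-31 23:59:59', 'c@elsewhere.com', 'Budget final'),
        (13, 2, '2001-12-15 12:00:00', 'a@enron.com',     'RE: budget questions'),
        (14, 1, '2002-01-01 00:00:00', 'a@enron.com',     'Budget 2002');
""")

rows = conn.execute("""
    SELECT f.name AS folder, m.sent_at, m.subject, m.sender_email AS sender
    FROM message AS m
    INNER JOIN folder AS f ON f.folder_id = m.folder_id
    WHERE m.subject LIKE '%budget%'          -- subject contains the word
      AND m.sent_at >= '2001-11-01'          -- inclusive lower bound
      AND m.sent_at <  '2002-01-01'          -- exclusive upper bound
      AND m.sender_email LIKE '%@enron.com'  -- address ends with the domain
    ORDER BY m.sent_at DESC                  -- most recent first
    LIMIT 10
""").fetchall()

for row in rows:
    print(row)
```

Using a half-open range (`>=` the first day, `<` the day after the last) avoids having to guess the final timestamp of the period. In SQLite, LIKE ignores case for ASCII letters by default; other database systems may behave differently, so it is worth checking how yours compares strings.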