
For this project, you will be analyzing part of the Enron email network. This is a big project and will count as two homework assignments. You will fail if you leave this until a couple of days before it is due, so please, please start early.

Enron Background Reading
Enron Scandal: The Fall of a Wall Street Darling https://www.investopedia.com/updates/enron-scandal-summary/

Tapes reveal Enron’s secret role in California’s power blackouts https://www.theguardian.com/business/2005/feb/05/enron.usnews

The Immortal Life of the Enron E-mails https://www.technologyreview.com/s/515801/the-immortal-life-of-the-enron-e-mails/


I downloaded the entire email dataset, created an adjacency list, filtered it, and ran the community detection algorithm. This is the enronPrepared.gephi file that I have provided. You are welcome to filter it further or run more statistics.
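For a rough sense of what the adjacency list and filtering steps involve, here is a minimal sketch. It is not the script used to produce enronPrepared.gephi: the sample addresses, the `min_weight` threshold, and the function names are all assumptions made for illustration, and community detection is not shown.

```python
from collections import Counter, defaultdict


def build_edge_weights(messages):
    """Count how many messages went from each sender to each recipient."""
    weights = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:  # ignore self-addressed mail
                weights[(sender, recipient)] += 1
    return weights


def to_adjacency_list(weights, min_weight=1):
    """Keep edges seen at least min_weight times; return sender -> {recipient: count}."""
    adjacency = defaultdict(dict)
    for (sender, recipient), count in weights.items():
        if count >= min_weight:
            adjacency[sender][recipient] = count
    return dict(adjacency)


# Hypothetical sample data: (sender, [recipients]) pairs, not real Enron addresses.
sample_messages = [
    ("alice@example.com", ["bob@example.com", "carol@example.com"]),
    ("alice@example.com", ["bob@example.com"]),
    ("bob@example.com", ["alice@example.com"]),
]

weights = build_edge_weights(sample_messages)
# Filtering step: drop pairs that exchanged fewer than two messages.
adjacency = to_adjacency_list(weights, min_weight=2)
print(adjacency)
```

Raising `min_weight` is one simple way to filter further: it removes one-off contacts and leaves only repeated correspondence.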

Your job in this assignment is to understand what’s going on in this network. To do this, you will need to connect information about the structure to an understanding of the content. More specifically, you will have to see what the structure looks like and then go to the emails themselves to learn who people are, what topics they discuss, what their roles are, etc. As with your personal network visualization, this should allow you to figure out (generally) what each of the major clusters represents.

You should also choose at least one other network structure feature and explain it (centrality, tie strength, trust, density, etc.). For example, if you choose degree centrality, find the nodes with the highest degrees. Look at who they are emailing and why. Also keep in mind that this data comes from email; some people may have a higher degree simply because they did not delete many messages. Finally, you may not simply report the numbers. You MUST connect that feature to things you discover in the emails. It could be, for example, that the node with the highest degree is a secretary who sends out announcements or an email list that goes to lots of people. You cannot just say “This node has high centrality” or “this group has high density”. YOU MUST EXPLAIN WHY.
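If you export an edge list and want to rank nodes by degree outside of Gephi, a minimal sketch could look like the following. The edge list and names are hypothetical, and edges are treated as undirected here, which is an assumption you may want to change for directed email data. The output is only a starting list of people to look up in the emails, not an analysis.

```python
from collections import defaultdict


def degree_table(edges):
    """Return {node: number of distinct neighbors}, treating edges as undirected."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # skip self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {node: len(others) for node, others in neighbors.items()}


def top_nodes(degrees, k=3):
    """Return the k highest-degree nodes, breaking ties alphabetically."""
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical edge list; in practice this would come from the exported network.
sample_edges = [
    ("ann", "bob"),
    ("ann", "cy"),
    ("ann", "dee"),
    ("bob", "cy"),
    ("dee", "eve"),
]

degrees = degree_table(sample_edges)
top = top_nodes(degrees, k=2)
print(top)
```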

Examples of things that are poor analysis: “Betty is important because she has a high betweenness centrality.” While Betty may have a high betweenness centrality, it does not necessarily indicate she is important. How do you know she is important? Who does she communicate with? Does she connect important or otherwise disconnected groups of people (e.g. different divisions of the organization)? Betweenness means something – it reflects a person’s role as a gatekeeper or passer of information. Give examples of how Betty has assumed this role.
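One way to start checking whether a high-betweenness node actually sits between groups is to count which communities its contacts belong to. The sketch below is a simple neighbor count, not a betweenness computation; the edge list, the `community_of` mapping, and the community labels are hypothetical stand-ins for what you would export from the network.

```python
from collections import Counter


def neighbor_communities(edges, community_of, node):
    """Count how many of node's distinct neighbors fall in each community."""
    neighbors = set()
    for a, b in edges:
        if a == node and b != node:
            neighbors.add(b)
        elif b == node and a != node:
            neighbors.add(a)
    return Counter(community_of[n] for n in neighbors if n in community_of)


# Hypothetical data: Betty talks to two people in one cluster and one in another.
sample_edges = [
    ("betty", "t1"),
    ("betty", "t2"),
    ("betty", "l1"),
    ("t1", "t2"),
    ("l1", "l2"),
]
community_of = {
    "betty": "trading",
    "t1": "trading",
    "t2": "trading",
    "l1": "legal",
    "l2": "legal",
}

betty_contacts = neighbor_communities(sample_edges, community_of, "betty")
print(dict(betty_contacts))
```

A node whose contacts span several communities is a candidate bridge, but the count alone proves nothing. The emails between Betty and her contact in the other cluster are what show whether she is really passing information between groups.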

Similarly, if you find yourself telling me anything about Ken Lay, you are probably on the wrong track. While he is prominent in Enron, he is not prominent in the network. Your task here is to analyze the network you were given and what is in it, NOT to tell me about Enron the organization and how that appears in the network you have.


Part of the purpose of this assignment is for you to experience what it is like to do real network analysis out in the wild. That means everything is not as nicely packaged as we might want. You will be working with exactly what is available to anyone else using this data.

To get to the content, you will need to read messages from the Enron collection. You can use a searchable database here: http://www.enkive.org/demo. Please follow the instructions on their page to connect and search. Note that we are relying on another service here and, as with doing real network analysis, it may fail! The raw data is browsable online at http://www.enron-mail.com/email/. The names there are folders for different email owners. You can browse through folders and get to actual messages.

You are also welcome to download the raw data at http://www.cs.cmu.edu/~enron/. It’s messy and big, but the files are plain text that you can open in any text editor (TextWrangler is my personal favorite and you can use it to search for a term in the entire folder of emails, not just in a specific file).
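If you would rather script the folder-wide search than use a text editor, a minimal sketch might look like this. The function name and the example folder path are assumptions; the code reads each file as text and replaces undecodable bytes instead of failing, since the raw files are messy.

```python
import os


def search_emails(root_dir, term):
    """Return sorted paths of files under root_dir whose text contains term (case-insensitive)."""
    term = term.lower()
    matches = []
    for folder, _subfolders, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                # errors="replace" keeps odd bytes from stopping the search
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    if term in handle.read().lower():
                        matches.append(path)
            except OSError:
                continue  # skip files that cannot be opened
    return sorted(matches)


# Example call (hypothetical path to wherever you unpacked the download):
# for path in search_emails("path/to/enron/emails", "california"):
#     print(path)
```

Reading whole files into memory is fine for individual emails; for very large files you would want to scan line by line instead.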

Also, feel free to explore other information about the Enron crisis online. If you’re up for it, watch the excellent documentary “Enron: The Smartest Guys in the Room”. (Last I checked, it’s available on Netflix.) Your goal is to really understand this organization so you can analyze its network.

HOWEVER, remember you are analyzing the network I have given you, NOT the organization as a whole. This is a sample of the network, and the data is weird. It’s email. Some people keep everything and other people delete things. Some people may be poorly represented in the sample for many reasons. Thus, people who are important to the organization will not necessarily be the most prominent people in the network you have. Knowing the background of Enron will help you understand who people are and what events are taking place, but you should not simply transfer that knowledge to try to understand the network. The network might look very different.


Your paper should be analytical, not a personal reflection. I do not want to hear about what you thought about the project, what worried you, what you found difficult, or how you made decisions. The paper should factually describe the analytical steps you took, how you discovered what the people or clusters represent, and what evidence you have to support that. If you are unsure about the structure, review the writing in the textbook or assigned papers. These discuss results, not the author’s feelings. Make sure your paper follows those guidelines.

Your paper should follow this structure:

Section 1: Introduction – one paragraph of background on the Enron crisis and email archive, a description of what the network represents, and a description of what you are analyzing.

Section 2: Cluster Analysis – describe what each of the major communities represents. Explain how you found that out and support your analysis with actual emails.

Section 3: Additional structural feature – what additional feature did you look at, how did you identify that in the network, what are your major findings, and what do those represent in the Enron structure / dataset?

Section 4: Conclusion – Is this data set useful for understanding the Enron crisis? Were you able to find meaningful insights?

The paper should be at least 1000 words, and you probably shouldn’t need more than 2000.


  • Section 1: 10%
  • Section 2: 25%
  • Section 3: 25%
  • Section 4: 10%
  • Quality of writing (grammar, structure, analytical focus): 30%



