Chapter 15. Conclusion

Congratulations! You’ve reached the end of the book. When you first began, you likely knew little Python and you hadn’t used programming to investigate data.

Your experience now should be quite different. You’ve gained knowledge and experience finding and cleaning data. You’ve honed your skills by focusing your questions and determining what you can and cannot answer given a particular dataset. You can write simple regexes and complex web scrapers. You have learned how to store and deploy your code and connect with databases. You can scale your data and processes in the cloud and manage your data wrangling via automation.

The fun doesn’t have to end here, however! There is plenty more to learn and do in your career as a data wrangler. You can take the skills and tools you have learned here and continue to push your knowledge, and in turn the boundaries of the field of data wrangling. We encourage you to advance your quest for excellence and keep asking difficult questions of your data, processes, and methods.

Duties of a Data Wrangler

As we’ve established throughout this book and our investigations, the data out there and the conclusions you can reach as a data wrangler are vast. But along with those opportunities come responsibilities.

There are no data wrangling police; however, you have learned some ethics throughout our book. You’ve learned to be a conscientious web scraper. You’ve learned to pick up the phone and ask for more information. You’ve learned to explain and document your process when you present your findings. You’ve learned how to ask hard questions about difficult topics, particularly when the data sources may have other motivations.

As you pursue learning and growing as a data wrangler, your ethical sense will grow and help guide and challenge you in your work and processes. In a way, you are now an investigative journalist. The conclusions you reach and the questions you ask can and will make a difference in your field. With that knowledge, you have the burden of duty.

Your duties include:

  • Using your knowledge, skills, and ability for just and good causes

  • Helping contribute to the knowledge of others around you

  • Giving back to the community that helped you

  • Challenging opposition to the ethics you have learned so far and continue to develop

We encourage you to step up and meet these challenges through your career as a data wrangler. Do you like working with others and teaching? Become a mentor! Do you enjoy a particular open source package? Become a code or documentation contributor! Have you been researching an important social or health issue? Contribute your findings to the academic or social community! Have you experienced difficulties from a particular community or source? Share your story with the world.

Beyond Data Wrangling

Your skills have developed over the course of this book, but you still have much to learn. Depending on your skillset and interests, there are quite a few areas for further exploration.

Become a Better Data Analyst

This book offered an introduction to statistical and data analysis. If you want to truly hone your statistical and analytical skills, you’ll want to spend more time reading about the science behind the methods as well as learning some of the more intensive Python packages, which give you more power and flexibility when analyzing your datasets.

To learn more advanced statistics, study regression models and the math behind data analysis; both are essential topics. If you haven’t taken a statistics course, edX has a great archived course from the University of California, Berkeley. If you’d like to explore with a book, Think Stats by Allen Downey (O’Reilly) is a great introduction to statistical math concepts and also uses Python. Cathy O’Neil and Rachel Schutt’s Doing Data Science (also from O’Reilly) provides a deeper analysis of the field of data science.

If you’re interested in learning the scipy stack and more about how Python can help you perform more advanced math and statistics, you’re in luck. One of the main contributors to pandas, Wes McKinney, has written a book that covers pandas in depth (Python for Data Analysis; O’Reilly). The pandas documentation is also a great place to start learning. You played around a bit in Chapter 7 with numpy. If you are interested in learning some of the numpy internals, check out the SciPy introduction to the basics.

Become a Better Developer

If you really want to hone your Python skills, Luciano Ramalho’s Fluent Python (O’Reilly) discusses more in-depth design patterns and idiomatic Python thinking. We also highly recommend taking a look through recent videos of Python events around the world and investigating topics that interest you.

If this book is your first introduction to programming, you may want to take an introduction to computer science course. If you want a self-study option, Coursera offers one from Stanford University. If you’d like an online textbook covering some of the theory behind computer science, we recommend Structure and Interpretation of Computer Programs, by Harold Abelson and Gerald Jay Sussman (MIT Press).

If you’re interested in learning more development principles through building and working with others, we recommend finding a local meetup group and getting involved. Many such groups host local and remote hackathons, so you can work on code alongside others and learn by doing.

Become a Better Visual Storyteller

If you were particularly interested in the visual storytelling parts of this book, there are many ways to further your knowledge of that field. If you want to continue with the libraries we’ve used, we highly recommend going through the Bokeh tutorials and experimenting with your Jupyter notebooks.

Learning JavaScript and some of the popular visualization libraries from the JS community will help you become a better visual storyteller. Square offers an introductory course on D3, a popular JavaScript visualization library.

Finally, if you want to study some of the theories and ideas behind visual storytelling from a data analysis standpoint, we recommend Edward Tufte’s Visual Display of Quantitative Information (Graphics Press).

Become a Better Systems Architect

If learning how to scale, deploy, and manage systems was particularly interesting to you, we have barely scratched the surface in terms of the opportunities within the systems sphere.

If you’re interested in learning some more Unix, the University of Surrey has a short introduction covering some good concepts. The Linux Documentation Project also has a short introduction to bash programming.

We highly recommend taking time to learn Ansible, a scalable and flexible server and systems management solution. If you’re more interested in scaling data solutions, Udacity offers an Intro to Hadoop and MapReduce course. You should also check out Stanford’s introduction to Apache Spark and the PySpark programming guide.

Where Do You Go from Here?

So, where do you go now? You have a wealth of new skills, and you have the ability to question both your own assumptions and the data you find. You also have a working knowledge of Python and numerous useful libraries at your fingertips.

If you don’t yet have a passion for a particular field or dataset, you’ll want to discover ways to continue your progress and advancement as a data wrangler with new fields of study. There are many great data analysts out there writing inspirational stories. Here are a few:

  • FiveThirtyEight, once a blog started by Nate Silver for The New York Times, is now a site with numerous writers and analysts investigating a variety of topics. After the Ferguson grand jury decision to not indict Darren Wilson, FiveThirtyEight published an article showing the outcome was an outlier. With controversial topics, being able to show a data trend or tendency can help take some of the emotions out of the story and reveal what the data is actually saying.

  • A study of income gaps by The Washington Post used tax and census data to conclude the “ol’ boy network” was still alive in terms of job acquisition and initial salaries, but that the effect usually flattened, or showed no correlation at all, after those initial jobs were acquired.

  • We’ve studied some of the impacts of groups in Africa that use child labor, including for mining conflict minerals. A recent report by Amnesty International and Global Witness found most American firms are not adequately checking their supply pipelines to ensure their products do not use conflict minerals.

There are millions of untold stories in the world. If you have a passion or a belief, it’s likely your insights and data wrangling skills can help people and communities. If you don’t have a passion yet, we encourage you to keep learning by keeping up with data analysis in the news, documentaries, and online.

No matter where your interests lie, there is a wide world of possibilities available to deepen your learning and grasp of the concepts introduced in this book. Whatever sparked your interest the most is a great path for future learning. We hope this book is just a taste of what you’ll be doing throughout your career as a data wrangler.