Data scientists build information platforms to ask and answer previously unimaginable questions. In this course, you will gain a better understanding of what data scientists do and the problems they solve. You will learn how data science helps companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Through in-class simulations, you will apply data science methods to real-world problems in different industries.

Through lecture and hands-on exercises, you will cover topics including:

  • The growing need for enablers of data science, the role of data scientists, vertical use cases, and business applications
  • Where and how to acquire data, methods for evaluating source data, and data transformation and preparation
  • Types of statistics and analytical methods
  • Machine learning fundamentals and breakthroughs, the importance of algorithms, and data as a platform
  • Implementing and managing recommenders using Apache Mahout, and setting up and evaluating data experiments
  • Steps for deploying to production and tips for working at scale

Course Duration

3 days


Certification

Cloudera Certified Professional: Data Science

What You’ll Learn

  • The role and responsibilities of a data scientist
  • Ways in which data scientists create value for organizations across many industries
  • How to locate and acquire data from diverse sources
  • How to use transformation and normalization techniques to produce accurate, useful data sets
  • How to implement an automated recommendation system
  • How to develop, evaluate, and refine scoring systems for recommenders
  • Considerations involved in working at scale
  • How to identify meaningful, actionable, business-oriented results from an analysis

Who Needs to Attend

Software engineers, data analysts, and statisticians


Prerequisites

  • Basic knowledge of Hadoop, including use of HDFS and awareness of the MapReduce framework, Hadoop Streaming, and Hive
  • Proficiency in a scripting language (Python is strongly preferred, although students familiar with another language, such as Perl or Ruby, should be able to complete the exercises)

Follow-On Courses

There are no follow-ons for this course.

Course Outline

1. Data Science

  • What is Data Science?
  • Growing Need for Data Science
  • Role of a Data Scientist

2. Use Cases

  • Finance
  • Retail
  • Advertising
  • Defense and Intelligence
  • Telecommunications and Utilities
  • Healthcare and Pharmaceuticals

3. Project Life Cycle

  • Steps in the Project Life Cycle

4. Data Acquisition

  • Where to Source Data
  • Acquisition Techniques
  • Evaluating Input Data
  • Data Formats
  • Data Quantity
  • Data Quality

5. Data Transformation

  • Anonymization
  • File Format Conversion
  • Joining Datasets

6. Data Analysis and Statistical Methods

  • Relationship Between Statistics and Probability
  • Descriptive Statistics
  • Inferential Statistics

7. Fundamentals of Machine Learning

  • Three Cs of Machine Learning
  • Spotlight: Naïve Bayes Classifiers
  • Importance of Data and Algorithms

8. Recommender Systems

  • What is a Recommender System?
  • Types of Collaborative Filtering
  • Limitations of Recommender Systems
  • Fundamental Concepts

9. Apache Mahout

  • What Apache Mahout is (and is not)
  • History of Mahout
  • Availability and Installation
  • Demonstration: Using Mahout’s Item-Based Recommender

10. Implementing Recommenders with Apache Mahout

  • Similarity Metrics for Binary Preferences
  • Similarity Metrics for Numeric Preferences
  • Scoring

11. Experimentation and Evaluation

  • Measuring Recommender Effectiveness
  • Designing Effective Experiments
  • Conducting an Effective Experiment
  • User Interfaces for Recommenders

12. Production Deployment and Beyond

  • Deploying to Production
  • Tips and Techniques for Working at Scale
  • Summarizing and Visualizing Results
  • Considerations for Improvement
  • Next Steps for Recommenders

13. Appendix A: Hadoop

14. Appendix B: Mathematical Formulas

15. Appendix C: Language and Tool Reference