Need help with assignments?

Our qualified writers can create original, plagiarism-free papers in any format you choose (APA, MLA, Harvard, Chicago, etc.)

Order from us for quality, customized work in due time of your choice.

Click Here To Order Now

Assignment: Halloween Candy
Name:
Overview
In this assignment, your job will be to clean and wrangle data from a survey of Halloween candy to prepare it for a machine learning project.
Once you have completed all the tasks in the notebook, save your notebook as halloween_candy, and verify that all the tests pass in CodeGrade.
For those of you that are more comfortable with using Pandas, I have also provided extra optional data analysis questions at the end of the assignment for more practice.
Data Set
The data set that we will be using is the 2017 Halloween Candy Hierarchy data set as discussed in this boingboing article. You can also read more about the data in the Science Creative Quarterly.
The following are the rating instructions from the survey:
Basically, consider that feeling you get when you receive this item in your Halloween haul. Does it make you really happy (JOY)? Or is it something that you automatically place in the junk pile (DESPAIR)? MEH for indifference, and you can leave blank if you have no idea what the item is.Note that the original data set has been slightly altered from its original state, and if you wanted to perform any analysis for future projects, you would need to download the data directly from the links above.
This data is a great example of a messy data set, especially since they allowed respondents to enter text for a number of the fields. Also, note that some of the comments in the file might be considered inappropriate to some readers but cleaning this type of data is normal in a lot of data science projects.
Note
Show Work
Remember that you must show your work. Students submissions are spot checked manually throughout the term to verify that they are not hard coding the answer from looking only in the file or in CodeGrade’s expected output. If this is seen, the student’s answer will be manually marked wrong and their grade will be changed to reflect this.
For example, if the answer to Q1, the mean of a specific column, is 22:
# correct way
Q1 = df['column_name'].mean()
Q1 = 22
Our End Goal
Our end goal for this project is to clean the data so that we could then create a machine learning model. We want to see if we are able to predict a person’s gender based purely on their candy preferences. Although, you will not be creating a model for this assignment, only cleaning the data. The results of the models that I used after cleaning the data are provided at the end of this notebook.
Initial Import & Exploration
In [18]:
# initial importsimport pandas as pdimport numpy as np# Do not change this option; This allows the CodeGrade auto grading to function correctlypd.set_option(‘display.max_columns’, 20)Let’s start by importing our data and creating a DataFrame called candy. We need to include encoding=’iso-8859-1′ during the import because there are special characters in the data that Pandas doesn’t recognize. This happens a lot when attempting to import data where the public is able to input answers, especially if there are foreign language characters included. The normal encoding for Pandas is utf-8, so changing the encoding allows Pandas to recognize those special characters.
Run the following code, with the encoding argument, and it should import correctly.
In [19]:
# read_csv with iso-8859-1 encoding; using latin-1 would also work herecandy_full = pd.read_csv(‘candy.csv’, encoding=‘iso-8859-1’)# copy to new DF so that we can have a copy of the original import if neededcandy = candy_full.copy()Let’s take a brief look at the data by using head().
In [20]:
# first five rowscandy.head()Out[20]:
Internal IDQ1: GOING OUT?Q2: GENDERQ3: AGEQ4: COUNTRYQ5: STATE, PROVINCE, COUNTY, ETCQ6 | 100 Grand BarQ6 | Anonymous brown globs that come in black and orange wrapperst(a.k.a. Mary Janes)Q6 | Any full-sized candy barQ6 | Black Jacks…Q8: DESPAIR OTHERQ9: OTHER COMMENTSQ10: DRESSUnnamed: 113Q11: DAYQ12: MEDIA [Daily Dish]Q12: MEDIA [Science]Q12: MEDIA [ESPN]Q12: MEDIA [Yahoo]Click Coordinates (x, y)
090258773NaNNaNNaNNaNNaNNaNNaNNaNNaN…NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
190272821NoMale44USANMMEHDESPAIRJOYMEH…NaNBottom line is Twix is really the only candy w…White and goldNaNSundayNaN1.0NaNNaN(84, 25)
290272829NaNMale49USAVirginiaNaNNaNNaNNaN…NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
390272840NoMale40usorMEHDESPAIRJOYMEH…NaNRaisins can go to hellWhite and goldNaNSundayNaN1.0NaNNaN(75, 23)
490272841NoMale23usaexton paJOYDESPAIRJOYDESPAIR…NaNNaNWhite and goldNaNFridayNaN1.0NaNNaN(70, 10)
5 rows × 120 columns
Next, run the following code to see information about the DataFrame.
In [21]:
# check info about the DataFramecandy.info()
RangeIndex: 2479 entries, 0 to 2478
Columns: 120 entries, Internal ID to Click Coordinates (x, y)
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB
Notice that this did not print the columns as you might be used to seeing. According to the Pandas documentation: “If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.”
We can make the columns display by setting the max_cols argument equal to the number of columns in the data set.
In [22]:
# check info, set max_colscandy.info(max_cols=120)
RangeIndex: 2479 entries, 0 to 2478
Data columns (total 120 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 Internal ID 2479 non-null int64
1 Q1: GOING OUT? 2368 non-null object
2 Q2: GENDER 2437 non-null object
3 Q3: AGE 2394 non-null object
4 Q4: COUNTRY 2414 non-null object
5 Q5: STATE, PROVINCE, COUNTY, ETC 2377 non-null object
6 Q6 | 100 Grand Bar 1728 non-null object
7 Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes) 1741 non-null object
8 Q6 | Any full-sized candy bar 1803 non-null object
9 Q6 | Black Jacks 1517 non-null object
10 Q6 | Bonkers (the candy) 1482 non-null object
11 Q6 | Bonkers (the board game) 1469 non-null object
12 Q6 | Bottle Caps 1709 non-null object
13 Q6 | Box’o’Raisins 1787 non-null object
14 Q6 | Broken glow stick 1767 non-null object
15 Q6 | Butterfinger 1793 non-null object
16 Q6 | Cadbury Creme Eggs 1792 non-null object
17 Q6 | Candy Corn 1796 non-null object
18 Q6 | Candy that is clearly just the stuff given out for free at restaurants 1784 non-null object
19 Q6 | Caramellos 1723 non-null object
20 Q6 | Cash, or other forms of legal tender 1795 non-null object
21 Q6 | Chardonnay 1732 non-null object
22 Q6 | Chick-o-Sticks (we donÕt know what that is) 1528 non-null object
23 Q6 | Chiclets 1764 non-null object
24 Q6 | Coffee Crisp 1621 non-null object
25 Q6 | Creepy Religious comics/Chick Tracts 1771 non-null object
26 Q6 | Dental paraphenalia 1783 non-null object
27 Q6 | Dots 1746 non-null object
28 Q6 | Dove Bars 1773 non-null object
29 Q6 | Fuzzy Peaches 1652 non-null object
30 Q6 | Generic Brand Acetaminophen 1744 non-null object
31 Q6 | Glow sticks 1778 non-null object
32 Q6 | Goo Goo Clusters 1595 non-null object
33 Q6 | Good N’ Plenty 1741 non-null object
34 Q6 | Gum from baseball cards 1759 non-null object
35 Q6 | Gummy Bears straight up 1778 non-null object
36 Q6 | Hard Candy 1780 non-null object
37 Q6 | Healthy Fruit 1781 non-null object
38 Q6 | Heath Bar 1763 non-null object
39 Q6 | Hershey’s Dark Chocolate 1802 non-null object
40 Q6 | HersheyÕs Milk Chocolate 1803 non-null object
41 Q6 | Hershey’s Kisses 1797 non-null object
42 Q6 | Hugs (actual physical hugs) 1762 non-null object
43 Q6 | Jolly Rancher (bad flavor) 1781 non-null object
44 Q6 | Jolly Ranchers (good flavor) 1780 non-null object
45 Q6 | JoyJoy (Mit Iodine!) 1449 non-null object
46 Q6 | Junior Mints 1777 non-null object
47 Q6 | Senior Mints 1532 non-null object
48 Q6 | Kale smoothie 1731 non-null object
49 Q6 | Kinder Happy Hippo 1528 non-null object
50 Q6 | Kit Kat 1801 non-null object
51 Q6 | LaffyTaffy 1739 non-null object
52 Q6 | LemonHeads 1745 non-null object
53 Q6 | Licorice (not black) 1789 non-null object
54 Q6 | Licorice (yes black) 1790 non-null object
55 Q6 | Lindt Truffle 1757 non-null object
56 Q6 | Lollipops 1784 non-null object
57 Q6 | Mars 1750 non-null object
58 Q6 | Maynards 1450 non-null object
59 Q6 | Mike and Ike 1746 non-null object
60 Q6 | Milk Duds 1782 non-null object
61 Q6 | Milky Way 1787 non-null object
62 Q6 | Regular M&Ms 1800 non-null object
63 Q6 | Peanut M&MÕs 1804 non-null object
64 Q6 | Blue M&M’s 1748 non-null object
65 Q6 | Red M&M’s 1746 non-null object
66 Q6 | Green Party M&M’s 1711 non-null object
67 Q6 | Independent M&M’s 1662 non-null object
68 Q6 | Abstained from M&M’ing. 1532 non-null object
69 Q6 | Minibags of chips 1751 non-null object
70 Q6 | Mint Kisses 1699 non-null object
71 Q6 | Mint Juleps 1664 non-null object
72 Q6 | Mr. Goodbar 1735 non-null object
73 Q6 | Necco Wafers 1731 non-null object
74 Q6 | Nerds 1752 non-null object
75 Q6 | Nestle Crunch 1777 non-null object
76 Q6 | Now’n’Laters 1657 non-null object
77 Q6 | Peeps 1765 non-null object
78 Q6 | Pencils 1766 non-null object
79 Q6 | Pixy Stix 1753 non-null object
80 Q6 | Real Housewives of Orange County Season 9 Blue-Ray 1722 non-null object
81 Q6 | ReeseÕs Peanut Butter Cups 1796 non-null object
82 Q6 | Reese’s Pieces 1784 non-null object
83 Q6 | Reggie Jackson Bar 1460 non-null object
84 Q6 | Rolos 1761 non-null object
85 Q6 | Sandwich-sized bags filled with BooBerry Crunch 1699 non-null object
86 Q6 | Skittles 1769 non-null object
87 Q6 | Smarties (American) 1750 non-null object
88 Q6 | Smarties (Commonwealth) 1573 non-null object
89 Q6 | Snickers 1785 non-null object
90 Q6 | Sourpatch Kids (i.e. abominations of nature) 1737 non-null object
91 Q6 | Spotted Dick 1593 non-null object
92 Q6 | Starburst 1782 non-null object
93 Q6 | Sweet Tarts 1767 non-null object
94 Q6 | Swedish Fish 1760 non-null object
95 Q6 | Sweetums (a friend to diabetes) 1472 non-null object
96 Q6 | Take 5 1557 non-null object
97 Q6 | Tic Tacs 1761 non-null object
98 Q6 | Those odd marshmallow circus peanut things 1739 non-null object
99 Q6 | Three Musketeers 1767 non-null object
100 Q6 | Tolberone something or other 1769 non-null object
101 Q6 | Trail Mix 1767 non-null object
102 Q6 | Twix 1785 non-null object
103 Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein 1683 non-null object
104 Q6 | Vicodin 1686 non-null object
105 Q6 | Whatchamacallit Bars 1652 non-null object
106 Q6 | White Bread 1718 non-null object
107 Q6 | Whole Wheat anything 1728 non-null object
108 Q6 | York Peppermint Patties 1770 non-null object
109 Q7: JOY OTHER 917 non-null object
110 Q8: DESPAIR OTHER 723 non-null object
111 Q9: OTHER COMMENTS 389 non-null object
112 Q10: DRESS 1730 non-null object
113 Unnamed: 113 9 non-null object
114 Q11: DAY 1750 non-null object
115 Q12: MEDIA [Daily Dish] 85 non-null float64
116 Q12: MEDIA [Science] 1376 non-null float64
117 Q12: MEDIA [ESPN] 99 non-null float64
118 Q12: MEDIA [Yahoo] 67 non-null float64
119 Click Coordinates (x, y) 1619 non-null object
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB
Of course, if you are just looking for the column names, you can just use a simple for loop.
In [23]:
# print a list of column namesfor col in candy.columns: print(col)Internal ID
Q1: GOING OUT?
Q2: GENDER
Q3: AGE
Q4: COUNTRY
Q5: STATE, PROVINCE, COUNTY, ETC
Q6 | 100 Grand Bar
Q6 | Anonymous brown globs that come in black and orange wrappers (a.k.a. Mary Janes)
Q6 | Any full-sized candy bar
Q6 | Black Jacks
Q6 | Bonkers (the candy)
Q6 | Bonkers (the board game)
Q6 | Bottle Caps
Q6 | Box’o’Raisins
Q6 | Broken glow stick
Q6 | Butterfinger
Q6 | Cadbury Creme Eggs
Q6 | Candy Corn
Q6 | Candy that is clearly just the stuff given out for free at restaurants
Q6 | Caramellos
Q6 | Cash, or other forms of legal tender
Q6 | Chardonnay
Q6 | Chick-o-Sticks (we donÕt know what that is)
Q6 | Chiclets
Q6 | Coffee Crisp
Q6 | Creepy Religious comics/Chick Tracts
Q6 | Dental paraphenalia
Q6 | Dots
Q6 | Dove Bars
Q6 | Fuzzy Peaches
Q6 | Generic Brand Acetaminophen
Q6 | Glow sticks
Q6 | Goo Goo Clusters
Q6 | Good N’ Plenty
Q6 | Gum from baseball cards
Q6 | Gummy Bears straight up
Q6 | Hard Candy
Q6 | Healthy Fruit
Q6 | Heath Bar
Q6 | Hershey’s Dark Chocolate
Q6 | HersheyÕs Milk Chocolate
Q6 | Hershey’s Kisses
Q6 | Hugs (actual physical hugs)
Q6 | Jolly Rancher (bad flavor)
Q6 | Jolly Ranchers (good flavor)
Q6 | JoyJoy (Mit Iodine!)
Q6 | Junior Mints
Q6 | Senior Mints
Q6 | Kale smoothie
Q6 | Kinder Happy Hippo
Q6 | Kit Kat
Q6 | LaffyTaffy
Q6 | LemonHeads
Q6 | Licorice (not black)
Q6 | Licorice (yes black)
Q6 | Lindt Truffle
Q6 | Lollipops
Q6 | Mars
Q6 | Maynards
Q6 | Mike and Ike
Q6 | Milk Duds
Q6 | Milky Way
Q6 | Regular M&Ms
Q6 | Peanut M&MÕs
Q6 | Blue M&M’s
Q6 | Red M&M’s
Q6 | Green Party M&M’s
Q6 | Independent M&M’s
Q6 | Abstained from M&M’ing.
Q6 | Minibags of chips
Q6 | Mint Kisses
Q6 | Mint Juleps
Q6 | Mr. Goodbar
Q6 | Necco Wafers
Q6 | Nerds
Q6 | Nestle Crunch
Q6 | Now’n’Laters
Q6 | Peeps
Q6 | Pencils
Q6 | Pixy Stix
Q6 | Real Housewives of Orange County Season 9 Blue-Ray
Q6 | ReeseÕs Peanut Butter Cups
Q6 | Reese’s Pieces
Q6 | Reggie Jackson Bar
Q6 | Rolos
Q6 | Sandwich-sized bags filled with BooBerry Crunch
Q6 | Skittles
Q6 | Smarties (American)
Q6 | Smarties (Commonwealth)
Q6 | Snickers
Q6 | Sourpatch Kids (i.e. abominations of nature)
Q6 | Spotted Dick
Q6 | Starburst
Q6 | Sweet Tarts
Q6 | Swedish Fish
Q6 | Sweetums (a friend to diabetes)
Q6 | Take 5
Q6 | Tic Tacs
Q6 | Those odd marshmallow circus peanut things
Q6 | Three Musketeers
Q6 | Tolberone something or other
Q6 | Trail Mix
Q6 | Twix
Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein
Q6 | Vicodin
Q6 | Whatchamacallit Bars
Q6 | White Bread
Q6 | Whole Wheat anything
Q6 | York Peppermint Patties
Q7: JOY OTHER
Q8: DESPAIR OTHER
Q9: OTHER COMMENTS
Q10: DRESS
Unnamed: 113
Q11: DAY
Q12: MEDIA [Daily Dish]
Q12: MEDIA [Science]
Q12: MEDIA [ESPN]
Q12: MEDIA [Yahoo]
Click Coordinates (x, y)
This data set is pretty messy. Your goal is now to perform the following actions to get it to the point where it can be passed to a machine learning model.
Note: Unless the instructions ask you to do something different, please always update the original candy DataFrame for the exercises below. The automatic grading in CodeGrade will check your final DataFrame and ensure that you have performed all required data manipulations. Also, feel free to add additional cells as needed.
Data Cleaning
Exercise1: Taking a look at the column names, you may notice that some include the character Õ. This should instead be an apostrophe ‘ mark. Rename the column names that include the Õ character and replace it with an apostrophe.
Remember that you should be updating the candy DataFrame for the tasks listed as unless told differently.
In [24]:
candy_columns=candy.replace(‘Õ’, “‘”, inplace=True)candy_columnsQ1: How many duplicated rows are there in the file? Assume that a duplicate is any row that is exactly the same as another one. Save this number as Q1.
In [25]:
Q1= candy.duplicated(keep=False).sum()Q1Out[25]:
34Q2: How many duplicated rows are there in the file if we were to assume that a duplicate is any row with the same Internal ID number as another. In other words, even if the other values are different, a row would count as a duplicate if it had the same Internal ID as another. Save this number as Q2.
In [26]:
Q2= candy[‘Internal ID’].duplicated(keep=False).sum()Q2Out[26]:
38Exercise2: Drop any duplicates from the candy DataFrame. Duplicates are to be defined as any row with the same Internal ID as another. Use the default setting that keeps the first record from the duplicates.
In [27]:
candy= candy.drop_duplicates(subset=‘Internal ID’)candyOut[27]:
Internal IDQ1: GOING OUT?Q2: GENDERQ3: AGEQ4: COUNTRYQ5: STATE, PROVINCE, COUNTY, ETCQ6 | 100 Grand BarQ6 | Anonymous brown globs that come in black and orange wrapperst(a.k.a. Mary Janes)Q6 | Any full-sized candy barQ6 | Black Jacks…Q8: DESPAIR OTHERQ9: OTHER COMMENTSQ10: DRESSUnnamed: 113Q11: DAYQ12: MEDIA [Daily Dish]Q12: MEDIA [Science]Q12: MEDIA [ESPN]Q12: MEDIA [Yahoo]Click Coordinates (x, y)
090258773NaNNaNNaNNaNNaNNaNNaNNaNNaN…NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
190272821NoMale44USANMMEHDESPAIRJOYMEH…NaNBottom line is Twix is really the only candy w…White and goldNaNSundayNaN1.0NaNNaN(84, 25)
290272829NaNMale49USAVirginiaNaNNaNNaNNaN…NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
390272840NoMale40usorMEHDESPAIRJOYMEH…NaNRaisins can go to hellWhite and goldNaNSundayNaN1.0NaNNaN(75, 23)
490272841NoMale23usaexton paJOYDESPAIRJOYDESPAIR…NaNNaNWhite and goldNaNFridayNaN1.0NaNNaN(70, 10)
…………………………………………………………
247490314359NoMale24USAMDJOYDESPAIRMEHDESPAIR…Fruit Stripe GumNaNWhite and goldNaNFridayNaNNaNNaNNaNNaN
247590314580NoFemale33USANew YorkMEHDESPAIRJOYNaN…CapersNaNBlue and blackNaNFridayNaN1.0NaNNaN(70, 26)
247690314634NoFemale26USATennesseeMEHDESPAIRJOYDESPAIR…NaNNaNBlue and blackNaNFridayNaN1.0NaNNaN(67, 35)
247790314658NoMale58UsaNorth CarolinaNaNNaNNaNNaN…NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
247890314802NoFemale66usaPennsylvaniaDESPAIRDESPAIRJOYDESPAIR…NaNYou hit all my chocolate highlights, and broug…White and goldNaNSunday1.0NaNNaNNaN(19, 26)
2460 rows × 120 columns
Exercise3: Your next task is to remove the following columns from the candy DataFrame as we will not use these columns for this project. You are welcome to do further analysis on these columns but do not save your analysis in this notebook.
Remove the following columns: Internal ID, Q5: STATE, PROVINCE, COUNTY, ETC, Q7: JOY OTHER, Q8: DESPAIR OTHER, Q9: OTHER COMMENTS, Unnamed: 113, Click Coordinates (x, y).
In [28]:
candy_columns_remove=[ ‘Internal ID’, ‘Q5: STATE, PROVINCE, COUNTY, ETC’, ‘Q7: JOY OTHER’, ‘Q8: DESPAIR OTHER’, ‘Q9: OTHER COMMENTS’, ‘Unnamed: 113’, ‘Click Coordinates (x, y)’]candy.drop(columns=candy_columns_remove,inplace=True)Code Check: As a check for the above exercises, the shape of your data should now be: (2460, 113)
In [30]:
candy.shapeOut[30]:
(2460, 113)Exercise4: Let’s now take a look at the Q2: GENDER column since this will be what we are trying to predict. Take a look at the value counts for this column.
In [31]:
candy=candy[‘Q2: GENDER’].value_counts()candyOut[31]:
Q2: GENDER
Male 1466
Female 839
I’d rather not say 83
Other 30
Name: count, dtype: int64Q3: How many missing values are in the Q2: GENDER column? Save this as Q3.
In [37]:
Q3 = candy[‘Q2: GENDER’].isnull().sum()Q3—————————————————————————
KeyError Traceback (most recent call last)
File ~anaconda3Libsite-packagespandascoreindexesbase.py:3653, in Index.get_loc(self, key)
3652 try:
-> 3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
File ~anaconda3Libsite-packagespandas_libsindex.pyx:147, in pandas._libs.index.IndexEngine.get_loc()
File ~anaconda3Libsite-packagespandas_libsindex.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File pandas_libshashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas_libshashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ‘Q2: GENDER’
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[37], line 1
—-> 1 Q3 = candy[‘Q2: GENDER’].isnull().sum()
2 Q3
File ~anaconda3Libsite-packagespandascoreseries.py:1007, in Series.__getitem__(self, key)
1004 return self._values[key]
1006 elif key_is_scalar:
-> 1007 return self._get_value(key)
1009

Need help with assignments?

Our qualified writers can create original, plagiarism-free papers in any format you choose (APA, MLA, Harvard, Chicago, etc.)

Order from us for quality, customized work in due time of your choice.

Click Here To Order Now