# An Introduction to Kaggle

Learning Objectives:
* Students will learn about the [Kaggle website](https://www.kaggle.com/) as a significant Data Science community resource for learning.
* Students will create Kaggle accounts and sample the site's offerings.
* Students will configure their CoCalc accounts for the "kaggle" command and learn the commands necessary to engage in Kaggle competitions.

Before class:
* Read [elitedatascience.com's "The Beginner's Guide to Kaggle"](https://elitedatascience.com/beginner-kaggle)  _Read this for perspective and wise tips on how to best leverage the platform for your preparation for and lifelong learning of Data Science work._
* Read below up to (but not including) the section marked In Class.  **Perform listed tasks and supplemental reading as directed below.**

In class:
* We will work together to make sure everyone's CoCalc account is correctly configured for engaging in Kaggle competitions.

Homework after class:
* Complete the section labeled "Homework" below before the next class, when it will be collected.


# About Kaggle

While [not the only Data Science competition website](https://towardsdatascience.com/top-competitive-data-science-platforms-other-than-kaggle-2995e9dad93c), Kaggle quickly became the best known when it began offering Machine Learning competitions in 2010.  It grew to become a hub of communication, learning, and practice for the Data Science community, which can partly be attributed to the incentives for community recognition offered on the site.  Recognition is given in the form of virtual medals which contribute towards the [Kaggle Progression System](https://www.kaggle.com/progression) ranks of Novice, Contributor, Expert, Master, and Grandmaster in each of four areas: Competitions, Notebooks, Datasets, and Discussion.  Whereas competition medals are earned by good performance in Kaggle competitions, notebook, dataset, and discussion medals are earned through community upvotes.  Further, competitors are often required to share and explain their competition work for others to learn from.  This recognition system thus gives those seeking to distinguish themselves (e.g. to potential employers/clients) an incentive to make helpful contributions to the community in exchange for public recognition and third-party validation of expertise.

## To-Do: Create Your Own Kaggle Account
* Create your own account on [Kaggle.com](https://www.kaggle.com/).  We will be using this site occasionally, and you will find that it is an excellent supplemental resource in and beyond this class.
* After registering your account, you will be ranked as a "Novice".  Advancement to "Contributor" status requires 9 steps of which you should _perform the first 5_:
  * **Add a bio to your profile** (This need not be anything more than a sentence or two.)
  * **Add your location** (e.g. "Gettysburg, Pennsylvania, United States")
  * **Add your occupation** (e.g. "student of Data Science at Gettysburg College")
  * **Add your organization** (e.g. "Gettysburg College")
  * **SMS verify your account** (If you cannot SMS verify with your smartphone, contact Kaggle's Support team: support@kaggle.com)
  * Run 1 script (optional for now)
  * Make 1 competition or task submission (optional for now)
  * Make 1 comment (optional for now)
  * Cast 1 upvote (optional for now)

Many use their Kaggle account not only to improve their Data Science problem solving skills, but to market those skills as well.  You may wish to further your learning and signal your growing expertise through your account going forward.

## Kaggle Competitions

Much activity in Kaggle centers around Kaggle Competitions.  As noted in the preparatory reading, Data Science as practiced on Kaggle differs from real-world Data Science work in a number of ways.  Real-world problems may be easy, novel solutions may not be necessary, and humans often [satisfice](https://en.wikipedia.org/wiki/Satisficing) rather than optimize, i.e. balance efficiency and quality to achieve "good enough" rather than expending inordinate energy to rank \#1 for a Data Science task.  Nonetheless, Kaggle's challenging competitions highlight, for both competitor and observer, the best in Exploratory Data Analysis, feature engineering, model-building algorithms, and insightful interpretation of data.  For those who compete, who explain, and who provide insightful comment, these events offer practitioners ways to learn from the experience.

* **Read the current list of [Active Kaggle Competitions](https://www.kaggle.com/competitions).**

You will note that some competitions are designated as ongoing "Knowledge" competitions.  These provide starting points for beginners.  Take a look at the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) knowledge competition and read the documentation enough to understand what data is provided and what the competition model is expected to predict from such data.

## Kaggle Notebooks

In addition to earning recognition for competition results, Kaggle members also earn recognition by providing high quality Jupyter notebooks that provide both explanation and code to teach others how to approach problems.   A notebook may be recognized for providing great instructive value, even if the competition results achieved by the approach are not among the top competitors.

**To-Do**
* Click on the [Notebooks](https://www.kaggle.com/notebooks) tab in Kaggle.
* In the Public tab in the text field "Search notebooks", type "titanic" and press Enter to search Titanic notebooks, sorted by "Most Votes", i.e. upvotes.
* Create web links at the bottom of this cell to the most upvoted Titanic notebook and to one other Titanic notebook of your choice that you have browsed and think may help your future learning and work.

I will provide an example link to a very insightful Titanic notebook illustrating Exploratory Data Analysis:
* [EDA To Prediction (DieTanic)](https://www.kaggle.com/ash316/eda-to-prediction-dietanic/notebook)

## Kaggle Datasets

Kaggle is also a source of many good public datasets.  Suppose you would like to do a course cluster project at Gettysburg College.  You might search Kaggle's datasets for keywords of interest, and you just might find the ideal dataset for combining your enjoyment of Data Science with application to the topic area of another course.

**To-Do**
* Click on the Data tab in Kaggle.
* Create a link at the bottom of this cell to the most upvoted public dataset.
* Search keywords for a topic of interest to you and create a second link at the bottom of this cell to that dataset.

## Kaggle Discussions

Finally, Kaggle recognizes good contributors to discussion.  This is a community whose members engage with one another, and not merely a platform to broadcast performance, notebooks, and datasets.  A good discussion contributor can aid others as well.  For example, whereas most Kaggle members use Python or R in their notebooks for competitions, [this Titanic discussion entry](https://www.kaggle.com/c/titanic/discussion/28323) shares the author's attempt to use Microsoft Excel to craft a competition entry.  Other discussion contributors [raise moral or ethical questions](https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection/discussion/26684).

**To-Do:** Before class, make sure you use [this link](https://www.kaggle.com/t/0175eb2fe60940b68d3195eb893691c2) to gain access to my Kaggle In-Class test competition.  We will work together in class to make sure each person has success in configuring CoCalc to allow you to engage in Kaggle competitions.  (I would advise that you only take part at this point in Learning competitions with small datasets so as to not overrun your CoCalc account limits.)

# In Class

Together in class, we will:
* Install the ```kaggle``` shell command in your CoCalc environment,
* Configure the ```kaggle``` command to connect to your Kaggle account,
* Download test competition data,
* Compute a test competition submission, and
* Submit an entry to the test competition.

After class, you will download data for a different test competition and go through a prescribed process for submitting to that competition.

In the same folder as this notebook, go to "+ New" and create a new "kaggle.x11" X11 Desktop.  This opens what is known as a Linux command shell window in the upper left, with command prompts ending in "$".  After each prompt, you can type a command; your virtual Linux computer on CoCalc (a Linux Docker container, for those interested) will execute it, print any output, and give you another prompt.  As in Python, IPython, and other interpreter environments, this is what is referred to as a read-eval-print loop.

Enter the following commands:

The first command installs the ```kaggle``` Linux command.  The second checks where the command is installed.  The third creates a directory to hold your Kaggle account key, which identifies you as the user of the command.

Now from a Kaggle page where you are logged in, left-click your user symbol and select "My Account".  Under your Kaggle account, there is an "API" section where you can click a button "Create New API Token".  Download the "kaggle.json" file to your computer and then upload it to this Kaggle folder.  We will then move the file into the correct folder with correct access permissions with the following commands entered in your "kaggle.x11" tab's shell:
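As a rough sketch of what this step accomplishes, the following Python function moves the token into a `.kaggle` folder under a home directory and restricts its permissions to the owner. It mirrors shell steps along the lines of `mkdir`, `mv`, and `chmod 600`. The function name `install_kaggle_token` and its parameters are hypothetical and introduced only for illustration. The sketch assumes the ```kaggle``` command looks for the key at `~/.kaggle/kaggle.json`, its default location.

```python
import shutil
from pathlib import Path


def install_kaggle_token(token_file: Path, home: Path) -> Path:
    """Move kaggle.json into <home>/.kaggle and make it readable only by its owner.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # comparable to: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.move(str(token_file), str(target))       # comparable to: mv kaggle.json ~/.kaggle/
    target.chmod(0o600)                             # comparable to: chmod 600 ~/.kaggle/kaggle.json
    return target
```

A call such as `install_kaggle_token(Path("kaggle.json"), Path.home())`, run from the folder holding the uploaded file, would perform the move. Restricting the permissions matters because the token works like a password for your Kaggle account.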

At this point, your ```kaggle``` shell command should work.  To get the test competition data, enter the commands to download and unzip the competition files:
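A minimal Python sketch of the download-and-unzip step follows. It builds the ```kaggle``` download command as an argument list and extracts a downloaded archive with the standard `zipfile` module. The helper names `download_command` and `unzip_competition_files` are hypothetical. It is also an assumption that the download arrives as a single zip file named after the competition (e.g. `inclass-competition-test.zip`).

```python
import shlex
import zipfile
from pathlib import Path
from typing import List


def download_command(competition: str) -> List[str]:
    """Argument list for downloading a competition's files with the kaggle CLI."""
    return ["kaggle", "competitions", "download", "-c", competition]


def unzip_competition_files(zip_path: Path, dest: Path) -> List[str]:
    """Extract every file in the archive into dest and return the sorted file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


# Show the command as it would be typed at the shell prompt.
print(shlex.join(download_command("inclass-competition-test")))
```

The argument list can be passed to `subprocess.run(..., check=True)` to run the download from Python, after which `unzip_competition_files(Path("inclass-competition-test.zip"), Path("."))` would unpack the files into the current folder under the naming assumption above.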

Now we should have the competition data and be able to work with it as with the initial linear regression demo.  Take a moment to review the competition link and read about the different files you have downloaded and unzipped.

## Using Our Linear Regression to Predict for Kaggle Contest ```test.csv```

In this next code block, we illustrate how to read the test data (which lacks the ```y``` output), use our linear regression to predict the output, add these predictions to our dataframe, and _output CSV_ predictions that we can then submit to our Kaggle competition.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load the comma-separated values training data "train.csv", i.e. data with input(s) and output(s),
# and create a dataframe of the data:
data_path = "./"   # This means "Look in the current directory for the file."
df_train = pd.read_csv(data_path + 'train.csv')

# Let's look at the "head", the initial rows, of the training data dataframe:
print(df_train.head())

# We can see that there's an identifier index "id", but what concerns us for this exercise are
# the inputs "x1", "x2", and "x3" and the output "y".
# As before, we separate these into separate data frames and use these to compute our linear regression model:
X = df_train[['x1', 'x2', 'x3']]
y = df_train[['y']]
linear_regressor = LinearRegression()
linear_regressor.fit(X, y)

# Next we load our test data:
df_test = pd.read_csv(data_path + 'test.csv')

# Print the "head" of the test data, and show that it has all of what's in train.csv except for the output y column:
print(df_test.head())

# Now we can separate out the inputs of our test data and create a new df_test dataframe column 'y'
# that we assign our linear regression predictions to:
X = df_test[['x1','x2','x3']]
df_test['y'] = linear_regressor.predict(X)

# The contest submission format is a CSV file that only has the "id" and "y" columns, so we create a submission
# dataframe "df_submission" with just those columns from df_test, print the first few rows to verify the
# contents of our submission, and then write the output to a new CSV file "submission.csv" in this folder/directory.
df_submission = df_test[['id','y']]
print(df_submission.head())
# Passing the path (rather than an open file object) lets pandas open and close the file for us:
df_submission.to_csv(data_path + 'submission.csv', index=False)

   id        x1        x2        x3           y
0   0  0.013024 -0.376194 -0.156695   31.638628
1   1  0.295720  0.304128 -0.378602 -202.152690
2   2  0.539952 -0.162594 -0.427334 -194.812320
3   3 -0.372273 -0.003075  0.491478  374.543182
4   4 -0.049420  0.365024 -0.331134 -224.375438
     id        x1        x2        x3
0  5000 -0.315663 -0.402388  0.150085
1  5001  0.200608 -0.414422 -0.418825
2  5002  0.148022 -0.508669 -0.144411
3  5003  0.046422 -0.293633  0.257899
4  5004  0.288848  0.310982 -0.446483
     id           y
0  5000  176.376392
1  5001 -254.353860
2  5002  -38.414781
3  5003  277.488843
4  5004 -271.907152


Now switch to your X11 terminal window and submit ```submission.csv``` to the competition with the command:

After ```-c``` is the short competition code name.  After ```-f``` is your submission filename.  After ```-m``` is a message you add to tell your submissions apart.
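The three flags described above can be assembled as in the following sketch, which builds the submission command as an argument list and prints it in shell form. The helper name `submit_command` and the example message are illustrative assumptions. The competition code ```inclass-competition-test``` is the one used for our test competition.

```python
import shlex
from typing import List


def submit_command(competition: str, filename: str, message: str) -> List[str]:
    """Argument list for a kaggle CLI submission: -c competition, -f file, -m message."""
    return [
        "kaggle", "competitions", "submit",
        "-c", competition,   # short competition code name
        "-f", filename,      # CSV file to upload
        "-m", message,       # note that distinguishes this submission from others
    ]


cmd = submit_command("inclass-competition-test", "submission.csv", "First linear regression")
# shlex.join quotes the message so it stays a single shell argument.
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting mistakes when the message contains spaces. The same list can be handed to `subprocess.run(cmd, check=True)` to submit from Python instead of the terminal.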

## Homework

For this homework, your single exercise is to compete in our practice [Linear Regression: Signal and Noise](https://www.kaggle.com/c/inclass-signal-and-noise) Kaggle competition, which you may need to gain access to through [this link](https://www.kaggle.com/t/210dd01c1a734f78a31c89e2ce5d6db5).

To get best results, examine the data input/output correlation and only perform linear regression using correlating inputs.  Go through the same steps as above.  Perform correlation visualization as we did in our first linear regression demo.  Remember that you'll use the contest code ```inclass-signal-and-noise``` rather than our previous contest code ```inclass-competition-test```.
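One possible way to pick out correlating inputs is sketched below on synthetic stand-in data, not the competition data. The helper `correlated_inputs`, the cutoff of `0.3`, and the made-up columns are all assumptions for illustration. The appropriate cutoff and column names for the homework should come from your own examination of the real `train.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def correlated_inputs(df, output="y", ignore=("id",), threshold=0.3):
    """Return input columns whose absolute Pearson correlation with the output meets the threshold."""
    correlations = df.drop(columns=list(ignore), errors="ignore").corr()[output].drop(output)
    return [name for name, r in correlations.items() if abs(r) >= threshold]


# Synthetic stand-in data: x1 and x3 carry signal, x2 is pure noise.
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "id": np.arange(n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
demo["y"] = 3.0 * demo["x1"] - 2.0 * demo["x3"] + rng.normal(size=n)

# Fit the regression using only the inputs that correlate with the output.
selected = correlated_inputs(demo)
model = LinearRegression().fit(demo[selected], demo["y"])
print(selected)
print(model.coef_)
```

The identifier column is excluded before computing correlations because it carries no predictive meaning. When predicting for `test.csv`, the same `selected` column list must be used so the model sees the inputs it was fit on.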

A perfect rating will be achieved by clearly performing all steps and achieving the benchmark contest score.

**Note: To avoid overwriting your previous contest files of the same name, use these two commands:**

**Next, you'll need to reassign ```data_path``` to the value ```'hw/'``` before loading the files.**


When you are done, if you have written your ```submission.csv``` to that same ```hw``` directory, you'll need to modify your ```kaggle``` submission command accordingly:
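A hedged sketch of that modified command, built in Python, follows. The helper name `homework_submit_command` and the default message are hypothetical. It assumes the homework competition expects the same `id` and `y` columns as the test competition, and that the command is run from the folder that contains ```hw```.

```python
from pathlib import Path
from typing import List

import pandas as pd


def homework_submit_command(data_path: str = "hw/",
                            competition: str = "inclass-signal-and-noise",
                            message: str = "Homework submission") -> List[str]:
    """Check the submission file's header, then build the kaggle CLI submit command for it."""
    submission_file = Path(data_path) / "submission.csv"
    # Read only the header row to confirm the expected columns (an assumed format).
    columns = list(pd.read_csv(submission_file, nrows=0).columns)
    if columns != ["id", "y"]:
        raise ValueError(f"Expected columns ['id', 'y'] but found {columns}")
    return ["kaggle", "competitions", "submit",
            "-c", competition, "-f", str(submission_file), "-m", message]
```

Once `hw/submission.csv` exists, `homework_submit_command()` returns a list whose `-f` argument points into the ```hw``` directory. Checking the header first catches a malformed file before it uses up a submission.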