We will use the Python package tweepy to access the Twitter API. If you do not have it installed, please follow these steps:

Install it from PyPI:

easy_install tweepy

Install it from source:

git clone git://github.com/tweepy/tweepy.git
cd tweepy
python setup.py install

Once installed, we begin with importing tweepy:

import tweepy

We import the consumer keys and access tokens used for authentication (OAuth):

api_key = 'g5uPIpw80nULQI1gfklv2zrh4'api_secret = 'cOWvNWxYvPmEZ0ArZVeeVVvJu41QYHdUS2GpqIKtSQ1isd5PJy'access_token = '49722956-TWl8J0aAS6KTdcbz3ppZ7NfqZEmrwmbsb9cYPNELG'access_secret = '3eqrVssF3ppv23qyflyAto8wLEiYRA8sXEPSghuOJWTub

We complete the OAuth process, using the keys and tokens that we imported in step 4:

auth = tweepy.OAuthHandler(api_key, api_secret)auth.set_access_token(access_token, access_secret)

We create the actual interface, using authentication in this step:

api = tweepy.API(auth)my_tweets, other_tweets = [], []

We get 500 unique tweets through the Twitter API. We do not consider retweets as these are not the original authorship. The idea is to compare our own tweets with other tweets on Twitter:

to_get = 500for status in tweepy.Cursor(api.user_timeline, screen_name='@prof_oz').items():  text = status._json['text']  if text[:3] != 'RT ': # we don't want retweets because they didn't author those!    my_tweets.append(text)  else:      other_tweets.append(text)  to_get -= 1  if to_get <=0:      break

We count the number of real tweets and the number of other tweets. Note that all other tweets are not to be considered as impersonated tweets:

In [67]:len(real_tweets), len(other_tweets)

The output can be seen as follows:

Out[67]:(131, 151)

We view the headers of each of the two types of gathered tweets:

real_tweets[0], other_tweets[0]

The output can be seen as follows:

(u'@stanleyyork Definitely check out the Grand Bazaar as well as a tour around the Mosques and surrounding caf\xe9s / sho\u2026 https://t.co/ETREtznTgr',u'RT @SThornewillvE: This weeks @superdatasci podcast has a lot of really interesting talk about #feature engineering, with @Prof_OZ, the auth\u2026')

We put the data in a data frame using pandas, and we also add an extra column, is_mine. The value of the is_mine column is set to True for all tweets that are real tweets; it is set to False for all other tweets:

import pandasdf = pandas.DataFrame({'text': my_tweets+other_tweets, 'is_mine': [True]*len(my_tweets)+[False]*len(other_tweets)})

Viewing the shape that is the dimension of the dataframe, we use the following:

df.shape
(386, 2)Hello

Let's view first few rows of the table:

df.head(2)

The output will look like the following table:

	is_mine	text
0	True	@stanleyyork Definitely check out the Grand Ba...
1	True	12 Exciting Ways You Can Use Voice-Activated T...

Let's view last few rows of the table:

df.tail(2)

The output of the preceding code will give the following table:

	is_mine	text
384	False	RT @Variety: BREAKING: #TheInterview will be s...
385	False	RT @ProfLiew: Let's all congratulate Elizabeth...

We extract a portion of the dataset for validation purposes:

import numpy as np
np.random.seed(10)

remove_n = int(.1 * df.shape[0])  # remove 10% of rows for validation set

drop_indices = np.random.choice(df.index, remove_n, replace=False)
validation_set = df.iloc[drop_indices]
training_set = df.drop(drop_indices)

Table of Contents for
Hands-On Machine Learning for Cybersecurity

AA detection for tweets

Table of Contents for Hands-On Machine Learning for Cybersecurity

Table of Contents for
Hands-On Machine Learning for Cybersecurity