We will use the Python package tweepy to access the Twitter API. If you do not have it installed, please follow these steps:
- Install it from PyPI:
easy_install tweepy
- Install it from source:
git clone git://github.com/tweepy/tweepy.git
cd tweepy
python setup.py install
- Once installed, we begin with importing tweepy:
import tweepy
- We import the consumer keys and access tokens used for authentication (OAuth):
api_key = 'g5uPIpw80nULQI1gfklv2zrh4'
api_secret = 'cOWvNWxYvPmEZ0ArZVeeVVvJu41QYHdUS2GpqIKtSQ1isd5PJy'
access_token = '49722956-TWl8J0aAS6KTdcbz3ppZ7NfqZEmrwmbsb9cYPNELG'
access_secret = '3eqrVssF3ppv23qyflyAto8wLEiYRA8sXEPSghuOJWTub'
- We complete the OAuth process, using the keys and tokens that we defined in the previous step:
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)
- We create the actual interface, using authentication in this step:
api = tweepy.API(auth)
my_tweets, other_tweets = [], []
- We fetch up to 500 tweets from the timeline through the Twitter API and separate out the retweets, since the account owner did not author them. The idea is to compare our own tweets with other tweets on Twitter:
to_get = 500
for status in tweepy.Cursor(api.user_timeline, screen_name='@prof_oz').items():
    text = status._json['text']
    if text[:3] != 'RT ':  # we don't want retweets because they didn't author those!
        my_tweets.append(text)
    else:
        other_tweets.append(text)
    to_get -= 1
    if to_get <= 0:
        break
- We count the number of real tweets and the number of other tweets. Note that not all of the other tweets should be considered impersonated tweets:
In [67]: len(my_tweets), len(other_tweets)
The output can be seen as follows:
Out[67]:(131, 151)
- We view the first tweet in each of the two lists of gathered tweets:
my_tweets[0], other_tweets[0]
The output can be seen as follows:
(u'@stanleyyork Definitely check out the Grand Bazaar as well as a tour around the Mosques and surrounding caf\xe9s / sho\u2026 https://t.co/ETREtznTgr',u'RT @SThornewillvE: This weeks @superdatasci podcast has a lot of really interesting talk about #feature engineering, with @Prof_OZ, the auth\u2026')
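The retweet filter used in the loop above can be tried out without calling the Twitter API. The following is a minimal sketch, assuming the tweet texts are already available as plain strings; the function name `split_retweets`, its `limit` parameter, and the sample texts are hypothetical and introduced only for illustration (they are not part of tweepy).

```python
def split_retweets(texts, limit=500):
    """Partition tweet texts into (own, retweets), reading at most `limit` texts."""
    own, retweets = [], []
    for count, text in enumerate(texts, start=1):
        # A retweet's text begins with the literal prefix 'RT ' (note the space).
        if text.startswith('RT '):
            retweets.append(text)
        else:
            own.append(text)
        if count >= limit:
            break
    return own, retweets


# Made-up sample texts, not real tweets.
sample = [
    'Feature engineering matters more than model choice.',
    'RT @someone: a retweeted status',
    'RTFM is still good advice.',  # starts with "RT" but not "RT ", so it counts as own
]
own, retweets = split_retweets(sample)
print(len(own), len(retweets))
```

Checking for `'RT '` with the trailing space, as the original loop does with `text[:3]`, avoids misclassifying an original tweet that merely starts with the letters "RT".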
We put the data in a data frame using pandas, and we also add an extra column, is_mine. The value of the is_mine column is set to True for all tweets that are real tweets; it is set to False for all other tweets:
import pandas
df = pandas.DataFrame({'text': my_tweets+other_tweets, 'is_mine': [True]*len(my_tweets)+[False]*len(other_tweets)})
To view the shape (that is, the dimensions) of the data frame, we use the following:
df.shape
(386, 2)
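The construction of the labeled data frame can be checked on a small scale before running it on real data. The following is a minimal sketch, assuming pandas is installed; the two lists hold made-up placeholder texts standing in for the tweets gathered from the API.

```python
import pandas

# Toy stand-ins for the lists gathered from the API (assumed, not real data).
my_tweets = ['an original tweet', 'another original tweet']
other_tweets = ['RT @someone: a retweet']

df = pandas.DataFrame({
    'text': my_tweets + other_tweets,
    # True for every entry of my_tweets, False for every entry of other_tweets.
    'is_mine': [True] * len(my_tweets) + [False] * len(other_tweets),
})

# One row per tweet and two columns (text and is_mine).
print(df.shape)
# Number of tweets labeled as our own.
print(int(df['is_mine'].sum()))
```

The number of rows should equal `len(my_tweets) + len(other_tweets)`, which is a quick way to confirm that no tweet was dropped or duplicated when the two lists were combined.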
Let's view the first few rows of the table:
df.head(2)
The output will look like the following table:
Let's view the last few rows of the table:
df.tail(2)
The output of the preceding code will give the following table:
|     | is_mine | text |
| --- | --- | --- |
| 384 | False | RT @Variety: BREAKING: #TheInterview will be s... |
| 385 | False | RT @ProfLiew: Let's all congratulate Elizabeth... |