I was trying to build a small dataset of sentences and sentiments in the Albanian language. After I tried some newspaper comments, it turned out that most of them were grammatically wrong, and at least 20% were out of context. Since Twitter lets us use the search API for free (thanks for that), this became the goal: a C# console application that uses the Twitter API to retrieve user replies to the public tweets of a public account. Here is how I did it, step by step.
Twitter authorization
To use the Twitter API, you should first create an app at https://developer.twitter.com
The app will authenticate all the search requests with bearer token authentication. To get a valid token, here's what you should do:
- Head over to https://developer.twitter.com/en/apps/
- Open the app created previously; under "Consumer API keys" there are "API key" and "API secret" values
- Concatenate these values with a colon character (:) in a single string
- Encode the concatenated string as Base64. The result is what is used to get a valid bearer token for further requests.
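The two steps above can be sketched in C# as follows. This is a minimal sketch; the class and method names (TwitterCredentials, ToBasicCredential) are hypothetical and only illustrate the concatenate-then-encode idea.

```csharp
using System;
using System.Text;

public static class TwitterCredentials
{
    // Joins the API key and API secret with a colon and Base64-encodes the result.
    // The returned value is what goes after "Basic " in the Authorization header.
    // (Hypothetical helper name, used here for illustration only.)
    public static string ToBasicCredential(string apiKey, string apiSecret)
    {
        var raw = $"{apiKey}:{apiSecret}";
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(raw));
    }
}
```

The output of ToBasicCredential is the <BASE64_ENCODED_VALUE> placeholder used in the token request.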
Now the final step is to issue an x-www-form-urlencoded POST request to https://api.twitter.com/oauth2/token with the Authorization header and grant_type=client_credentials as the request body.
Here is the curl request as generated by Postman:
curl -X POST \
  https://api.twitter.com/oauth2/token \
  -H 'Authorization: Basic <BASE64_ENCODED_VALUE>' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'Postman-Token: fa5f5d93-09bf-4c2a-bc7a-52814aebaa28' \
  -H 'cache-control: no-cache' \
  -d grant_type=client_credentials
The response to this call is a JSON object:
{
  "token_type": "bearer",
  "access_token": "-THE ACCESS TOKEN-"
}
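The same token request can also be issued from C# instead of Postman. The following is a rough sketch, not the code from the sample project: BearerTokenClient, RequestTokenAsync and ParseAccessToken are hypothetical names, and System.Text.Json is used here only to keep the example free of extra packages.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

public static class BearerTokenClient
{
    // Pulls the "access_token" value out of the JSON body returned by oauth2/token.
    public static string ParseAccessToken(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.GetProperty("access_token").GetString();
    }

    // Sends the form-encoded POST with Basic authorization and returns the bearer token.
    // basicCredential is the Base64-encoded "API key:API secret" string.
    public static async Task<string> RequestTokenAsync(HttpClient client, string basicCredential)
    {
        using var request = new HttpRequestMessage(HttpMethod.Post, "https://api.twitter.com/oauth2/token");
        request.Headers.Authorization = new AuthenticationHeaderValue("Basic", basicCredential);
        request.Content = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["grant_type"] = "client_credentials"
        });

        using var response = await client.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return ParseAccessToken(await response.Content.ReadAsStringAsync());
    }
}
```

Keeping the parsing in its own method makes it possible to check it against a saved response without calling the API.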
Twitter Standard search API
Returns a collection of relevant Tweets matching a specified query.
This API has a lot of parameters and options but in this case, I'm only gonna need a few of them:
- from/to: Indicates a tweet is from @user / a reply to @user
- tweet_mode=extended: This will make sure that the entire tweet text is included in the response. Note that this changes the response field that holds the user message from text to full_text
- since_id: Returns results with an ID greater than the given parameter
- max_id: Returns results with an ID less than or equal to the specified ID
- count: By default, the API response has 15 results. This parameter can raise it up to 100
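To show how these parameters fit together, here is a small sketch that assembles the query string. SearchQuery.Build is a hypothetical helper, and the defaults it picks (extended mode, 100 results) are my assumptions for this use case rather than API defaults.

```csharp
using System;
using System.Collections.Generic;

public static class SearchQuery
{
    // Builds the query string for search/tweets.json, including the leading '?'.
    // q is the raw search expression, for example "from:user1" or "to:user1".
    public static string Build(string q, long? sinceId = null, long? maxId = null, int count = 100)
    {
        var parts = new List<string>
        {
            "q=" + Uri.EscapeDataString(q),   // escapes ':' and other reserved characters
            "tweet_mode=extended",            // ask for full_text instead of truncated text
            "count=" + count
        };
        if (sinceId.HasValue) parts.Add("since_id=" + sinceId.Value);
        if (maxId.HasValue) parts.Add("max_id=" + maxId.Value);
        return "?" + string.Join("&", parts);
    }
}
```

For example, SearchQuery.Build("to:user1", sinceId: 42) asks for replies to @user1 with IDs greater than 42.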
Working with the query results
At this point, you should have a public Twitter account (or a few of them) in mind. For this app, it has to be a public account because those are the ones that get more replies, and the initial idea was explicitly about them.
I'll have 2 initial queries:
- One that retrieves the tweets by @user1 (the public account)
- Another one that gets the replies to the results of the previous query
The response for each of those is a JSON object that contains:
- an array of statuses
- the search_metadata
Now the statuses property contains the results we need for this query. However, the search_metadata property has some fields you need to pay attention to, especially the next_results field. If this field is present, it means that there may be more statuses to retrieve for our initial query. This is because the standard search API won't return more than 100 statuses per request.
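A trimmed-down model of that response could look like the sketch below. The JSON field names (statuses, search_metadata, next_results, id, full_text, in_reply_to_status_id) come from the API response; the property selection is my assumption of the minimum this app needs, and System.Text.Json attributes are used here only to keep the example self-contained.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

public class TweetSearchResponse
{
    [JsonPropertyName("statuses")]
    public List<Status> Statuses { get; set; } = new List<Status>();

    [JsonPropertyName("search_metadata")]
    public SearchMetadata SearchMetadata { get; set; } = new SearchMetadata();
}

public class Status
{
    [JsonPropertyName("id")]
    public long Id { get; set; }

    // Populated when the request uses tweet_mode=extended.
    [JsonPropertyName("full_text")]
    public string FullText { get; set; }

    // Null when the tweet is not a reply.
    [JsonPropertyName("in_reply_to_status_id")]
    public long? InReplyToStatusId { get; set; }
}

public class SearchMetadata
{
    // Stays null when there are no more pages.
    [JsonPropertyName("next_results")]
    public string NextResults { get; set; }
}

public static class SearchResponseParser
{
    public static TweetSearchResponse Parse(string json) =>
        JsonSerializer.Deserialize<TweetSearchResponse>(json);
}
```

Checking SearchMetadata.NextResults for null or empty is then enough to decide whether another request is needed.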
Retrieving tweets
Since the API has to be called multiple times, I put the HTTP calls into a separate class:
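using System;
using System.Net.Http;
using System.Threading.Tasks;

public class SearchService
{
    private readonly HttpClient _client;

    public SearchService(string bearerToken)
    {
        // Every request goes to the standard search endpoint; only the query string changes.
        _client = new HttpClient
        {
            BaseAddress = new Uri("https://api.twitter.com/1.1/search/tweets.json"),
        };
        _client.DefaultRequestHeaders.Clear();
        _client.DefaultRequestHeaders.Add("Authorization", $"Bearer {bearerToken}");
    }

    // q is the query string including the leading '?', e.g. the value of next_results.
    public async Task<string> GetData(string q)
    {
        var result = await _client.GetStringAsync(q);
        return result;
    }
}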
With a hand from https://quicktype.io/ I created the C# POCO classes needed to parse the JSON response into strongly typed objects, so now it's time to write the app.
The logic flow will be like this:
- Get initial user tweets (that is q=from:user)
- Repeat the previous step while next_results is present and has a value
- Save the parsed data into a DB table.
- Get user replies since the smallest ID (a 64-bit integer) among the retrieved tweets
- Repeat the previous step while next_results is present and has a value
- Filter out results whose in_reply_to_status_id does not point to one of the @user tweets we are interested in
- Save the parsed data into the DB table.
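The "smallest ID" and "filter out" steps can be sketched like this. ReplyFilter and its method names are hypothetical; the helper is generic so it does not depend on the exact shape of the generated POCO classes, and the caller supplies a function that reads the parent tweet ID from a reply.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplyFilter
{
    // The value to pass as since_id when searching for replies:
    // a reply is always newer than the tweet it answers.
    public static long SmallestId(IEnumerable<long> tweetIds) => tweetIds.Min();

    // Keeps only the candidates whose parent tweet ID is one of the account's tweets.
    // parentIdOf is assumed to return null for tweets that are not replies.
    public static List<T> KeepRepliesTo<T>(
        IEnumerable<T> candidates,
        IEnumerable<long> userTweetIds,
        Func<T, long?> parentIdOf)
    {
        var wanted = new HashSet<long>(userTweetIds);
        return candidates
            .Where(c => parentIdOf(c) is long parentId && wanted.Contains(parentId))
            .ToList();
    }
}
```

The filter is needed because a to:user search also returns mentions and replies to tweets outside the retrieved set.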
Here is a small function that will retrieve the data from a query, and return only statuses:
static List<Status> QueryTweets(string query)
{
    var userStatusCollection = new List<Status>();

    // Blocking on .Result keeps this console sample simple.
    var userStatusQuery = searchService.GetData(query).Result;
    var userStatusObj = JsonConvert.DeserializeObject<TweetSearchResponse>(userStatusQuery);
    userStatusCollection.AddRange(userStatusObj.Statuses);

    // No next_results means this was the last page.
    if (string.IsNullOrEmpty(userStatusObj.SearchMetadata.NextResults))
        return userStatusCollection;

    // Otherwise follow next_results recursively, asking for extended mode again.
    userStatusCollection.AddRange(QueryTweets($"{userStatusObj.SearchMetadata.NextResults}&tweet_mode=extended"));
    return userStatusCollection;
}
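If you would rather avoid recursion and blocking on .Result, the same paging idea can be written as an async loop. This is an alternative sketch, not the code from the sample project: Pager, QueryAllAsync, fetchPage and the maxPages cap are all hypothetical, and fetchPage stands in for a call such as SearchService.GetData.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;

public static class Pager
{
    // Follows next_results page by page and collects the raw status elements.
    // maxPages is a safety cap so a misbehaving response cannot loop forever.
    public static async Task<List<JsonElement>> QueryAllAsync(
        string query, Func<string, Task<string>> fetchPage, int maxPages = 50)
    {
        var statuses = new List<JsonElement>();
        var next = query;

        for (var page = 0; page < maxPages && !string.IsNullOrEmpty(next); page++)
        {
            using var doc = JsonDocument.Parse(await fetchPage(next));
            var root = doc.RootElement;

            foreach (var status in root.GetProperty("statuses").EnumerateArray())
                statuses.Add(status.Clone()); // Clone so the element outlives the document

            // Continue only while search_metadata carries a next_results string.
            next = root.TryGetProperty("search_metadata", out var meta)
                   && meta.TryGetProperty("next_results", out var nr)
                   && nr.ValueKind == JsonValueKind.String
                ? nr.GetString() + "&tweet_mode=extended"
                : null;
        }

        return statuses;
    }
}
```

Passing the fetch step in as a delegate also makes it possible to try the loop against canned JSON before spending any API requests.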
Now I'll call this function first with a query built to retrieve some @user tweets, and then to get all the replies since the earliest tweet returned for that account.
You can find the full code of this sample at https://github.com/ermirbeqiraj/twitter-data-crawl