I wanted to build a small dataset of sentences and their sentiments in the Albanian language. After trying some newspaper comments, it turned out that most of them were grammatically wrong, and at least 10% were out of context. Since Twitter lets us use the search API for free (thanks for that), this became the target: a C# console application that uses the Twitter API to retrieve user replies to the public tweets of a public account. Here is how I did it, step by step.

Twitter authorization

To use the Twitter API, you should first create an app at https://developer.twitter.com

The app will authenticate all the search requests with bearer token authentication. To get a valid token, here's what you should do:

  • Head over to https://developer.twitter.com/en/apps/
  • Open the app created previously and under "Consumer API keys" there are "API key" and "API secret" values
  • Concatenate these values with a colon character (:) in a single string
  • Encode the concatenated string as Base64, and the result is what is used to get a valid bearer token for further requests.
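The steps above can be sketched in C# roughly like this. It is a minimal sketch: the class and method names are made up for illustration, and the URL-encoding of each value before concatenation follows what Twitter's application-only authentication documentation describes (for typical keys it leaves the value unchanged).

```csharp
using System;
using System.Text;

public static class TwitterCredentials
{
    // Hypothetical helper: builds the value that goes after "Basic "
    // in the Authorization header.
    public static string BuildBasicCredential(string apiKey, string apiSecret)
    {
        // URL-encode each part, then join them with a colon.
        var joined = $"{Uri.EscapeDataString(apiKey)}:{Uri.EscapeDataString(apiSecret)}";

        // Base64-encode the UTF-8 bytes of "key:secret".
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(joined));
    }
}
```

For example, BuildBasicCredential("key", "secret") returns a2V5OnNlY3JldA==, which is the Base64 form of key:secret.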

Now the final step is to issue an x-www-form-urlencoded POST request to https://api.twitter.com/oauth2/token, with the Base64 value in the Authorization header and grant_type=client_credentials as the request body.

Here is the curl request as generated by Postman:

curl -X POST \
  https://api.twitter.com/oauth2/token \
  -H 'Authorization: Basic <BASE64_ENCODED_VALUE>' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'Postman-Token: fa5f5d93-09bf-4c2a-bc7a-52814aebaa28' \
  -H 'cache-control: no-cache' \
  -d grant_type=client_credentials

The response to this call is a JSON object:

{
    "token_type": "bearer",
    "access_token": "-THE ACCESS TOKEN-"
}
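If you would rather request the token from code than from Postman, the same call might look like the sketch below. BearerTokenClient and its method names are hypothetical, the JSON is read with Newtonsoft.Json (the library used later in this post), and error handling is kept to the bare minimum.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class BearerTokenClient
{
    // Hypothetical helper: pulls access_token out of the JSON response shown above.
    public static string ExtractAccessToken(string json)
    {
        var body = JObject.Parse(json);

        var tokenType = (string)body["token_type"];
        if (!string.Equals(tokenType, "bearer", StringComparison.OrdinalIgnoreCase))
            throw new InvalidOperationException($"Unexpected token_type: {tokenType}");

        var token = (string)body["access_token"];
        if (string.IsNullOrEmpty(token))
            throw new InvalidOperationException("The response has no access_token.");

        return token;
    }

    // basicCredential is the Base64 value built from "API key:API secret".
    public static async Task<string> RequestBearerTokenAsync(HttpClient client, string basicCredential)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Post, "https://api.twitter.com/oauth2/token"))
        {
            request.Headers.Authorization = new AuthenticationHeaderValue("Basic", basicCredential);

            // FormUrlEncodedContent sets the x-www-form-urlencoded content type for us.
            request.Content = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["grant_type"] = "client_credentials"
            });

            using (var response = await client.SendAsync(request))
            {
                response.EnsureSuccessStatusCode();
                return ExtractAccessToken(await response.Content.ReadAsStringAsync());
            }
        }
    }
}
```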

Twitter Standard search API

Returns a collection of relevant Tweets matching a specified query.

This API has a lot of parameters and options but in this case, I'm only gonna need a few of them:

  • from:/to: (operators used inside the q parameter): match tweets sent from @user / tweets sent in reply to @user
  • tweet_mode=extended: makes sure that the entire tweet text is included in the response. Note that this moves the message from the text field to the full_text field
  • since_id: returns results with an ID greater than the specified ID
  • max_id: returns results with an ID less than or equal to the specified ID
  • count: by default, the API response has 15 results. This parameter can raise that up to 100
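To make the parameters above concrete, here is a small sketch of how the query string could be assembled. SearchQuery.Build is a hypothetical helper, not part of any Twitter library; it always asks for tweet_mode=extended and defaults count to the maximum of 100.

```csharp
using System;
using System.Collections.Generic;

public static class SearchQuery
{
    // Hypothetical helper: builds the query string for the standard search endpoint.
    // q holds the search operators, for example "from:user" or "to:user".
    public static string Build(string q, long? sinceId = null, long? maxId = null, int count = 100)
    {
        if (count < 1 || count > 100)
            throw new ArgumentOutOfRangeException(nameof(count), "count must be between 1 and 100");

        var parts = new List<string>
        {
            "q=" + Uri.EscapeDataString(q), // ':' becomes %3A
            "tweet_mode=extended",
            "count=" + count
        };

        // The ID filters are optional, so only add them when given.
        if (sinceId.HasValue) parts.Add("since_id=" + sinceId.Value);
        if (maxId.HasValue) parts.Add("max_id=" + maxId.Value);

        return "?" + string.Join("&", parts);
    }
}
```

With this, SearchQuery.Build("from:user") gives ?q=from%3Auser&tweet_mode=extended&count=100, which can be appended to the search endpoint URL.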

Working with the query results

At this point, you should have a public Twitter account (or a few of them) in mind. For this app, it has to be a public account because these are the ones that get the most replies, and the initial idea was explicitly about them.

I'll have 2 initial queries:

  • One that retrieves the tweets by @user1 (the public account)
  • Another one that gets the replies to the results of the previous query

The response for each of those is a JSON object that contains:

  • an array of statuses
  • the search_metadata

The statuses property contains the tweets we need for this query. The search_metadata property, however, has some fields you need to pay attention to, especially next_results. If this field is present, it means that there may be more statuses to retrieve for the same query. This is because the standard search API won't return more than 100 statuses per request, so larger result sets are split across several pages.

Retrieving tweets

Since the API has to be called multiple times, I put the HTTP calls into a separate class:

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    public class SearchService
    {
        private readonly HttpClient _client;

        public SearchService(string bearerToken)
        {
            // Every request goes to the standard search endpoint.
            _client = new HttpClient
            {
                BaseAddress = new Uri("https://api.twitter.com/1.1/search/tweets.json"),
            };
            // Send the bearer token with each request.
            _client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", bearerToken);
        }

        // q is the query string, starting with '?', for example "?q=from%3Auser".
        public async Task<string> GetData(string q)
        {
            return await _client.GetStringAsync(q);
        }
    }

With a hand from https://quicktype.io/ I created the C# POCO classes needed to parse the JSON response into strongly typed objects, so now it's time to write the app.
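For reference, here is a trimmed-down sketch of what such classes might look like, limited to the fields this post talks about. The names TweetSearchResponse, Status, Statuses, SearchMetadata and NextResults are the ones used in the code of this post; Id, FullText and InReplyToStatusId are assumed names for the id, full_text and in_reply_to_status_id fields, and the actual generated classes contain many more properties.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

public class TweetSearchResponse
{
    [JsonProperty("statuses")]
    public List<Status> Statuses { get; set; } = new List<Status>();

    [JsonProperty("search_metadata")]
    public SearchMetadata SearchMetadata { get; set; }
}

public class Status
{
    [JsonProperty("id")]
    public long Id { get; set; }

    // Filled in when the request uses tweet_mode=extended.
    [JsonProperty("full_text")]
    public string FullText { get; set; }

    // Null when the tweet is not a reply.
    [JsonProperty("in_reply_to_status_id")]
    public long? InReplyToStatusId { get; set; }
}

public class SearchMetadata
{
    // Missing from the JSON (and therefore null) on the last page.
    [JsonProperty("next_results")]
    public string NextResults { get; set; }
}
```

A response body can then be parsed with JsonConvert.DeserializeObject<TweetSearchResponse>(json).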

The logic flow looks like this:

  • Get the initial user tweets (that is, q=from:user)
  • Repeat the previous step while next_results is present and has a value
  • Save the parsed data into a DB table
  • Get the replies to the user (q=to:user), using the smallest tweet ID (a 64-bit integer) among the retrieved tweets as since_id
  • Repeat the previous step while next_results is present and has a value
  • Filter out the results whose in_reply_to_status_id does not point to one of the @user tweets we retrieved
  • Save the parsed data into the DB table
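The two non-obvious steps in this flow, picking the smallest ID and filtering the replies, can be sketched in isolation. TweetRef and ReplyFilter are hypothetical names used only for this illustration; in the real app the same logic would run over the parsed status objects.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, minimal stand-in for a parsed status.
public class TweetRef
{
    public long Id { get; set; }
    public long? InReplyToStatusId { get; set; }
}

public static class ReplyFilter
{
    // The smallest ID among the user's tweets, used as since_id for the replies query.
    // Throws InvalidOperationException when the sequence is empty.
    public static long SmallestId(IEnumerable<TweetRef> userTweets)
    {
        return userTweets.Min(t => t.Id);
    }

    // Keeps only the candidates that reply to one of the user's retrieved tweets.
    public static List<TweetRef> KeepRepliesTo(IEnumerable<TweetRef> userTweets, IEnumerable<TweetRef> candidates)
    {
        var knownIds = new HashSet<long>(userTweets.Select(t => t.Id));

        return candidates
            .Where(c => c.InReplyToStatusId.HasValue && knownIds.Contains(c.InReplyToStatusId.Value))
            .ToList();
    }
}
```

The filter is needed because q=to:user also returns tweets that mention the user without replying to any of the tweets we collected.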

Here is a small function that will retrieve the data from a query, and return only statuses:

        // Runs a search query and follows next_results until there are no more pages.
        static List<Status> QueryTweets(string query)
        {
            var userStatusCollection = new List<Status>();

            // Blocking on the task keeps this console sample simple.
            var userStatusQuery = searchService.GetData(query).GetAwaiter().GetResult();
            var userStatusObj = JsonConvert.DeserializeObject<TweetSearchResponse>(userStatusQuery);
            userStatusCollection.AddRange(userStatusObj.Statuses);

            // Request the next page, asking for tweet_mode=extended again.
            var nextResults = userStatusObj.SearchMetadata?.NextResults;
            if (!string.IsNullOrEmpty(nextResults))
                userStatusCollection.AddRange(QueryTweets($"{nextResults}&tweet_mode=extended"));

            return userStatusCollection;
        }
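As an alternative sketch (not the code from the repository), the same paging can be written as an async loop instead of a blocking recursive call. Everything here is an assumption made for illustration: the Pager class, the fetch delegate that stands in for the HTTP call, the use of JObject instead of POCO classes, and the maxPages safety limit.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class Pager
{
    // fetch takes a query string and returns the raw JSON response body.
    public static async Task<List<JToken>> QueryAllPagesAsync(
        Func<string, Task<string>> fetch, string query, int maxPages = 50)
    {
        var all = new List<JToken>();
        var next = query;

        for (var page = 0; page < maxPages && !string.IsNullOrEmpty(next); page++)
        {
            var json = JObject.Parse(await fetch(next));

            // Collect this page's statuses, if any.
            if (json["statuses"] is JArray statuses)
                all.AddRange(statuses);

            // A missing next_results means this was the last page.
            var nextResults = (string)json.SelectToken("search_metadata.next_results");
            next = string.IsNullOrEmpty(nextResults) ? null : nextResults + "&tweet_mode=extended";
        }

        return all;
    }
}
```

Passing the HTTP call in as a delegate makes the loop easy to try out with canned JSON before pointing it at the real API.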

Now I'll call this function twice: first with a query built to retrieve some @user tweets, and then with a query that gets all the replies since the earliest tweet returned for that account.

You can find the full code of this sample at https://github.com/ermirbeqiraj/twitter-data-crawl