This looks like a bug in the API. The response wrapper object has a field `has_more` that indicates whether there are more records beyond the page you just fetched. For unknown reasons `has_more` returns false when it shouldn't; in your case it returns false on page 313.
I have created a Stack Snippet to demonstrate the problem (I'm sorry, but I'm not an R developer). In the JavaScript snippet you can set `useHasMoreOnly` to true, and when you then run the script it will fetch 313 pages. If you set `useHasMoreOnly` to false, the code only checks whether any items were returned, so we happily fetch page 314, and it turns out that works fine.
So the work-around should be that the library does not use `has_more` but instead:

- looks at `total` in the wrapper and compares it to the current page times `page_size` to determine whether it needs to fetch another page. Requesting `total` is an expensive operation and should be avoided when possible. The `total` isn't returned by default, so if you want to use it, enable it in your filter first.
- looks at the `items[]` array in the wrapper. If that comes back empty (zero elements, or in JavaScript syntax: `items.length === 0`), you have reached past the last page and can stop fetching more pages.
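The two checks above can be sketched as follows. The wrapper fields `total`, `page_size`, and `items` come from the API; the helper names are mine:

```javascript
// Sketch of the two stop checks; `json` is the parsed response wrapper
// and `page` is the page that was just fetched.

// Check 1: compare pages fetched so far against `total`
// (only present when your filter includes it).
function morePagesByTotal(json, page, pageSize) {
  return page * pageSize < json.total;
}

// Check 2: a present-but-empty `items` array means we ran past the last page.
function morePagesByItems(json) {
  return Array.isArray(json.items) && json.items.length > 0;
}
```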
Unfortunately I can't offer the exact changes that are needed in the library you use but I hope that the code and explanation give enough guidance to implement the workaround.
There is reason to believe this bug is related to, if not the same as, "After successfully retrieving 180 pages, the API gracelessly, semi-silently, fails".
// set to false to fetch beyond 313 pages
var useHasMoreOnly = false;
var key = 'RBeb2Cm7UIYNbN4lwegbaQ((';
var parms = {
  site: 'stackoverflow',
  tagged: 'r',
  num_pages: 400, // 1000000,
  pagesize: 100,
  filter: "!UHY-aKsFJ(KvceZ5uauvQDp9b_ZwAQaEY0KwVy4Czncd97-22tonZWvDXfhmP(X*Baz8J0uC0Q"
};

stack_questions(parms)
  .then((df_questions_r) => {
    document.getElementsByTagName('body')[0].textContent = `fetched ${df_questions_r.length} posts`; // JSON.stringify(df_questions_r);
  })
  .catch(console.log);
// library magic
function stack_questions(opt) {
  var url = 'https://api.stackexchange.com/2.3/questions';
  var items = [];
  return new Promise((resolve, reject) => {
    function get(url, key, opt, page) {
      var num_pages = opt.num_pages || 1;
      var localOpt = Object.assign({ key: key, page: page }, opt);
      // build the query string, leaving out our private num_pages option
      var optList = Object
        .keys(localOpt)
        .filter((k) => k !== 'num_pages')
        .reduce(function (p, c) { return `${p}&${c}=${localOpt[c]}`; }, '');
      var nexturl = url + '?' + optList.substring(1);
      fetch(nexturl)
        .then((resp) => resp.json())
        .then((json) => {
          for (const item of json.items || []) {
            items.push(item);
          }
          return json;
        })
        .then((json) => {
          if (json.backoff) {
            console.log('backoff for ', json.backoff, page);
          }
          var waitMs = (json.backoff || 1) * 1000;
          var has_more = json.has_more;
          var has_pages_left = page < num_pages;
          var has_records_left = json.items && json.items.length > 0;
          // do we stop or do we get the next page?
          if (useHasMoreOnly) {
            // trust the API: fetch the next page as long as it says it has more
            if (has_more) {
              setTimeout(() => get(url, key, opt, page + 1), waitMs);
            } else {
              resolve(items);
            }
          } else {
            // we do not trust has_more, so we use our own page and item
            // counts to decide whether to get another page
            if (has_pages_left && has_records_left) {
              setTimeout(() => get(url, key, opt, page + 1), waitMs);
            } else {
              resolve(items);
            }
          }
        })
        .catch(reject);
    }
    get(url, key, opt, 1);
  });
}
`access_token` is, probably, a red herring. If the issue is that you are running out of quota at 300 requests, which is what the number of requests and the error message you two have included elsewhere tend to indicate, then using an `access_token` won't help you solve that problem, because you must use an SE API `key` to use an `access_token`, and what you're seeing tends to indicate the requests are being sent without a valid `key` that matches the `access_token`, because each `access_token` represents authorization for a specific application + specific user.

…the `key` which you've intended to set. I don't know R or this library, so I really can't tell you how to do that. The evidence you've provided so far, the limit to a small number of requests and the error message you two have shown elsewhere, strongly suggests that it's not sending a valid `key` value with the request. So that's what I'd check first, either by looking at the values it's using to construct the request or at the actual request sent.

…`has_more` returns false. It looks similar to this bug: stackapps.com/questions/8356, with the only difference that I don't get a CORS error. I'll check if I have other means to get past that `has_more` bug by using the steps offered by Brock.

…`total` and does not pay attention to `has_more`, so it isn't affected by the intermittent error in what's returned in the `has_more` field. In the request logs, I do see at least one invalid `has_more: false` at around page 381.

Requesting `total` can be an expensive operation from the SE API's point of view. Doing so in SD was a strong contributor to the intermittent issues we have been seeing since May, and to the massive issues we saw 21 hours after having made changes to request Staging Ground posts. Overall, requesting `total` shouldn't be done with every request.

…an empty `items`, or get the `total` once, so you know the ballpark of the number of requests, and then just keep requesting until getting a valid response, but with `items` empty. You could also test for `has_more: false` in addition to an empty (not missing) `items`.
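The "get `total` once, then stop on an empty `items`" approach could be sketched like this. The `fetchPage` helper is a made-up stand-in, not part of any library:

```javascript
// Sketch only: `fetchPage(page, { withTotal })` stands in for whatever
// actually calls the SE API; it must resolve to the response wrapper.
async function fetchAll(fetchPage, pageSize) {
  // Request `total` once (needs a filter that includes it) to get a
  // ballpark for the number of requests we are about to make.
  const first = await fetchPage(1, { withTotal: true });
  const expectedPages = Math.ceil(first.total / pageSize);
  console.log(`expecting roughly ${expectedPages} pages`);

  const items = [...first.items];
  let page = 2;
  for (;;) {
    const json = await fetchPage(page, { withTotal: false });
    // Stop on a valid but empty `items` array; do not trust `has_more`.
    if (Array.isArray(json.items) && json.items.length === 0) break;
    items.push(...json.items);
    page += 1;
  }
  return items;
}
```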
I haven't looked at the library that's being used in detail, but it's likely that it assumes that `has_more` is always valid, when it's actually unreliable, and uses `has_more` to end requesting pages. To account for the bug in the `has_more` field returned by the SE API, that end condition will need to be changed to: `items` is a valid array, but contains no values. The cost for that will be one additional SE API request at the end of each sequence.
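To make the changed end condition concrete (the names here are illustrative, not the library's), the stop test and the one-extra-request cost look like this:

```javascript
// Stop only on a valid but empty `items` array; treat a missing `items`
// (e.g. an error response) as "don't know yet", not as "done".
function reachedEnd(json) {
  return Array.isArray(json.items) && json.items.length === 0;
}

// Count how many requests a paged fetch makes with this condition:
// one per non-empty page, plus one final request that comes back empty.
function countRequests(pages) {
  let requests = 0;
  for (const items of pages) {
    requests += 1;
    if (reachedEnd({ items })) break;
  }
  return requests;
}
```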