This looks like a bug in the API. The response wrapper object has a field `has_more` that indicates whether there are more records beyond the page you just fetched. For unknown reasons `has_more` returns false when it shouldn't; in your case it returns false on page 313.
I have created a Stack Snippet to demonstrate the problem (I'm sorry, but I'm not an R developer). In the JavaScript snippet you can set `useHasMoreOnly` to true, and when you then run the script it will fetch 313 pages. If you set `useHasMoreOnly` to false, the code only checks whether any items were returned, so we happily fetch page 314, and it turns out that works fine.
So the work-around should be that the library does not use `has_more` but instead:

- looks at `total` in the wrapper and compares it to the current page times `page_size` to determine whether it needs to fetch another page. Requesting `total` is an expensive operation and should be avoided when possible. The `total` isn't returned by default, so if you want to use it, enable it in your filter first.
- looks at the `items[]` array in the wrapper. If that comes back empty (zero elements, or in JavaScript syntax: `items.length === 0`), you have reached past the last page and can stop fetching more pages.
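The two checks above can be sketched as follows. The wrapper fields `total`, `page_size`, and `items` come from the API; the helper names are mine:

```javascript
// Sketch of the two stop checks; `json` is the parsed response wrapper
// and `page` is the page that was just fetched.

// Check 1: compare pages fetched so far against `total`
// (only present when your filter includes it).
function morePagesByTotal(json, page, pageSize) {
  return page * pageSize < json.total;
}

// Check 2: a present-but-empty `items` array means we ran past the last page.
function morePagesByItems(json) {
  return Array.isArray(json.items) && json.items.length > 0;
}
```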
Unfortunately I can't offer the exact changes that are needed in the library you use but I hope that the code and explanation give enough guidance to implement the workaround.
There is reason to believe this bug is related to, if not the same as, "After successfully retrieving 180 pages, the API gracelessly, semi-silently, fails".
// set to false to fetch beyond 313 pages
var useHasMoreOnly = false;
var key = 'RBeb2Cm7UIYNbN4lwegbaQ((';
var parms = {
  site: 'stackoverflow',
  tagged: 'r',
  num_pages: 400, // 1000000,
  pagesize: 100,
  filter: "!UHY-aKsFJ(KvceZ5uauvQDp9b_ZwAQaEY0KwVy4Czncd97-22tonZWvDXfhmP(X*Baz8J0uC0Q"
};

stack_questions(parms)
  .then((df_questions_r) => {
    document.getElementsByTagName('body')[0].textContent = `fetched ${df_questions_r.length} posts`; // JSON.stringify(df_questions_r);
  })
  .catch(console.log);
// library magic
function stack_questions(opt) {
  var url = 'https://api.stackexchange.com/2.3/questions';
  var items = [];
  return new Promise((resolve, reject) => {
    function get(url, key, opt, page) {
      var num_pages = opt.num_pages || 1;
      var localOpt = Object.assign({ key: key, page: page }, opt);
      // build the query string, leaving out our private num_pages option
      var optList = Object
        .keys(localOpt)
        .filter((k) => k !== 'num_pages')
        .reduce(function (p, c) { return `${p}&${c}=${localOpt[c]}`; }, '');
      var nexturl = url + '?' + optList.substring(1);
      fetch(nexturl)
        .then((resp) => resp.json())
        .then((json) => {
          for (const item of json.items || []) {
            items.push(item);
          }
          return json;
        })
        .then((json) => {
          if (json.backoff) {
            console.log('backoff for ', json.backoff, page);
          }
          var waitMs = (json.backoff || 1) * 1000;
          var has_more = json.has_more;
          var has_pages_left = page < num_pages;
          var has_records_left = json.items && json.items.length > 0;
          // do we stop or do we get the next page?
          if (useHasMoreOnly) {
            // trust the API: fetch the next page as long as it says it has more
            if (has_more) {
              setTimeout(() => get(url, key, opt, page + 1), waitMs);
            } else {
              resolve(items);
            }
          } else {
            // we do not trust has_more, so we use our own page and item
            // counts to decide whether to get another page
            if (has_pages_left && has_records_left) {
              setTimeout(() => get(url, key, opt, page + 1), waitMs);
            } else {
              resolve(items);
            }
          }
        })
        .catch(reject);
    }
    get(url, key, opt, 1);
  });
}
`access_token` is, probably, a red herring. If the issue is that you are running out of quota at 300 requests, which is what the number of requests and the error message you two have included elsewhere tend to indicate, then using an `access_token` won't help you solve that problem, because you must use an SE API `key` to use an `access_token`, and what you're seeing tends to indicate the requests are being sent without a valid `key` that matches the `access_token`, because each `access_token` represents authorization for a specific application + specific user.

…the `key` which you've intended to set. I don't know R or this library, so I really can't tell you how to do that. The evidence you've provided so far, the limit to a small number of requests and the error message you two have shown elsewhere, strongly suggests that it's not sending a valid `key` value with the request. So that's what I'd check first, either by looking at the values it's using to construct the request or at the actual request sent.

…`has_more` returns false. It looks similar to this bug: stackapps.com/questions/8356, with the only difference that I don't get a CORS error. I'll check if I have other means to get past that `has_more` bug by using the steps offered by Brock.

…`total` and does not pay attention to `has_more`, so it isn't affected by the intermittent error in what's returned in the `has_more` field. In the request logs, I do see at least one invalid `has_more: false` at around page 381.

Requesting `total` can be an expensive operation from the SE API's point of view. Doing so in SD was a strong contributor to the intermittent issues we have been seeing since May, and to the massive issues we saw 21 hours after having made changes to request Staging Ground posts. Overall, requesting `total` shouldn't be done with every request.

…an empty `items`, or get the `total` once, so you know the ballpark of the number of requests, and then just keep requesting until getting a valid response, but with `items` empty. You could also test for `has_more: false` in addition to an empty (not missing) `items`.
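The "get `total` once, then stop on an empty `items`" approach could be sketched like this. The `fetchPage` helper is a made-up stand-in, not part of any library:

```javascript
// Sketch only: `fetchPage(page, { withTotal })` stands in for whatever
// actually calls the SE API; it must resolve to the response wrapper.
async function fetchAll(fetchPage, pageSize) {
  // Request `total` once (needs a filter that includes it) to get a
  // ballpark for the number of requests we are about to make.
  const first = await fetchPage(1, { withTotal: true });
  const expectedPages = Math.ceil(first.total / pageSize);
  console.log(`expecting roughly ${expectedPages} pages`);

  const items = [...first.items];
  let page = 2;
  for (;;) {
    const json = await fetchPage(page, { withTotal: false });
    // Stop on a valid but empty `items` array; do not trust `has_more`.
    if (Array.isArray(json.items) && json.items.length === 0) break;
    items.push(...json.items);
    page += 1;
  }
  return items;
}
```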
I haven't looked at the library that's being used in detail, but it's likely that it assumes that `has_more` is always valid, when it's actually unreliable, and uses `has_more` to end requesting pages. To account for the bug in the `has_more` field returned by the SE API, that end condition will need to be changed to: `items` is a valid array, but contains no values. The cost for that will be one additional SE API request at the end of each sequence.
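To make the changed end condition concrete (the names here are illustrative, not the library's), the stop test and the one-extra-request cost look like this:

```javascript
// Stop only on a valid but empty `items` array; treat a missing `items`
// (e.g. an error response) as "don't know yet", not as "done".
function reachedEnd(json) {
  return Array.isArray(json.items) && json.items.length === 0;
}

// Count how many requests a paged fetch makes with this condition:
// one per non-empty page, plus one final request that comes back empty.
function countRequests(pages) {
  let requests = 0;
  for (const items of pages) {
    requests += 1;
    if (reachedEnd({ items })) break;
  }
  return requests;
}
```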