The error says “too many open files”, and it probably refers to the number of open sockets. Why does it call them files? Sockets are just file descriptors, and operating systems limit the number of file descriptors a process may hold open. How many files are too many? I checked with Python’s resource module, and the limit seems to be around 1024. How can we bypass this? The primitive way is to raise the open-file limit, but that is probably not a good way to go. A much better way is to add some synchronization in the client that limits the number of concurrent requests it can process. I’m going to do this by adding an asyncio.Semaphore() that allows at most 1000 tasks at a time.
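async def bound_fetch(sem, _url, session, wappalyzer, col):
    # Getter function with semaphore: at most N of these bodies run at once,
    # where N is the value the semaphore was created with.
    async with sem:
        try:
            page = await WebPage.new_from_url_async(
                url='http://' + _url,
                verify=False,
                aiohttp_client_session=session,
                timeout=7,
                headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'},
            )
        except Exception:
            # Skip sites that fail to load. Catching Exception rather than using
            # a bare except lets task cancellation propagate.
            return {}
        tech = wappalyzer.analyze_with_categories(page)
        print(tech)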
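To see the semaphore doing its job without touching the network, here is a minimal, self-contained sketch. It reads the open-file limit with the resource module (Unix only) and then runs a batch of simulated requests under an asyncio.Semaphore, recording the peak number of requests in flight. The names fake_fetch, bound_fake_fetch, run and MAX_CONCURRENT are made up for this illustration, and asyncio.sleep stands in for the real HTTP call.

```python
import asyncio
import resource  # standard library, available on Unix-like systems only

# The soft limit is the one that triggers "too many open files";
# the hard limit is the ceiling the soft limit may be raised to.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

# Hypothetical cap for this sketch, not a value taken from the scraper above.
MAX_CONCURRENT = 100


async def fake_fetch(url, stats):
    # Stand-in for a real request: track how many calls are in flight.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.001)  # simulated network latency
    stats["active"] -= 1
    return url


async def bound_fake_fetch(sem, url, stats):
    # Same pattern as bound_fetch: wait for a free slot, then do the work.
    async with sem:
        return await fake_fetch(url, stats)


async def run(urls, limit=MAX_CONCURRENT):
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    # All tasks are created up front, but only `limit` of them get past the semaphore.
    results = await asyncio.gather(
        *(bound_fake_fetch(sem, url, stats) for url in urls)
    )
    return results, stats["peak"]


if __name__ == "__main__":
    urls = [f"example{i}.com" for i in range(500)]
    results, peak = asyncio.run(run(urls, limit=50))
    print(f"soft limit: {soft_limit}, hard limit: {hard_limit}")
    print(f"fetched: {len(results)}, peak concurrency: {peak}")
```

The peak should never exceed the value passed to the semaphore. In a real client it makes sense to keep that value somewhat below the soft limit, because the process also holds descriptors for stdin, stdout, log files and database connections.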
dbo.collection("sites").find({"results.techonologies": { $exists: true, $ne: [] }})