Up until now, browser crawlers used the same session (and therefore the same proxy) for
all requests from a single browser; now each session gets a new proxy. This means
that with incognito pages, each page will get a new proxy, aligning the behaviour with
CheerioCrawler.
This feature is not enabled by default. To use it, we need to enable the useIncognitoPages
flag under launchContext:
const crawler = new Apify.PlaywrightCrawler({
    launchContext: {
        useIncognitoPages: true,
    },
    // ...
});
Note that there is currently a performance overhead for using
useIncognitoPages. Use this flag at your own discretion.
We are planning to enable this feature by default in SDK v3.0.
Previously, when a page function timed out, the task kept running. This could lead to requests being processed multiple times. v2.2 introduces abortable timeouts that cancel the task as early as possible.
Several new timeouts were added to the task function, which should help mitigate the zero-concurrency bug. Namely, fetching the next request information and reclaiming failed requests back to the queue are now executed with a timeout, with 3 additional retries before the task fails. The timeout is always at least 300 s (5 minutes), or handleRequestTimeoutSecs if that value is higher.
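In plain Node.js, the combination of an abortable timeout and a fixed number of retries can be sketched like this (a simplified illustration only; the helper names withTimeout and withRetries are made up for this example and are not the SDK's internal API):

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds,
// so a hanging step is cancelled instead of blocking the task forever.
function withTimeout(promise, ms, label) {
    let timer;
    const timeout = new Promise((resolve, reject) => {
        timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Retries `fn` a fixed number of additional times before failing,
// mirroring the "timeout plus 3 additional retries" behaviour.
async function withRetries(fn, retries = 3) {
    let lastError;
    for (let attempt = 0; attempt <= retries; attempt += 1) {
        try {
            return await fn();
        } catch (err) {
            lastError = err;
        }
    }
    throw lastError;
}
```

A crawler-internal step such as fetching the next request could then be wrapped as withRetries(() => withTimeout(fetchNextRequest(), 300000, 'fetchNextRequest')), so the task only fails after the timeout has fired on every attempt.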
- Fix RequestError: URI malformed in cheerio crawler (#1205)
- Fix diffCookie (#1217)
- Fix runTaskFunction() (#1250)
- Add purgeLocalStorage method by @vladfrangu in https://github.com/apify/apify-js/pull/1187
- Pass forceCloud down to the KV store by @vladfrangu in https://github.com/apify/apify-js/pull/1186
- Fix YOUTUBE_REGEX_STRING being too greedy by @B4nan in https://github.com/apify/apify-js/pull/1171
- Add fixUrl function by @szmarczak in https://github.com/apify/apify-js/pull/1184

Full Changelog: https://github.com/apify/apify-js/compare/v2.0.7...v2.1.0
- Add configuration for WAL mode in ApifyStorageLocal (APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE), closes #956
- Add @ts-ignore comments to imports of optional peer dependencies (#1152)
- Fix sdk.openSessionPool() (#1154)
- Fix infiniteScroll (#1140)
- Fixes in ProxyConfiguration and CheerioCrawler.
- Update got-scraping to receive multiple improvements. This update introduces persistent browser headers when using got-scraping.
This release improves the stability of the SDK.
- Fix issues in CheerioCrawler caused by parser conflicts in recent versions of cheerio.
- Pin got-scraping to 2.0.1 until fully compatible.

We're releasing SDK 2 ahead of schedule because we need state-of-the-art HTTP/2 support for scraping, and with Node.js versions below 15.10, HTTP/2 is not very reliable. We bundled in two more potentially breaking changes that we were waiting for, but we expect those to have very little impact on users. Migration should therefore be super simple. Just bump your Node.js version.
If you're waiting for full TypeScript support and new features, those are still in the works and will be released in SDK 3 at the end of this year.
- Update cheerio to 1.0.0-rc.10 from rc.3. There were breaking changes in cheerio between the versions, so this bump might be breaking for you as well.
- Remove LiveViewServer, which was deprecated before the release of SDK v1.
- browser-pool rewrite.
- Fix headerGeneratorOptions not being passed to got-scraping in requestAsBrowser.
- Fix /v2 duplication in apiBaseUrl.

CheerioCrawler

CheerioCrawler downloads the web pages using the requestAsBrowser utility function.
As opposed to the browser-based crawlers, which automatically encode the URLs, the
requestAsBrowser function will not do so. We either need to encode the URLs manually
via the encodeURI() function, or set forceUrlEncoding: true in the requestAsBrowserOptions,
which will automatically encode all the URLs before accessing them.
We can either use forceUrlEncoding or encode manually, but not both; doing both would result in double encoding and therefore lead to invalid URLs.
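The double-encoding problem is easy to see with plain encodeURI() (a standalone illustration with a made-up example URL):

```javascript
// A URL containing a space and non-ASCII characters.
const url = 'https://example.com/káva horká';

// Encoding once yields a valid URL.
const encodedOnce = encodeURI(url);
// 'https://example.com/k%C3%A1va%20hork%C3%A1'

// Encoding again escapes the '%' characters themselves,
// producing a URL the target server will not understand.
const encodedTwice = encodeURI(encodedOnce);
// 'https://example.com/k%25C3%25A1va%2520hork%25C3%25A1'
```

This is why only one of forceUrlEncoding and manual encodeURI() should be used for any given request.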
We can use the preNavigationHooks to adjust requestAsBrowserOptions:
preNavigationHooks: [
(crawlingContext, requestAsBrowserOptions) => {
requestAsBrowserOptions.forceUrlEncoding = true;
}
]
Apify class and Configuration

Adds two new named exports:
- Configuration class that serves as the main configuration holder, replacing explicit usage of environment variables.
- Apify class that allows configuring the SDK. Env vars still have precedence over the SDK configuration. When using the Apify class, there should be no side effects.
Also adds new configuration for WAL mode in ApifyStorageLocal.
As opposed to using the global helper functions like main, there is an alternative approach using the Apify class.
It has mostly the same API, but the methods on an Apify instance will use the configuration provided in the constructor.
Environment variables will have precedence over this configuration.
const { Apify } = require('apify'); // use named export to get the class
const sdk = new Apify({ token: '123' });
console.log(sdk.config.get('token')); // '123'
// the token will be passed to the `call` method automatically
const run = await sdk.call('apify/hello-world', { myInput: 123 });
console.log(`Received message: ${run.output.body.message}`);
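The precedence rule can be illustrated with a minimal, hypothetical stand-in for the real Configuration class (the SimpleConfiguration name and the env-var mapping below are assumptions for this sketch, not the SDK's actual implementation):

```javascript
// Simplified stand-in for the SDK's Configuration class.
class SimpleConfiguration {
    constructor(options = {}) {
        this.options = options;
    }

    // Env vars win over options passed to the constructor.
    get(key, envVar) {
        if (envVar && process.env[envVar] !== undefined) {
            return process.env[envVar];
        }
        return this.options[key];
    }
}

const config = new SimpleConfiguration({ token: '123' });
console.log(config.get('token', 'APIFY_TOKEN')); // '123' unless APIFY_TOKEN is set
```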
Another example shows how the default dataset name can be changed:
const { Apify } = require('apify'); // use named export to get the class
const sdk = new Apify({ defaultDatasetId: 'custom-name' });
await sdk.pushData({ myValue: 123 });
is equivalent to:
const Apify = require('apify'); // use default export to get the helper functions
const dataset = await Apify.openDataset('custom-name');
await dataset.pushData({ myValue: 123 });
- Add Configuration class and Apify named export, see above.
- Fix proxyUrl without a port throwing an error when launching browsers.
- Fix maxUsageCount of a Session not being persisted.
- Update puppeteer and playwright to match stable Chrome (90).
- Add taskTimeoutSecs to allow control over timeout of AutoscaledPool tasks.
- Add forceUrlEncoding to requestAsBrowser options.
- Add preNavigationHooks and postNavigationHooks to CheerioCrawler.
- Deprecate prepareRequestFunction and postResponseFunction methods of CheerioCrawler.
- Add aborting event for handling a gracefully aborted run from the Apify platform.
- Fix requestAsBrowser behavior with various combinations of json and payload legacy options, closes #1028.

This release brings the long-awaited HTTP2 capabilities to requestAsBrowser. It could make HTTP2 requests even before, but it was not very helpful in making browser-like ones. This is very important for disguising as a browser and for reducing the number of blocked requests. requestAsBrowser now uses got-scraping.
The most important new feature is that the full set of headers requestAsBrowser uses will now be generated using live data about browser headers that we collect. This means that the "header fingerprint" will always match existing browsers and should be indistinguishable from a real browser request. The header sets will be automatically rotated for you to further reduce the chances of blocking.
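The rotation idea can be sketched as cycling through a pool of internally consistent header sets (illustrative only; got-scraping generates headers from live browser data rather than a fixed pool, and the header values below are made up):

```javascript
// A tiny pool of pre-collected, internally consistent header sets.
// Real header generation also keeps User-Agent, sec-ch-ua, Accept,
// and the rest of the fingerprint mutually consistent.
const HEADER_SETS = [
    {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'accept-language': 'en-US,en;q=0.9',
    },
    {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
        'accept-language': 'en-GB,en;q=0.8',
    },
];

let next = 0;

// Returns a different coherent header set on each call,
// so consecutive requests do not share a single fingerprint.
function rotateHeaders() {
    const headers = HEADER_SETS[next % HEADER_SETS.length];
    next += 1;
    return headers;
}
```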
We also switched the default HTTP version from 1 to 2 in requestAsBrowser. We don't expect this change to be breaking, and we took precautions, but we're aware that there are always some edge cases, so please let us know if it causes trouble for you.
- Replace the underlying HTTP client of utils.requestAsBrowser() with got-scraping.
- Make useHttp2 true by default with utils.requestAsBrowser().
- Fix Apify.call() failing with empty OUTPUT.
- Update puppeteer to 8.0.0 and playwright to 1.10.0 with Chromium 90 in Docker images.
- Update @apify/ps-tree to support Windows better.
- Update @apify/storage-local to support Node.js 16 prebuilds.
- Deprecate utils.waitForRunToFinish, please use the apify-client package and its waitForFinish functions. Sorry, forgot to deprecate this with the v1 release.
- Fix require that broke the SDK with the underscore 1.13 release.
- Update @apify/storage-local to v2, written in TypeScript.
- Fix SessionPoolOptions not being correctly used in BrowserCrawler.
- Fix issues with optional puppeteer or playwright installations.

In this minor release we focused on the SessionPool. Besides fixing a few bugs, we added one important feature: setting and getting of sessions by ID.
// Now you can add specific sessions to the pool,
// instead of relying on random generation.
await sessionPool.addSession({
id: 'my-session',
// ... some config
});
// Later, you can retrieve the session. This is useful
// for example when you need a specific login session.
const session = await sessionPool.getSession('my-session');
- Add sessionPool.addSession() function to add a new session to the session pool (possibly with the provided options, e.g. with a specific session id).
- Add sessionId argument to sessionPool.getSession() to be able to retrieve a session from the session pool with a specific session id.
- Fix SessionPool not working properly in both PuppeteerCrawler and PlaywrightCrawler.
- Fix Apify.call() and Apify.callTask() output to make it backwards compatible with previous versions of the client.
- Update browser-pool to fix issues with failing hooks causing browsers to get stuck in limbo.
- Remove the proxy-chain dependency, because now it's covered in browser-pool.