How To Create A Scraper And Submit A Form With Puppeteer
Mihnea-Octavian Manolache on Feb 28 2023
Have you ever had to work with POST requests while web scraping? I am sure you have! And it’s forms we have to handle most of the time. That is why today, I am going to talk about how to submit forms with Puppeteer. If you don’t know what Puppeteer is yet, don’t worry. You’ll find out in just a moment. Until then, let me set some expectations for today’s article. If you’re following along on our learning path, today you will learn:
- What is Puppeteer in web scraping
- How to set up a simple Puppeteer project
- How is form submission handled in Puppeteer
So, without further ado, let’s get to it!
What is Puppeteer and why is it important for web scraping?
In general, web scraping refers to the process of automating data extraction from various servers. Back in the day, a simple HTTP client would have been enough to perform this task. Nowadays though, websites rely more and more on JavaScript, and a traditional HTTP client cannot execute it. That is where Puppeteer comes into play.
Puppeteer is a Node.js library that allows you to control a headless Chrome or Chromium browser via the DevTools Protocol. Long story short, it provides a high level API to automate Chrome.
In terms of web scraping, Puppeteer is useful for scraping websites that require JavaScript to be rendered. It can also interact with web pages much like a human would: clicking buttons or, our focus for today, filling out forms. This makes it well suited for scraping websites that rely on anti-scraping techniques, since it behaves like a real browser.
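To see for yourself why a traditional HTTP client falls short, try fetching a JavaScript-heavy page with one. Here is a minimal sketch, assuming Node.js 18+ where `fetch` is built in: on a client-side rendered site, the HTML you get back is just the empty application shell, before any script has run.

```javascript
// A plain HTTP client only receives the initial HTML document.
// Anything JavaScript would render afterwards is simply missing.
const response = await fetch('https://webscrapingapi.com/')
const rawHtml = await response.text()

// On a JS-rendered site this prints only the static shell
console.log(rawHtml.slice(0, 500))
```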
Setting up a simple Puppeteer project
I believe in taking things slow for a better understanding of the overall process. Before going into how to submit forms with Puppeteer, let’s cover the basics. In this section, I am going to show you how to set up a Node project, install Puppeteer, and use it to scrape data. So, first things first, let’s create a new folder and open it inside our desired IDE. I prefer Visual Studio Code, but feel free to use whatever you want.
Did you know?
- You can ‘programmatically’ create a new folder from your terminal by typing the `mkdir` command.
- You can use the `npm init -y` command to set up a Node project and accept the default values.
- You can create a new file with the `touch` command.
- And you can also open VSCode with the `code .` command.
If you want, you can combine the four and spin up a project in seconds like this:
```bash
mkdir scraper && cd scraper && npm init -y && code .
```
Inside your IDE, open up a new terminal (Terminal > New Terminal) and let’s install Puppeteer. Type `npm i puppeteer --save` inside your terminal. Also, I like using ES modules instead of CommonJS (the `import`/`export` syntax rather than `require`). If you want to use modules too, open `package.json` and add `"type": "module"` to the JSON object.
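After this step, your `package.json` should look something like the sketch below. The exact `name` and version numbers will match whatever `npm init` and `npm install` generated for you:

```json
{
  "name": "scraper",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "puppeteer": "^19.7.0"
  }
}
```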
Now that we’re all set up, we can start adding some code. Create a new `index.js` file and open it in the IDE. No need to do it from the terminal this time, but just as a hint, you could use the `touch` command. Now let’s add the code:
```javascript
import puppeteer, { executablePath } from 'puppeteer'

const scraper = async (url) => {
  // Launch a visible (headful) browser, pointing Puppeteer
  // at its bundled Chrome binary
  const browser = await puppeteer.launch({
    headless: false,
    executablePath: executablePath(),
  })
  const page = await browser.newPage()

  // Navigate to the target and grab the fully rendered HTML
  await page.goto(url)
  const html = await page.content()

  await browser.close()
  return html
}
```
And let’s see what we’re doing:
- We’re importing Puppeteer and `executablePath` into our project
- We’re defining a new function that takes one `url` parameter
- We’re launching a new browser using `puppeteer.launch`:
  a. We’re specifying we want it to run headful
  b. We’re using `executablePath` to get the Chrome path
- We’re opening a new page and navigating to the `url`
- We’re saving the `page.content()` in a constant
- We’re closing the browser instance
- And finally, we’re returning the `html` output of the page we just scraped
So far things are not complicated. This is the bare minimum of a web scraper implementation with Node.js and Puppeteer. If you want to run the code, simply give the `scraper` function a target and log its return value:
```javascript
console.log(await scraper('https://webscrapingapi.com/'))
```
But remember, our goal is to extract data upon submitting a form. This means we have to think of a way to submit forms with Puppeteer. Luckily, I’ve done it before and I know it’s not hard. So let’s see how you can do it too.
How to submit forms with Puppeteer
Think of Puppeteer as a means to mimic human behavior on a given website. How do we, humans, submit forms? Well, we identify the form, fill it in, and usually click a button. That is the same logic used to submit forms with Puppeteer. The only difference is how we perform these actions, because humans rely on their senses. Since Puppeteer is software, we’ll do it programmatically, using Puppeteer’s built-in methods, like so:
#1: Submit simple forms with Puppeteer
First things first, we need to ‘visualize’ our form. On a web page, form elements are grouped inside an HTML `<form>` block, and every element can be located through a selector, usually a CSS selector built from the element’s attributes, such as its `id` or `class`. Yet, you may come across websites that don’t expose such selectors. In those scenarios, you can use XPath expressions instead (a quick sketch follows below). But that is a subject for another talk. For now, let’s focus on identifying elements in Puppeteer using CSS.
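Just so you have it on hand, here is a minimal sketch of what the XPath route looks like. Note that the mechanism is version-dependent: recent Puppeteer releases accept XPath through the `::-p-xpath()` selector prefix, while older ones expose a dedicated `page.$x()` method. The expression below is only illustrative:

```javascript
import puppeteer, { executablePath } from 'puppeteer'

const browser = await puppeteer.launch({
  headless: false,
  executablePath: executablePath(),
})
const page = await browser.newPage()
await page.goto('https://stackoverflow.com/users/login')

// Locate the email input by XPath instead of CSS. On older Puppeteer
// versions: const [emailInput] = await page.$x('//input[@id="email"]')
const emailInput = await page.$('::-p-xpath(//input[@id="email"])')
await emailInput.type('someone@example.com')

await browser.close()
```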
To have some sort of background, let’s say we want to automate the login action on Stack Overflow. So the target is https://stackoverflow.com/users/login. Open up your browser, navigate to the login page, and open Developer Tools. You can right-click on the page and select ‘Inspect’.
In the Developer Tools panel, next to the graphical interface, you will see the page’s HTML structure. If you look closely, you will see our form. It mainly consists of two inputs and one button. These are the three elements we are targeting. And as you can see, all three elements have an `id` as a CSS identifier. Let’s translate what we’ve learned so far into code:
```javascript
import puppeteer, { executablePath } from 'puppeteer'

const scraper = async (target) => {
  const browser = await puppeteer.launch({
    headless: false,
    executablePath: executablePath(),
  })
  const page = await browser.newPage()
  await page.goto(target.url, { waitUntil: 'networkidle0' })

  // Fill in both inputs, then submit the form
  await page.type(target.username.selector, target.username.value)
  await page.type(target.password.selector, target.password.value)

  // Clicking the button triggers a navigation, so wait for it
  // to finish before reading the page content
  await Promise.all([
    page.waitForNavigation(),
    page.click(target.buttonSelector),
  ])

  const html = await page.content()
  await browser.close()
  return html
}
```
In order to keep it functional and reusable, I chose to replace my function’s parameter with an object. This object consists of the targeted URL, the input selectors and values, and the selector for the submit button. So, to run the code, just create a new `TARGET` object that holds your data, and pass it to your `scraper` function:
```javascript
const TARGET = {
  url: 'https://stackoverflow.com/users/login',
  username: {
    selector: 'input[id=email]',
    value: '<YOUR_USERNAME>'
  },
  password: {
    selector: 'input[id=password]',
    value: '<YOUR_PASSWORD>'
  },
  buttonSelector: 'button[id=submit-button]'
}

console.log(await scraper(TARGET))
```
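One caveat: because clicking the submit button kicks off a navigation, reading `page.content()` too early can return the pre-login page, which is why the `scraper` above pairs the click with `waitForNavigation`. If you want an explicit success check on top of that, you could wait for an element that only renders for logged-in users. A minimal sketch; the selector below is hypothetical, so inspect the page while logged in to find a reliable one:

```javascript
// Optional sanity check inside scraper(), after the navigation settles.
// '.s-user-card' is a hypothetical logged-in-only selector: verify it
// in DevTools before relying on it.
await page.waitForSelector('.s-user-card', { timeout: 10000 })
```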
#2: Upload files with Puppeteer
Sometimes, web automation requires us to upload files rather than submit simple forms. If you come across such a task and need to attach files before submitting the form with Puppeteer, you will want to make use of Puppeteer’s `uploadFile` method. To keep things simple, I suggest you create a new function for this action:
```javascript
const upload = async (target) => {
  const browser = await puppeteer.launch({
    headless: false,
    executablePath: executablePath(),
  })
  const page = await browser.newPage()
  await page.goto(target.url, { waitUntil: 'networkidle0' })

  // Grab the file input as an ElementHandle, then attach the file to it
  const fileInput = await page.$(target.form.file)
  await fileInput.uploadFile(target.file)

  await page.click(target.form.submit)
  await browser.close()
}
```
See how this time I am using `page.$` to first identify the element, and only after that am I calling the `uploadFile` method, which only works on `ElementHandle` types. Parameter-wise, just like previously, I am using an object to pass all data at once to my function. If you want to test the script, just add the following code and run `node index.js` in your terminal:
```javascript
const TARGET = {
  url: 'https://ps.uci.edu/~franklin/doc/file_upload.html',
  form: {
    file: 'input[type=file]',
    submit: 'input[type=submit]'
  },
  file: './package.json'
}

upload(TARGET)
```
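As a side note, `page.waitForSelector` also returns an `ElementHandle`, so on pages that render the form late you could combine the wait and the lookup in one call. A small variation on the two lookup lines inside `upload`, under the same assumptions:

```javascript
// waitForSelector both waits for the input to appear in the DOM
// and returns the ElementHandle that uploadFile needs
const fileInput = await page.waitForSelector(target.form.file)
await fileInput.uploadFile(target.file)
```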
Conclusions
Wrapping everything up, I’d say it’s quite easy to submit forms with Puppeteer. Moreover, I find that compared to its alternatives, Puppeteer handles the entire flow natively: typing, clicking, and uploading files. Basically, all the user has to do is properly identify the elements.
Now, I should note that a real-world scraper requires much more in order to be efficient. Most of the time, if you ‘abuse’ a server by submitting too many forms in a short period of time, you will probably get blocked. That is why, if you want to automate the process of form submission, I recommend using a professional scraping service. At Web Scraping API, we offer the option of sending POST and PUT requests. You can read more about it in our documentation.