Anand Sukumaran
Startup Founder, Software Engineer, Abstract thinker
Co-founder & CTO @ Engagespot (Techstars NYC '24)
I built an AI agent in JavaScript to pay my bills and place orders just by talking to it!
Feb 18, 2025
Sometimes, I feel too lazy to log in to a website and pay bills manually. If someone could do it for me, that would be great. So, I built an AI agent to handle it!
Now, I can simply say: “Hey, please pay the electricity bill. The due date is near. I’ll pay using my VISA credit card.”
🔹 Agent: “Sure! Let me open the website and do it for you!”
I use a secondary laptop for tasks like these, so the agent runs on that device while I continue working on my main laptop.
How does it work?
- The agent fills in my consumer number, and sometimes even solves the CAPTCHA.
- It selects the bill, confirms the details, and navigates to the payment page.
- It even chooses “VISA” from the available payment options for me!
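For illustration, here is a hedged sketch of what those steps could look like as direct Stagehand calls. The URL, consumer number, and instructions are placeholders (not the real billing site), and Stagehand’s act() signature may differ slightly between versions:

```javascript
// Hypothetical, hard-coded version of the steps above.
// Assumes Stagehand driving a local browser; all values are placeholders.
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();
const page = stagehand.page;

await page.goto("https://example-electricity-board.com/pay");
await page.act("fill the consumer number field with 0000000000");
await page.act("select the pending bill and open the payment page");
await page.act("choose VISA from the available payment options");

await stagehand.close();
```

In reality the agent decides these steps on its own rather than running a fixed script; the loop that does that is sketched in the technical section below.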
When sensitive data like my credit card number is required, the agent asks me for input. I’ve added Text-to-Speech (TTS) so it doesn’t just display text but actually calls out:
🔹 Agent: “Anand, I need you to enter your credit card number.” I enter it manually (I could automate this too, but for security reasons I didn’t).
The agent then moves to the payment confirmation page, clicks “Confirm,” and asks for the OTP sent to my mobile.
🔹 Agent: “I need your OTP to confirm the payment.” Instead of typing it, I use Speech-to-Text (STT) and simply say: “My OTP is 792649.”
And finally, the agent confirms: “I have successfully paid the bill of Rs. XXX. Do you want me to save a screenshot of the confirmation page?”
And that’s it - done! AI is truly becoming my personal assistant! 🚀
How it works (Technical details)
The agent uses Stagehand as a tool - an open-source browser automation library for performing actions on a website.
The agent navigates to a website, takes a screenshot, and reasons about what to do next.
After making a plan, it executes it using Stagehand.
After every step, it takes another screenshot and verifies the result before planning the next move.
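Putting that loop into code: below is a minimal sketch of the screenshot → reason → act cycle, not the author’s exact code. The OpenAI call and the JSON action contract are my assumptions for illustration (the post doesn’t say which model or prompt format was used), and Stagehand’s act() signature may vary by version:

```javascript
// A minimal sketch of the custom agent loop under stated assumptions:
// OpenAI's SDK as the reasoning model, and a simple JSON contract
// ({"done": boolean, "action": string}) for the model's replies.
import { Stagehand } from "@browserbasehq/stagehand";
import OpenAI from "openai";

const openai = new OpenAI();

// Show the model the current page and ask for the next step.
async function nextStep(screenshotBase64, goal) {
  const res = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [{
      role: "user",
      content: [
        { type: "text", text: `Goal: ${goal}. Look at the screenshot and reply as JSON: {"done": boolean, "action": "next browser action in plain English"}.` },
        { type: "image_url", image_url: { url: `data:image/png;base64,${screenshotBase64}` } },
      ],
    }],
  });
  return JSON.parse(res.choices[0].message.content);
}

async function run(goal, startUrl) {
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  const page = stagehand.page;
  await page.goto(startUrl);

  while (true) {
    // Verify state after each step by screenshotting and re-reasoning.
    const shot = (await page.screenshot()).toString("base64");
    const step = await nextStep(shot, goal);
    if (step.done) break;
    await page.act(step.action); // Stagehand executes the natural-language step
  }
  await stagehand.close();
}

await run("Pay the electricity bill with the VISA card", "https://example-utility.com");
```

Each iteration feeds the model its own latest observation as the next prompt, which is presumably what the tools list below calls auto-prompting.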
Whenever it determines that human input is required, it connects with the TTS tool to speak aloud.
The STT engine then listens, converts my spoken reply to text, and feeds it back to the LLM.
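Here is a sketch of that handoff, reusing the Stagehand page from the loop above. The post doesn’t name the TTS/STT engines, so this assumes the `say` npm package for speech output and a terminal prompt standing in for the STT side:

```javascript
// Human-in-the-loop handoff, sketched under stated assumptions:
// `say` for TTS; the STT engine is stubbed with terminal input.
import say from "say";
import readline from "node:readline/promises";

function speak(text) {
  // say.speak(text, voice, speed, callback) — undefined voice uses the OS default
  return new Promise((resolve) => say.speak(text, undefined, 1.0, () => resolve()));
}

async function listen() {
  // Stand-in for a real STT engine: read the reply from the terminal.
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const answer = await rl.question("> ");
  rl.close();
  return answer;
}

// Called by the agent loop whenever the model decides it needs the human.
async function askHuman(page, question) {
  await speak(question);        // e.g. "I need your OTP to confirm the payment"
  const reply = await listen(); // spoken reply, transcribed to text
  await page.act(`type "${reply}" into the OTP field`); // fed back via Stagehand
}
```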
Tools Used
- JavaScript
- Stagehand
- Screenshot tool
- Custom agent loop for auto-prompting
🔹 Time taken to build this - Around 3 hours.
I will share the source code, but I haven’t structured it properly yet. I’ll publish it when I get the time!