Blog website crawlers and bots in Apache2

Found a couple more bots crawling my website, and judging by online resources it seems I caught a few of the bad guys: crawlers which ignore the robots.txt standard and just crawl a website for content.

I decided to do something about it and added a filter in Apache2.

My webserver is set up so that I have templates for every website (they all have different configs) and deploy them using Ansible. Parts of the website configuration which are the same, or at least similar, are handled by includes.
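
As a rough illustration of the include idea, here is a minimal sketch that renders a shared Apache 2.4 fragment from a list of user-agent names. The bot names, the `bad_bot` variable name, and the `render_bot_filter` helper are placeholders made up for this example, not the configuration actually used on this site; the directives themselves (`SetEnvIfNoCase`, `Require not env`) are standard Apache 2.4.

```python
# Hypothetical user-agent substrings; replace them with the bots seen in your own logs.
BAD_BOTS = ["ExampleBadBot", "AnotherCrawler"]


def render_bot_filter(bots):
    """Render an Apache 2.4 config fragment that denies the given user agents."""
    lines = []
    for bot in bots:
        # SetEnvIfNoCase treats the pattern as a regular expression, so names
        # containing regex metacharacters would need escaping first.
        lines.append(f'SetEnvIfNoCase User-Agent "{bot}" bad_bot')
    lines += [
        '<Location "/">',
        "    <RequireAll>",
        "        Require all granted",
        # Deny every request that was tagged with the bad_bot variable above.
        "        Require not env bad_bot",
        "    </RequireAll>",
        "</Location>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the fragment; it could be written to a file that every vhost includes.
    print(render_bot_filter(BAD_BOTS), end="")
```

Keeping the list in one place means every website template can pull in the same generated include, and adding a newly spotted bot is a one-line change.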

 

Continue reading "Blog website crawlers and bots in Apache2"

openHAB and Telegram Bot

openHAB 2 comes with a Telegram binding which allows you to run a Telegram Bot. This bot can send messages to users and groups, and can also receive commands and respond to them. That's useful: your home automation system can send all kinds of details to your mobile phone.

To make this work, a couple of things are needed:

First of all, a mobile phone with the Telegram app on it. You can have the bot message you directly, but this only works for one person. Alternatively, create a group and have the bot send the messages to the group instead. Find out about the group ID here.

Then you need to create a Telegram Bot. Instructions are available here.

And everything needs to be hooked up in openHAB.
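
Before wiring things into openHAB, it can help to check that the bot token and the chat or group ID work at all. The following is a minimal sketch, independent of the openHAB binding, that calls the `sendMessage` method of the Telegram Bot API using only the Python standard library. The function names are made up for this example, and the token and chat ID in the usage note are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.telegram.org"


def build_send_message_request(token, chat_id, text):
    """Build the HTTP request for the Bot API sendMessage method."""
    url = f"{API_BASE}/bot{token}/sendMessage"
    # The Bot API accepts form-encoded parameters in a POST body.
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def send_message(token, chat_id, text, timeout=10):
    """Send the message and return the decoded JSON answer from Telegram."""
    request = build_send_message_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `send_message("<your bot token>", <your chat or group ID>, "Hello from the bot")` should make the message show up in the chat if both values are correct.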

 

Continue reading "openHAB and Telegram Bot"

New PostgreSQL pg_docbot is live

Last night a long-running project of mine went live: pg_docbot v2.

For years, Jan Wieck provided a helper bot (rtfm_please) in the #postgresql IRC channel on the freenode network. Because of protocol changes in the freenode network, this bot was no longer functional. Together with some others we decided to write a quick and dirty new bot. As it is with dirty hacks, not everything was optimal: after timeouts the bot was not able to reconnect; more precisely, the POE framework did not even recognize the timeout. Extending the bot and adding new functionality was also complicated. For a while I collected all these problems in my personal bug tracker, and about two years ago I started a full rewrite.

Some of the new key features:

  • pg_docbot's channel limit is gone: a user in the freenode network can only join 20-something channels; the new bot was designed from the ground up to handle multiple IRC connections and circumvent this problem
  • function to identify stale URLs: the new ?lost command shows all unconnected URLs
  • registered users are now either "op" or "admin": all operators can issue ?learn and ?forget, admins can - of course - do everything
  • new command to post to all channels: the ?wallchan command lets the bot post to all channels
  • i18n: every channel has a configured language, default is English - all messages in this channel are posted in the configured language (if a translation is available)
  • watchdog on board: every session is monitored and reconnected, if necessary - no more "ads: can you please restart the bot?"
  • nickname handling: every session monitors its (registered) nickname and reclaims the nick if necessary; NickServ handling is now included as well
  • commands are recognized in different languages: a nice add-on and by-product of i18n, most commands can be used in different languages - like "search" (English) and "suche" (German)
  • bot can join and leave channels on the fly: not much to say about this, just that you can have the bot in a temporary PostgreSQL channel if you like
  • channels can have passwords now: this works for configured channels as well as for channels joined on the fly
  • autojoin channels: configured but not joined channels are rejoined after a while; it is also possible to configure a channel without autojoining it
  • statistics: the bot keeps anonymous stats about its usage, like ?search, ?learn, ?forget and so on
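
To illustrate how commands in different languages can map to the same action, here is a small sketch. It is written in Python purely for illustration and is not taken from the pg_docbot code; the alias table and the `parse_command` helper are assumptions for this example, and only the "search"/"suche" pair and the `?` prefix come from the list above.

```python
# Maps every accepted command word to its canonical command name.
# Only "search" and "suche" are named above; further aliases would be added here.
COMMAND_ALIASES = {
    "search": "search",  # English
    "suche": "search",   # German
}


def parse_command(line):
    """Return (canonical_command, argument) or None if the line is no known command."""
    if not line.startswith("?"):
        return None
    # Split "?suche vacuum" into the command word and the rest of the line.
    word, _, argument = line[1:].partition(" ")
    command = COMMAND_ALIASES.get(word.lower())
    if command is None:
        return None
    return command, argument.strip()
```

With a table like this, the rest of the bot only ever deals with the canonical name, and a new translation is just one more entry.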

There is still a lot to do; not all of my tickets are closed. If you want pg_docbot talking in your language, please send me translations. The pg_docbot code is on git.postgresql.org.

Next things on my todo list:

  • verify each URL from time to time: mark unreachable as invalid
  • intelligent sort order: not yet sure how to solve this problem, right now there is no specific sort order
  • move pg_docbot to PostgreSQL infrastructure
  • web interface: the bot should redirect the user to its website if there are more than, let's say, 2 or 3 URLs, to avoid flooding the IRC channels
  • integration in postgresql.org website: the pg_docbot database contains very useful knowledge, there are plans to integrate this into the search on the main website
  • integration with explain.depesz.com: every time the bot sees a link from a paste site, it should scan the content and generate a posting on explain.depesz.com
  • monitor planet.postgresql.org: publish new postings in IRC channels
  • allow better search: like using a regexp
  • ...
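
The first item on the list, verifying URLs from time to time, could look roughly like the sketch below. It is an illustration in Python, not part of pg_docbot; the function names, the use of HEAD requests, and the rule that any status below 400 counts as reachable are all assumptions for this example.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10):
    """Return True if a HEAD request to the URL succeeds, False otherwise."""
    try:
        # Building the request inside the try block also catches malformed URLs.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, DNS failures, timeouts and invalid URLs all count as unreachable.
        return False


def find_invalid_urls(urls, check=is_reachable):
    """Return the URLs that fail the check, so they can be marked as invalid."""
    return [url for url in urls if not check(url)]
```

Some servers reject HEAD requests even though the page exists, so a real implementation would probably retry with GET, and only mark a URL as invalid after several failed checks.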


Continue reading "New PostgreSQL pg_docbot is live"