Remove duplicates from Huginn event stream

Posted by ads' corner on Wednesday, 2020-10-21
Posted in [Huginn][Linux]

One of the things I’m using Huginn for is monitoring Twitter for keywords. Some of them tend to appear in pairs; for example, people like to tag Tweets about PostgreSQL with both #postgresql and #postgres. When I was using IFTTT, this always created two emails, one for each hashtag. With Huginn I can deduplicate the events and only get notified about the first occurrence.

I need the following agents for this scenario: Twitter Search Agent, Trigger Agent, De Duplication Agent, and Email Agent.

Let’s look at the details of the scenario.

Search for Tweets: Twitter Search Agent

Huginn has a couple of different Twitter agents, and two of them are for searching content. The Twitter Stream Agent is useful if the matching Tweets occur frequently, and the Twitter Search Agent is useful if the Tweets are rare. I’m using the latter.

You also need to authenticate the Twitter service before you can search for Tweets.

In my scenario I search for new Tweets every 30 minutes, and keep the events around for 7 days. This allows me to debug the scenario, and re-emit events into the stream. The options field is:

{
  "search": "\"#postgresql\"",
  "expected_update_period_in_days": "365",
  "result_type": "recent"
}

The hashtag is quoted because I’m searching for the hashtag #postgresql, not the plain word postgresql. The expected_update_period_in_days option specifies how often this agent expects to find a Tweet before it marks itself as not working. I hope that a year is sufficient …

A second Twitter Search Agent is looking for #postgres Tweets, with a similar setup.
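
As a sketch of what "a similar setup" might look like, the options of the second agent could be the same as above with only the search term changed. This assumes the author kept the other two values identical, which the post does not state explicitly:

```json
{
  "search": "\"#postgres\"",
  "expected_update_period_in_days": "365",
  "result_type": "recent"
}
```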

Filter Retweets: Trigger Agent

I’m not interested in Retweets, therefore I filter them out. This blog post describes in detail how I do that.
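
For readers who do not want to follow the link, here is one possible way to drop Retweets with a Trigger Agent. This is only an illustration and not necessarily the rule set from the linked post: it assumes that Retweets can be recognized by a full_text starting with "RT @", and it uses a negated regex rule so that only non-matching events pass. With keep_event set to true, the original Tweet payload is passed on unchanged.

```json
{
  "expected_receive_period_in_days": "365",
  "keep_event": "true",
  "rules": [
    {
      "type": "!regex",
      "value": "^RT @",
      "path": "full_text"
    }
  ]
}
```

A rule based on the text prefix is a simple heuristic; checking a dedicated field of the Tweet payload may be more robust, depending on what the search agent emits.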

Deduplicate the Tweets: De Duplication Agent

After searching and filtering out the Retweets, I have a number of events which need to be deduplicated. I feed both the #postgresql and the #postgres event streams into the De Duplication Agent as sources. As the deduplication property I’m using the full text of the Tweet: {{full_text}}. The lookback option specifies how many past events the agent should check for matches. In my case this is set to 1000, which should be sufficient, as the duplicate Tweets always arrive close together. And finally, expected_update_period_in_days is set to 365, like in the previous agents.
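
Put together, the options described in this section could look roughly like the following. The values are the ones mentioned in the text; the key names property and lookback are written here as I understand the De Duplication Agent's options, so double-check them against the agent's description in your Huginn instance:

```json
{
  "property": "{{full_text}}",
  "lookback": "1000",
  "expected_update_period_in_days": "365"
}
```

Because the property is the full Tweet text, a Tweet found by both searches yields the same value twice, and only the first event is passed on.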

Send notification: Email Agent

The last step is sending a notification once a Tweet passes through the entire chain of agents, and is not a duplicate. The options:

{
  "subject": "#postgresql/#postgres Mention",
  "expected_receive_period_in_days": "365",
  "content_type": "text/html",
  "body": "Tweet:<br/><br/>Link: <a href=\"https://twitter.com/{{ user.screen_name }}/status/{{ id_str }}\">twitter.com/{{ user.screen_name }}/status/{{ id_str }}</a><br/><br/>User: <a href=\"https://twitter.com/{{ user.screen_name }}\">{{ user.screen_name }}</a><br/>Username: {{ user.screen_name }}<br/>Username Description: {{ user.name | strip_html }}<br/>User Description: {{ user.description | strip_html }}<br/>Followers: {{ user.followers_count }}<br/>Following: {{ user.friends_count }}<br/><br/>{{ full_text | strip_html }}"
}

Summary

Here is the event flow for the entire scenario:

[Image: Huginn Deduplication]

The two disabled Email Agents are for debugging purposes: I can enable them to get notified about single events from each stream.

