How to Fix My Mistake That Blocked Campaign Parameters


In my attempt to reduce bad hits to my site, I accidentally blocked legitimate traffic.

Do you have those moments where you discover something and want to dig into it? Today was that day for me. Although what I found wasn't new, I still want to share what I learned so that I won't forget it later. In this post, I'll walk you through what I found, why it was broken, and what I did to fix it.

Checking on my website.
As a regular precaution, I check on the health of my website through Google's Search Console. On the dashboard, I noticed a few 404 errors, which isn't uncommon. As usual, I checked to see what they were so I could either correct or ignore them.

The first couple were the ones I usually see. I had a calendar on my site years ago and, for some unknown reason, clients still try to hit it. In fact, the bad agents seem to only include the date parameters. It's odd and makes no sense, but there's not much I can do about it.

The second was a bad tag from one of my articles that is stuck in some far-off database. I ignore that error, as the 404 there works as intended. But the link that caused the third 404 error was new:

https://www.reids4fun.com/?utm_medium=referral&utm_source=unsplash

The link looked legitimate, although it wasn't something I generated. I'd recently joined Unsplash and wondered if the link came from them. Although I couldn't trace the origin, it sparked enough interest that I did a little digging.

Campaigns, parameters, and ads.
Doing some searching, I learned that marketers use the "utm_" prefix on query parameters to track clicks. Google calls it custom campaigns, and content creators use it to know where a link comes from. Being Google, they ham up the sales and ads side of things.
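To make the prefix concrete, here is a minimal sketch of pulling the utm_ values out of a link. The utm_params helper is a made-up name for illustration, not something from my site, and it skips URL decoding to stay short.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the utm_* parameters from a URL's query string.
# It does not URL-decode keys or values.
sub utm_params {
    my ($url) = @_;
    my %utm;
    my ($query) = $url =~ /\?([^#]*)/;   # text between '?' and any '#'
    return \%utm unless defined $query;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && $key =~ /^utm_/;
        $utm{$key} = defined $value ? $value : '';
    }
    return \%utm;
}

my $utm = utm_params('https://www.reids4fun.com/?utm_medium=referral&utm_source=unsplash');
print "$_=$utm->{$_}\n" for sort keys %$utm;
```

Run against the link from Search Console, this prints utm_medium=referral and utm_source=unsplash, which is all the tracking information the link carries.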

Not being a big marketer, I wasn't familiar with the UTM standard, although I was aware of campaigns. You can configure them in Google Analytics, but given I've not bothered to promote my site, I never learned how they work. What surprised me was that someone would use them when referring to my site.

Digging into my logs, I discovered that they weren't anything new. I use FeedBurner to syndicate my blog and it adds them as well. It makes sense, as Google bought FeedBurner and at one time was using it to sell AdWords. I'm guessing Unsplash did something similar, but I wasn't able to replicate it from their website.

Now, FeedBurner adds the UTM parameters to each blog post link. As I ignore parameters passed to articles, those links worked fine. But using UTM parameters on my home page is a different story. There, any unknown actions get the 404 treatment. I now had a problem.

Duplicate content and all that jazz.
You see, I made the fatal decision years ago to send any unknown actions to my index page. The action router is pretty complex, and I was often changing, and breaking, things as I messed with it. An easy fix was to print the main page when something didn't route right.

The problem, though, is that I wasn't redirecting or setting a canonical page. That meant each query string became its own unique page, even though the content was the same. As I learned about SEO, I discovered that this was bad.

The fix was to set canonical links, redirect old links, and mark content as not found when things were wrong. This worked well as I was starting to drop all those old query parameters. I still needed to deal with them, but in general they redirected to the new URLs. Except for the main page.
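For anyone unfamiliar with canonical links, here is a rough sketch of the idea: every variant of a URL announces one preferred address in a link tag in the page head. The canonical_link helper and the way it drops the query string are assumptions for illustration, not the code my site actually runs.

```perl
use strict;
use warnings;

# Hypothetical helper: build a canonical <link> tag for a request by dropping
# the query string and fragment. $base_url is assumed to be the site root
# without a trailing slash. No HTML escaping is done, so this sketch only
# suits paths that are already safe to print.
sub canonical_link {
    my ($base_url, $request_uri) = @_;
    (my $path = $request_uri) =~ s/[?#].*//s;   # keep only the path
    $path = '/' unless length $path;
    return qq{<link rel="canonical" href="$base_url$path">};
}

print canonical_link('https://www.reids4fun.com',
                     '/?utm_medium=referral&utm_source=unsplash'), "\n";
```

With the tag in place, a search engine that reaches the page through a tracking link still sees the plain home page URL as the preferred one.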

To fix the duplicate content problem, I changed the default behavior. Now, only the primary link would print the main page. Any other link would dump you into a 404 page. Although that was a good assumption at the time, it was now dumping those UTM links into a black hole.

An external driver.
Although I was driven by the desire to end my duplicate page issue, I had another problem. Years ago, spammers figured out how to add comments to my site. Using different parameters, they created unique links to try to boost the rank of other sites.

They also leveraged the behavior of a couple of my scripts to do the same thing without posting a comment. Although it didn't work as they intended, it was too late. The spammers had already shoved those URLs into bot code. I still get random queries using malformed and invalid links because of that. Dumping them to 404s is the best way to avoid giving them anything.

While adding redirects and not found pages, I also cleaned up my action router. That was when I introduced the behavior that broke the UTM links. Here is the relevant code to give you an idea of what I was doing.

our $ReqURL = $pageurl . $ENV{REQUEST_URI};
# get action
our $requesturi = $ENV{REQUEST_URI};
our $raction = $requesturi;
# do some clean-up
$raction =~ s/\/+/\//g; # reduce all slashes to a single one
chop $raction if $raction =~ /[^\/]\/$/; # remove trailing slash
# take action
if ($raction ne '/') {
  # View article
  if ($raction =~ /^\/(\d+)/) {
    $info{'id'} = $1; viewnews();
  } else {
    # deal with no action found...
    notfound(); # 404
  }
} else {
  # deal with no action passed...
  if ($ReqURL eq "$pageurl/") {
    print_main(); # print main if no action
  } else {
    notfound(); # 404
  }
}

Close, but still needs fixing.
Although my routine worked, I was now breaking legitimate traffic. The real problem was my use of the REQUEST_URI environment variable. By default, the server includes the query string in it. For reference, here's its value using the UTM link above:

/?utm_medium=referral&utm_source=unsplash

In the earlier code, this value isn't equal to '/', so it gets treated as an action and, since no action matches, gives you a 404 error. That wasn't intended, so an easy fix is to remove the query string from the $raction variable.
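As a small illustration of what the server hands over, the value can be split at the first question mark into a path and a query. The split_request_uri helper below is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: separate the path from the query string in a
# REQUEST_URI-style value. The query is undef when there is no '?'.
sub split_request_uri {
    my ($request_uri) = @_;
    my ($path, $query) = split /\?/, $request_uri, 2;
    $path = '/' unless defined $path && length $path;
    return ($path, $query);
}

my ($path, $query) = split_request_uri('/?utm_medium=referral&utm_source=unsplash');
print "path:  $path\n";
print "query: $query\n";
```

The path comes back as a lone slash, which is what the router should be comparing against, while the campaign data stays in the query part.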

With that fixed, I took a look at the test I did when there was no action passed. As written, I would still send the user to a 404. But I don't need to do that. Instead, I should redirect to the canonical, aka non-duplicate, version of the page.

Here is the fixed version of the code:

our $ReqURL = $pageurl . $ENV{REQUEST_URI};
# get action
our $requesturi = $ENV{REQUEST_URI};
our $raction = $requesturi;
# do some clean-up
$raction = (split /\?/, $raction)[0]; # remove query
$raction =~ s/\/+/\//g; # reduce all slashes to a single one
chop $raction if $raction =~ /[^\/]\/$/; # remove trailing slash
# take action
if ($raction ne '/') {
  # View article
  if ($raction =~ /^\/(\d+)/) {
    $info{'id'} = $1; viewnews();
  } else {
    # deal with no action found...
    notfound(); # 404
  }
} else {
  if ($raction ne $requesturi) {
    # someone put in variables that weren't actions, redirect to base
    redirecturl("$pageurl$raction");
  } else {
    # the else keeps the main page from also printing after a redirect
    print_main(); # print main if no action
  }
}

Now, anyone legitimately hitting my site will still reach it. The UTM campaign is still captured, as I can see it in my logs, so nothing is lost there. And the spammers aren't getting anything either, as their links now just look like they're sending people to my home page.
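One way to check the branches without touching the live site is to mirror them in a function that returns a label instead of printing a page. This is only a sketch: route_for and its labels are invented for the test, and the sample paths other than the Unsplash link are made up.

```perl
use strict;
use warnings;

# Hypothetical pure function mirroring the router's branches. It returns a
# label instead of calling viewnews(), notfound(), redirecturl(), or print_main().
sub route_for {
    my ($request_uri) = @_;
    my $action = (split /\?/, $request_uri)[0];   # remove query
    $action = '' unless defined $action;          # empty request
    $action =~ s{/+}{/}g;                         # collapse repeated slashes
    chop $action if $action =~ m{[^/]/$};         # remove trailing slash
    if ($action ne '/') {
        return $action =~ m{^/(\d+)} ? "article $1" : 'notfound';
    }
    # Home page: redirect when the raw request was anything but a bare slash.
    return $action ne $request_uri ? 'redirect' : 'main';
}

for my $uri ('/',
             '/?utm_medium=referral&utm_source=unsplash',
             '/123/some-title?utm_source=feedburner',
             '/calendar?month=5') {
    printf "%-45s %s\n", $uri, route_for($uri);
}
```

The four sample requests come back as main, redirect, article 123, and notfound, which matches what I wanted: campaign links to the home page get redirected, articles ignore the extra parameters, and junk still lands on a 404.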

Easy enough. The research took longer than the fix.


