How to Fix My Mistake That Blocked Campaign Parameters
In my attempt to reduce bad hits to my site, I accidentally blocked legitimate traffic.
Do you have those moments where you discover something and want to dig into it? Today was one of those days for me. Although what I found wasn't new, I still want to share what I learned so I won't forget it later. In this post, I'll walk you through what I found, why it was broken, and what I did to fix it.
Checking on my website.
As a regular precaution, I check on the health of my website through Google's Search Console. Clicking into the dashboard, I noticed a few 404 errors, something that isn't uncommon. As usual, I checked to see what they were so I could either correct or ignore them.
The first couple were the ones I usually see. I had a calendar on my site years ago and, for some unknown reason, clients still try to hit it. In fact, the bad agents seem to only include the date parameters. It's odd and makes no sense, but there's not much I can do about it.
The second was a bad tag from one of my articles that is stuck in some far-off database. I ignore that error, as the 404 is the intended behavior. But, the link that caused the third 404 error was new:
https://www.reids4fun.com/?utm_medium=referral&utm_source=unsplash
The link looked legitimate, although not something I generated. I’d recently joined Unsplash and wondered if the link came from them. Although I couldn’t trace the origin, it sparked enough interest that I did a little digging.
Campaigns, parameters, and ads.
Doing some searching, I learned that marketers use the “utm_” prefix to track clicks. Google calls the feature custom campaigns, and content creators use it to know where a link comes from. Being Google, they play up the sales and ads side of things.
Not being a big marketer, I wasn’t familiar with the UTM standard, although I was aware of campaigns. You can configure them in Google Analytics, but given I’ve not bothered to promote my site, I never learned how they work. What surprised me was that someone would use them when referring to my site.
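For reference, a fully tagged campaign link usually carries at least a source, a medium, and a campaign name. A made-up example would look something like this:
https://www.example.com/?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale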
Digging into my logs, I discovered that they weren’t anything new. I use Feed Burner to syndicate my blog and it adds them as well. It makes sense, as Google bought Feed Burner and at one time was using it to sell AdWords. I’m guessing Unsplash did something similar, but I wasn’t able to replicate it from their website.
Now, Feed Burner adds the UTM parameters to each blog post. As I ignore parameters passed to articles, those links worked fine. But, using UTM on my home page is a different story. There, any unknown actions get the 404 treatment. I now had a problem.
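To make that concrete, here are made-up examples of both cases (the exact parameter values FeedBurner adds will vary):
https://www.reids4fun.com/1234?utm_source=feedburner&utm_medium=feed
https://www.reids4fun.com/?utm_source=feedburner&utm_medium=feed
The first still routed to the article, since I ignore extra parameters there. The second fell into the 404 handler.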
Duplicate content and all that jazz.
You see, years ago I made the fatal decision to serve my index page for any unknown action. The action router is pretty complex, and I was often changing, and breaking, things as I messed with it. An easy fix was to print the main page when something didn't route right.
The problem, though, was that I wasn't redirecting or setting a canonical page. That meant each query string became its own unique page, even though the content was the same. As I learned about SEO, I discovered that this was bad.
The fix was to set canonical links, redirect old links, and return a not-found page when things were wrong. This worked well, as I was already starting to drop all those old query parameters. I still needed to deal with them, but in general they redirected to the new URLs. Except for the main page.
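The canonical part is just a link element in the page head. As a minimal sketch, assuming a $pageurl variable that holds the site's base address (the same variable the router code below uses), printing one from a Perl CGI script looks something like this:
# minimal sketch: print a canonical link element so every query-string
# variant of a page points search engines at a single address
sub print_canonical {
    my ($path) = @_;
    our $pageurl;    # package global holding the site's base address
    print qq{<link rel="canonical" href="$pageurl$path">\n};
}
The idea is to call it with the cleaned-up action path while printing the page head.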
To fix the duplicate content problem, I changed the default behavior. Now, only the primary link would print the main page; any other link would dump you into a 404 page. It seemed like a safe assumption at the time, but it was now dumping those UTM links into a black hole.
An external driver.
Although driven by the desire to end my duplicate page issue, I had another problem. Years ago, spammers figured out how to add comments to my site. Using different parameters, they created unique links to try to boost the rank of other sites.
They also leveraged the behavior of a couple of my scripts to do the same thing without posting a comment. Although it didn't work as they intended, it was too late. The spammers had already shoved those URLs into bot code. I still get random queries using malformed and invalid links because of that. Dumping them to 404s is the best way to avoid giving them anything.
While adding redirects and not found pages, I also cleaned up my action router. That was when I introduced the behavior that broke the UTM links. Here is the relevant code to give you an idea of what I was doing.
our $ReqURL = $pageurl . $ENV{REQUEST_URI};

# get action
our $requesturi = $ENV{REQUEST_URI};
our $raction = $requesturi;

# do some clean-up
$raction =~ s/\/+/\//g;                  # reduce all slashes to a single one
chop $raction if $raction =~ /[^\/]\/$/; # remove trailing slash

# take action
if ($raction ne '/') {
    # View article
    if ($raction =~ /^\/(\d+)/) {
        $info{'id'} = $1;
        viewnews();
    } else {
        # deal with no action found...
        notfound(); # 404
    }
} else {
    # deal with no action passed...
    if ($ReqURL eq "$pageurl/") {
        print_main(); # print main if no action
    } else {
        notfound(); # 404
    }
}
Close, but still needs fixing.
Although my routine worked, I was now breaking legitimate traffic. The real problem was the REQUEST_URI environment variable. By default, the server includes the query string in it. For reference, here's its value for the UTM link above:
/?utm_medium=referral&utm_source=unsplash
In the earlier code, this would be treated as an action and, since it didn't match any route, would give you a 404 error. That wasn't my intent, so the easy fix was to remove the query string from the $raction variable.
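As a quick sanity check, here's a throwaway snippet (not part of the router) showing what that clean-up does to the value above:
my $raction = '/?utm_medium=referral&utm_source=unsplash';
$raction = (split /\?/, $raction)[0];    # keep everything before the '?'
print "$raction\n";                      # prints '/'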
With that fixed, I took a look at the test I did when there was no action passed. As written, I would still send the user to a 404. But, I don’t need to do that. Instead, I should redirect to the canonical, aka non-duplicate, version of the page.
Here is the fixed version of the code:
our $ReqURL = $pageurl . $ENV{REQUEST_URI};

# get action
our $requesturi = $ENV{REQUEST_URI};
our $raction = $requesturi;

# do some clean-up
$raction = (split ('\?', $raction))[0];  # remove query
$raction =~ s/\/+/\//g;                  # reduce all slashes to a single one
chop $raction if $raction =~ /[^\/]\/$/; # remove trailing slash

# take action
if ($raction ne '/') {
    # View article
    if ($raction =~ /^\/(\d+)/) {
        $info{'id'} = $1;
        viewnews();
    } else {
        # deal with no action found...
        notfound(); # 404
    }
} else {
    # someone put in variables that weren't actions, redirect to base
    if ($raction ne $requesturi) {
        redirecturl("$pageurl$raction");
    }
    print_main(); # print main if no action
}
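One note: redirecturl isn't shown above. A bare-bones sketch of that kind of helper for a plain CGI script, assuming a permanent redirect is what you want, would be something like this (this isn't my exact routine, just the general idea):
# bare-bones sketch of a CGI redirect helper: send a permanent redirect
# and stop, rather than printing any page content
sub redirecturl {
    my ($url) = @_;
    print "Status: 301 Moved Permanently\n";
    print "Location: $url\n\n";
    exit;
}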
Now, anyone legitimately hitting my site will still reach it. The UTM campaign is still captured (I see it in my logs), so nothing is lost there. And the spammers aren't getting anything either, as it just looks like they're sending people to my home page.
Easy enough. The research took longer than the fix.