I was wrong about robots.txt

Original link: https://evgeniipendragon.com/posts/i-was-wrong-about-robots-txt/

I learned a valuable lesson the hard way about `robots.txt` and its impact on social media previews. In the name of data privacy, I initially blocked all crawlers, only to find that my LinkedIn posts lost their link previews and engagement dropped sharply. Using LinkedIn's Post Inspector, I discovered that my `robots.txt` was blocking LinkedInBot, which prevented it from fetching the Open Graph (OG) meta tags needed to generate a preview. The OG protocol lets a web page become a rich object on social networks and requires a title, type, image, and URL. To fix this, I updated my `robots.txt` to explicitly allow LinkedInBot. The experience highlighted the importance of thoroughly testing changes and understanding their scope of impact: in pursuing data privacy, I had unintentionally hurt my content's visibility. The mistake led me to discover Open Graph and the LinkedIn Post Inspector, and to a renewed appreciation for thinking through unintended consequences.


Original article

Recently, I wrote an article about my journey learning about robots.txt and its implications for the rights to the data I publish on my blog. I was confident that I wanted to ban all crawlers from my website. It turned out there was an unintended consequence I had not accounted for.

After I changed my robots.txt file, I started seeing that my LinkedIn posts no longer showed a preview of the linked article. I was not sure what the issue was at first, since it had worked just fine before. On top of that, I noticed that LinkedIn's algorithm had started serving my posts to fewer and fewer connections. I was a bit confused, thinking it might be a temporary problem, but over the next two weeks the missing post previews never came back.

This is what my LinkedIn posts used to look like: no link preview and little engagement with the post

After a quick search, I found a tool called LinkedIn Post Inspector, which shows everything you might want to know about a link you are about to share on the platform. I plugged my recent article into the tool, and it revealed why I could no longer see the previews: my robots.txt file contained a directive that did not allow the LinkedIn bot to crawl my pages. This was the error message I got:

Fair enough! Thinking about it now, it makes complete sense, and I should have seen it ahead of time. Whenever you post a link on LinkedIn or any other social media platform, the platform needs to request the page behind that link. Its bot fetches the page to read the meta tags needed to build the preview. Those are known as Open Graph meta tags, and they come from the Open Graph Protocol, which was originally created at Facebook.

According to the OGP website,

The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.

Only a few tags are required to implement this for your web resource, but the protocol is highly extensible and allows all sorts of media to be used as meta information about your page.

To turn your web pages into graph objects, you need to add basic metadata to your page. We’ve based the initial version of the protocol on RDFa which means that you’ll place additional <meta> tags in the <head> of your web page. The four required properties for every page are:

  • og:title - The title of your object as it should appear within the graph, e.g., “The Rock”.
  • og:type - The type of your object, e.g., “video.movie”. Depending on the type you specify, other properties may also be required.
  • og:image - An image URL which should represent your object within the graph.
  • og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "https://www.imdb.com/title/tt0117500/".

This simple yet powerful protocol is what makes the posts across all of the Internet more presentable and informative.
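To make this concrete, here is a minimal sketch of what a preview bot effectively does once it is allowed to fetch a page: parse the HTML and collect the `og:` meta tags. This uses Python's standard-library `html.parser`; the sample markup and its values are illustrative, not taken from LinkedIn's actual implementation.

```python
from html.parser import HTMLParser

class OGTagParser(HTMLParser):
    """Collects <meta property="og:..."> tags, the way a preview bot would."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        # Keep only Open Graph properties that carry a content value.
        if prop.startswith("og:") and "content" in attrs:
            self.og[prop] = attrs["content"]

# Illustrative page head containing the four required OG properties.
page = """
<head>
  <meta property="og:title" content="I was wrong about robots.txt" />
  <meta property="og:type" content="article" />
  <meta property="og:image" content="https://example.com/preview.png" />
  <meta property="og:url" content="https://evgeniipendragon.com/posts/i-was-wrong-about-robots-txt/" />
</head>
"""

parser = OGTagParser()
parser.feed(page)
print(parser.og["og:title"])  # I was wrong about robots.txt
print(sorted(parser.og))      # ['og:image', 'og:title', 'og:type', 'og:url']
```

If robots.txt blocks the bot, none of this ever runs: the bot never fetches the page, so it never sees the tags, and the preview simply cannot be built.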

To turn this around, I updated my robots.txt to allow LinkedInBot to crawl my resources. If I want to start posting on other social media sites and see previews for those posts, I will need to allow their bots here as well. My current configuration looks like this:

User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
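A configuration like this can be sanity-checked locally before deploying, using Python's standard-library `urllib.robotparser` (a sketch; `SomeOtherBot` is just a stand-in for any other crawler):

```python
from urllib.robotparser import RobotFileParser

# The same rules as above: allow LinkedInBot, disallow everyone else.
robots_txt = """\
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

url = "https://evgeniipendragon.com/posts/i-was-wrong-about-robots-txt/"
print(rp.can_fetch("LinkedInBot", url))   # True  - previews work again
print(rp.can_fetch("SomeOtherBot", url))  # False - everyone else stays blocked
```

This is exactly the check that would have caught my original mistake: with the old blanket `Disallow: /`, `can_fetch` returns False for LinkedInBot too.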

It turns out that drastic measures like blocking all crawlers can make your content's presentation suffer. What I missed when I implemented that change was thoroughly testing the impact of blocking all crawler traffic. Running into the issue led me to learn more about the Open Graph Protocol and about tools like LinkedIn Post Inspector, which told me exactly what the problem was.

When working on any feature - no matter how small it is - make sure you understand the domain in which you are making changes. As practice shows, sometimes things do not work out that way: initially, I did not connect the dots between OGP and my robots.txt prohibitions.

Sometimes you have to break a few eggs before you make an omelet. Well… you always have to break some eggs to make an omelet… You get the point.
