
Problems with Fastly Content Delivery Network in Southbank POP

randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Problems with Fastly Content Delivery Network in Southbank POP

This is a repost from another thread, because I'm not sure it's the same problem as the one discussed there.

 

We've had lots of problems in our office since late December, connecting via a BNG in Southbank.

Currently connected to

lo0.central10.psb-bng03.plus.net

I'd be interested to see if anyone is having the same problem as us, and we have a test for it too:

Run either of these commands about 20 times:

Windows PowerShell

invoke-webrequest http://rubygems.org/gems/hirb-0.7.3.gem

Linux

curl -I http://rubygems.org/gems/hirb-0.7.3.gem

One or more of those should fail with either of the following:

curl: (56) Recv failure: Connection reset by peer
invoke-webrequest : An error occurred while sending the request.
At line:1 char:1
+ invoke-webrequest http://rubygems.org/gems/hirb-0.7.3.gem
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Method: GET, Re...rShell/6.0.0
}:HttpRequestMessage) [Invoke-WebRequest], HttpRequestException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand
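
For anyone who would rather not run the command by hand 20 times, here is a minimal sketch of a loop that counts failed runs. The helper name `count_failures` is made up for illustration, and it only relies on the command's exit status (curl exits non-zero on a connection reset).

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Runs CMD N times, discarding its output, and prints how many runs
# exited with a non-zero status.
count_failures() {
    n=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}

# Example (needs network access):
#   count_failures 20 curl -sS -I http://rubygems.org/gems/hirb-0.7.3.gem
```

A result above 0 on one connection and 0 on another would match what we are seeing, though a single run of 20 is a small sample.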

In terms of what's happening, the remote side is sending TCP resets unexpectedly:

Screenshot from 2017-01-23 12-23-00.png
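
If you want to capture the resets yourself, a sketch along these lines should work with tcpdump. The helper name `rst_filter` and the interface name `eth0` are assumptions, and the address in the example is simply the one rubygems.org resolved to in traces in this thread, so it may differ for you.

```shell
#!/bin/sh
# rst_filter HOST
# Prints a pcap filter expression that matches TCP segments with the
# RST flag set, travelling to or from HOST.
rst_filter() {
    printf 'host %s and tcp[tcpflags] & tcp-rst != 0\n' "$1"
}

# Example (needs root; eth0 is an assumed interface name):
#   tcpdump -n -i eth0 "$(rst_filter 151.101.128.70)"
```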

We're confident it's not a problem with the RubyGems website: we download gems like this all the time from Amazon Web Services without trouble, and there are no reports of problems on Twitter or complaints from software developers in other forums.

The site's hosted on the Fastly CDN, so we've spoken to them. Their initial suggestion is an Equal-Cost Multipath (ECMP) routing issue somewhere on this network.

The problem appears to be specific to Fastly.

We've tested other CDNs without finding issues, including:

  • Amazon CloudFront
  • Akamai
  • Microsoft Azure
  • Limelight
  • CloudFlare

 

However, we're concerned it might be a network issue on PlusNet nevertheless, and possibly due to some change in Southbank around mid December.

 

I'd be interested if others could do the same test and report the same results.

I personally do not have this issue at home in the Colindale POP.

 

 

12 REPLIES
dvorak
Moderator
Posts: 29,473
Thanks: 6,623
Fixes: 1,482
Registered: ‎11-01-2008

Re: Problems with Fastly Content Delivery Network in Southbank POP

Tried on PN and non-PN, both working ok:

 

r:~$ curl -I http://rubygems.org/gems/hirb-0.7.3.gem
HTTP/1.1 200 OK
x-amz-id-2: OCF/i2auRZ33jdfCJEEhZSx7fdc3UD+kd8W6bqsVHZjz8awMzfhGXq02nMFUYzuO/FjvrNffRc0=
x-amz-request-id: 3DE229AA8535D09F
x-amz-replication-status: COMPLETED
Last-Modified: Tue, 02 Feb 2016 05:19:13 GMT
ETag: "2d26d51bebafd812563d20c86f935114"
x-amz-version-id: u3A8QNi4S7IdEddn9GzYJU3BnGWqgFJc
Content-Type: binary/octet-stream
Via: 1.1 varnish
Fastly-Debug-Digest: dd0310b15f68508cf3da06a06bc052de40faf5af72967bba3275d5cbcf741da7
X-Backend: F_S3 54.231.169.21:443, fastlyshield--shield_ssl_cache_sea1923_SEA 199.27.74.23:443
Content-Length: 46080
Accept-Ranges: bytes
Date: Tue, 24 Jan 2017 10:59:27 GMT
Via: 1.1 varnish
Age: 3617
Connection: keep-alive
X-Served-By: cache-sea1923-SEA, cache-ams4430-AMS
X-Cache: HIT, HIT
X-Cache-Hits: 1, 1
X-Timer: S1485255567.768037,VS0,VE143
Vary: Fastly-SSL
Server: RubyGems.org

PN:

e$ curl -I http://rubygems.org/gems/hirb-0.7.3.gem
HTTP/1.1 200 OK
x-amz-id-2: OCF/i2auRZ33jdfCJEEhZSx7fdc3UD+kd8W6bqsVHZjz8awMzfhGXq02nMFUYzuO/FjvrNffRc0=
x-amz-request-id: 3DE229AA8535D09F
x-amz-replication-status: COMPLETED
Last-Modified: Tue, 02 Feb 2016 05:19:13 GMT
ETag: "2d26d51bebafd812563d20c86f935114"
x-amz-version-id: u3A8QNi4S7IdEddn9GzYJU3BnGWqgFJc
Content-Type: binary/octet-stream
Via: 1.1 varnish
Fastly-Debug-Digest: dd0310b15f68508cf3da06a06bc052de40faf5af72967bba3275d5cbcf741da7
X-Backend: F_S3 54.231.169.21:443, fastlyshield--shield_ssl_cache_sea1926_SEA 199.27.74.26:443
Content-Length: 46080
Accept-Ranges: bytes
Date: Tue, 24 Jan 2017 10:59:30 GMT
Via: 1.1 varnish
Age: 3619
Connection: keep-alive
X-Served-By: cache-sea1926-SEA, cache-lcy1151-LCY
X-Cache: HIT, HIT
X-Cache-Hits: 37, 1
X-Timer: S1485255570.325458,VS0,VE129
Vary: Fastly-SSL
Server: RubyGems.org

 

Customer / Moderator
randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Re: Problems with Fastly Content Delivery Network in Southbank POP

Did you try just the once, or repeatedly?

 

Tends to happen approx 1/20 times.

 

Our use case is using bundler in a Ruby project, which can download about 30 of those files, but will abort on a broken download.
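
To put rough numbers on that: assuming each download fails independently with probability 1/20 (an assumption, since the failures may well be correlated), the chance that at least one of 30 downloads fails is 1 - (19/20)^30. A small sketch using awk, with a made-up helper name:

```shell
#!/bin/sh
# p_any_failure P N
# Prints the probability that at least one of N independent attempts
# fails, given a per-attempt failure probability P.
p_any_failure() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", 1 - (1 - p) ^ n }'
}

# p_any_failure 0.05 30   # prints 0.785
```

So under that assumption roughly three out of four bundler runs would abort, which is in line with how disruptive this has been for us.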

randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Re: Problems with Fastly Content Delivery Network in Southbank POP

Further update:

 

Our thinking was that, if some device along the path is doing per-packet load balancing that results in ECMP rehashing, then consolidating the inbound path towards us onto one router along your transit path might resolve the issue (at least temporarily, so that we could definitively point to that as the problem). However, if you're still experiencing TCP RSTs, we're unable to unequivocally point out the culprit. That said, the pcaps you attached still seem to point to per-packet load balancing somewhere within your network.

To give you some context in what we think is happening: a cache in a Fastly datacenter will establish a client session and expects to receive all request flow during that session. When a different cache receives a request for that same data flow midstream, it will respond with a RESET because it doesn't have any context for these packets and no way of knowing what to do with them. So it's not necessarily that they are being received out of order, but that they are being hashed onto different cache servers who have not "agreed" upon a TCP connection for that information.

By design, Fastly's ECMP decisions are made using a 5-tuple hash by a stateless but sticky load balancing function in an Ethernet switch (other fields such as DSCP/ECN are not used to determine hashing).

With our stateless load balancing mechanism, TCP packets for a given session that arrive at our network on two different paths get sent to two different servers, resulting in the RST you see. Typically this happens when a device on the source network is doing per-packet load balancing or some similar behavior. We've seen a couple issues in the past where this behavior appeared to be caused by a bug in a firewall as well. Once the first packet in the TCP stream is sent down a different link anywhere in the network, it can end up leaving on a different one of your transit circuits, and the paths the two parts of the stream take diverge further from there on.

Can you please try accessing the same assets behind a different network and see if you notice the same thing? We've seen this before with our customers where they don't see the same issues when trying to load the same objects behind a different ISP. This would at least provide a strong case that the issue is within your local provider's mesh and allow you to reach out to them with evidence of this.

As said before, we don't see this issue outside of the Southbank POP on identical network hardware and configurations.

Has something like this happened before?
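
As a toy illustration of the 5-tuple hashing described above (not Fastly's actual implementation, which runs in switch hardware), the sketch below maps a connection's 5-tuple to a cache index using a CRC from `cksum`. The function name, the CRC, and the modulo step are all assumptions chosen for simplicity.

```shell
#!/bin/sh
# pick_cache SRC_IP SRC_PORT DST_IP DST_PORT PROTO N_CACHES
# Prints a cache index in [0, N_CACHES) derived from a CRC of the
# 5-tuple. Illustrative only: real ECMP hashing differs.
pick_cache() {
    printf '%s|%s|%s|%s|%s' "$1" "$2" "$3" "$4" "$5" |
        cksum |
        awk -v n="$6" '{ print $1 % n }'
}

# The same 5-tuple always yields the same index:
#   pick_cache 192.0.2.10 50000 151.101.128.70 80 tcp 8
#   pick_cache 192.0.2.10 50000 151.101.128.70 80 tcp 8
```

The point of the sketch is only that a stateless hash gives the same answer for every packet of a connection as long as the packets all reach the same hashing device. It does not model what happens when packets from one connection arrive over two different paths, which is the case Fastly suspects here.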

randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Re: Problems with Fastly Content Delivery Network in Southbank POP

Actually, turns out all these paths are visible.

 

Here's a "good" Plus.Net connection not too far away, also via the Southbank POP. The traces more or less all go through ldn-b4 at Telia.

  

Screenshots: Good 1, Good 2, Good 3, Good 4

On the bad connection, there are two distinct paths that packets can take.

Screenshots: Bad 1, Bad 2, Bad 3, Bad 4

What's the difference between the BT Global Services vs. Faraday House handoff to Telia?

We're on a fixed IP, by the way.

 

 

 

 

 

 

 

MrSilver
Pro
Posts: 550
Thanks: 82
Fixes: 9
Registered: ‎05-10-2016

Re: Problems with Fastly Content Delivery Network in Southbank POP

Hi

 

I'm not aware of anything similar on the forums, but the logic of what you pasted seems to make sense: packets arriving by a different route to their DC reach a different server, which then sends a TCP RST.

I tried two lots of tests:

 

Tracing route to 151.101.128.70 over a maximum of 30 hops

  1    14 ms     1 ms     1 ms  ZyXEL.Home [192.168.1.1]
  2    11 ms     9 ms     9 ms  lo0.central10.pcn-bng02.plus.net [195.166.130.249]
  3    10 ms    11 ms     *     411.be6.pcn-ir01.plus.net [84.93.253.75]
  4    10 ms    10 ms    10 ms  core1-BE1.colindale.ukcore.bt.net [195.99.125.132]
  5    12 ms    11 ms    11 ms  195.99.127.83
  6    12 ms    11 ms    10 ms  t2c3-et-3-1-0-0.uk-lon1.eu.bt.net [166.49.211.236]
  7    12 ms    10 ms    10 ms  213.248.82.249
  8     *        *        *     Request timed out.
  9    11 ms    11 ms     *     151.101.128.70
 10    39 ms    10 ms    10 ms  151.101.128.70

Trace complete.

 

Tracing route to 151.101.128.70 over a maximum of 30 hops

  1     3 ms     2 ms     2 ms  ZyXEL.Home [192.168.1.1]
  2    10 ms     8 ms     8 ms  lo0.central10.pcn-bng03.plus.net [195.166.130.250]
  3    10 ms    10 ms     *     411.be7.pcn-ir01.plus.net [84.93.253.83]
  4     9 ms    12 ms     9 ms  core1-BE1.colindale.ukcore.bt.net [195.99.125.132]
  5    11 ms     9 ms     9 ms  core3-hu0-8-0-0.faraday.ukcore.bt.net [195.99.127.36]
  6    10 ms     9 ms     9 ms  213.137.183.38
  7    10 ms    10 ms     9 ms  ldn-b3-link.telia.net [213.248.67.97]
  8    11 ms     9 ms     9 ms  ldn-bb2-link.telia.net [62.115.116.250]
  9    10 ms     9 ms     9 ms  ldn-b4-link.telia.net [62.115.124.201]
 10     *        *        *     Request timed out.
 11    11 ms     9 ms     9 ms  151.101.128.70

Trace complete.

Both routes (PCN, not PSB, mind) were fine for over 200 attempts on Windows 10.

 

I got my brother to do a quick trace on BT and got this

Tracing route to 151.101.0.70 over a maximum of 30 hops

  1    14 ms     2 ms     2 ms  bthub [192.168.1.254]
  2     *        *        *     Request timed out.
  3     *        *        *     Request timed out.
  4     9 ms     8 ms     8 ms  31.55.186.184
  5    11 ms     9 ms     9 ms  core3-hu0-6-0-3.faraday.ukcore.bt.net [195.99.127.194]
  6     9 ms     8 ms     8 ms  213.137.183.32
  7     9 ms     9 ms     9 ms  ldn-b3-link.telia.net [213.248.67.97]
  8     9 ms     9 ms     9 ms  ldn-bb3-link.telia.net [62.115.117.6]
  9    10 ms     9 ms    14 ms  ldn-b4-link.telia.net [62.115.124.203]
 10     *        *        *     Request timed out.
 11    11 ms    10 ms    10 ms  151.101.0.70

Trace complete.

Even your bad traces include a couple of the multi-hop Telia links. Do your Good 4 and Bad 3 show the same route?
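
To compare traces like the ones above without eyeballing them, a small sketch that boils Windows `tracert` output down to a one-line path might help. The function name `path_signature` and the file names in the example are hypothetical, and the parsing assumes the output layout shown in this thread.

```shell
#!/bin/sh
# path_signature
# Reads Windows tracert output on stdin and prints the hop addresses on
# one line, separated by " > ". Hops that timed out are shown as "*".
path_signature() {
    awk '
    /^ *[0-9]+ / {
        line = $0
        sub(/[ \t\r]+$/, "", line)          # drop trailing blanks and CR
        ip = "*"
        if (match(line, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\]?$/)) {
            ip = substr(line, RSTART, RLENGTH)
            sub(/\]$/, "", ip)              # strip the closing bracket
        }
        printf "%s%s", sep, ip
        sep = " > "
    }
    END { print "" }'
}

# Example (trace1.txt and trace2.txt are saved tracert outputs):
#   path_signature < trace1.txt
#   path_signature < trace2.txt
```

Two runs that print different lines took different routes, which makes it easier to spot which hop the paths diverge at.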

 

randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Re: Problems with Fastly Content Delivery Network in Southbank POP

Even your bad traces include a couple of the multi-hop Telia links. Do your Good 4 and Bad 3 show the same route?

 

Yes, that's right. I was being a completist more than anything. It's difficult to know which path the TCP RST came back on, but I suspect it happens when a TCP connection is initiated via either Faraday House or BT Global Services and the HTTP GET request is then sent along the other one.

randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Re: Problems with Fastly Content Delivery Network in Southbank POP

An update on this for anyone who's interested:

 

It's bounced around in PlusNet a bit, with the gist of the response being that the current multipathing is by design, though further investigation's not been ruled out (pending the usual dial test).

 

Fastly Network Engineering are continuing to investigate.

They have some nice posts explaining their architecture:

https://www.fastly.com/blog/building-and-scaling-fastly-network-part-1-fighting-fib

https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balancing-requests

randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Re: Problems with Fastly Content Delivery Network in Southbank POP

Still waiting for a dial test on ticket #142217286 after Plus Net failed to do one twice/thrice in a row.

randomvariable
Dabbler
Posts: 14
Thanks: 2
Registered: ‎23-01-2017

Re: Problems with Fastly Content Delivery Network in Southbank POP

Dial tests were completed over the weekend.

The problem's been confirmed, and the issue's been sent off for analysis with some network traces.

 

Please mention if you're also seeing the problem, by the way, as I'm curious to know how widespread it is.

If you know whether you're on WBMC Shared or Dedicated, that'd be handy (how can you tell the difference?)

Riza
Rising Star
Posts: 110
Thanks: 7
Fixes: 1
Registered: ‎24-07-2014

Re: Problems with Fastly Content Delivery Network in Southbank POP

I am on psb-bng03 and have had no problems at all for the past couple of months, though I did initially have packet loss on BNG01 (this was months ago; probably resolved by now).

 

Here is my traceroute

Tracing route to 151.101.128.70 over a maximum of 30 hops

1 <1 ms <1 ms <1 ms LINKSYSHOME [192.168.1.1]
2 12 ms 11 ms 13 ms lo0.central10.psb-bng03.plus.net [195.166.130.254]
3 12 ms 12 ms 12 ms 411.be7.psb-ir01.plus.net [84.93.253.115]
4 12 ms 12 ms 12 ms core1-BE1.southbank.ukcore.bt.net [195.99.125.130]
5 12 ms 12 ms 12 ms 195.99.127.68
6 12 ms 12 ms 12 ms 213.137.183.32
7 13 ms 12 ms 12 ms ldn-b3-link.telia.net [213.248.67.97]
8 13 ms 12 ms 12 ms ldn-bb3-link.telia.net [62.115.117.20]
9 13 ms 12 ms 12 ms ldn-b4-link.telia.net [62.115.124.203]
10 12 ms 12 ms 13 ms 149.6.9.158
11 13 ms 13 ms 13 ms 151.101.128.70

Trace complete.

 

Tracing route to rubygems.org [151.101.64.70]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms EA6400-B652 [192.168.1.1]
2 12 ms 11 ms 11 ms lo0.central10.psb-bng03.plus.net [195.166.130.254]
3 12 ms 12 ms 12 ms 411.be7.psb-ir01.plus.net [84.93.253.115]
4 12 ms 12 ms 12 ms core1-BE1.southbank.ukcore.bt.net [195.99.125.130]
5 12 ms 13 ms 12 ms 195.99.127.68
6 14 ms 12 ms 12 ms 213.137.183.38
7 12 ms 12 ms 12 ms ldn-b3-link.telia.net [213.248.67.97]
8 12 ms 45 ms 117 ms ldn-bb2-link.telia.net [62.115.116.244]
9 12 ms 12 ms 12 ms ldn-b4-link.telia.net [62.115.124.201]
10 16 ms 12 ms 13 ms 149.6.9.158
11 13 ms 13 ms 13 ms 151.101.64.70

Trace complete.

And I am on the new network:

Tracing route to ntp.plus.net [212.159.13.49]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms EA6400-B652 [192.168.1.1]
2 12 ms 11 ms 11 ms lo0.central10.psb-bng03.plus.net [195.166.130.254]
3 12 ms 12 ms 12 ms 411.be7.psb-ir02.plus.net [84.93.253.119]
4 12 ms 12 ms 12 ms be1.psb-ir01.plus.net [195.166.129.176]
5 12 ms 12 ms 12 ms cdns01.plus.net [212.159.13.49]

Trace complete.

To tell whether you are on the new network (WBMC Dedicated), run a simple trace route to ntp.plus.net: if the hop names contain 'ir' (e.g. 'ir01'), you are on the dedicated network.
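
The rule of thumb above can be scripted. This is a minimal sketch that checks trace output for a hop name of the form "-irNN.plus.net", as seen in the traces in this post. The function name is made up, and the check is only as reliable as that naming pattern.

```shell
#!/bin/sh
# is_wbmc_dedicated
# Reads traceroute/tracert output on stdin. Succeeds (exit status 0) if
# any hop name looks like "...-irNN.plus.net", fails otherwise.
is_wbmc_dedicated() {
    grep -Eq -e '-ir[0-9]+\.plus\.net'
}

# Example on Linux (needs network access):
#   traceroute ntp.plus.net | is_wbmc_dedicated && echo "WBMC Dedicated"
```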

 

I personally don't use the Fastly CDN, but I have used other providers (Google APIs, Cloudflare, Amazon) and they have been great overall on my connection.

Kelly
Hero
Posts: 5,497
Thanks: 380
Fixes: 9
Registered: ‎04-04-2007

Re: Problems with Fastly Content Delivery Network in Southbank POP

Hi!  Are you still seeing these issues?

Kelly Dorset
Ex-Broadband Service Manager
Kelly
Hero
Posts: 5,497
Thanks: 380
Fixes: 9
Registered: ‎04-04-2007

Re: Problems with Fastly Content Delivery Network in Southbank POP

Just a public follow-up.  We've replied to your ticket about this.  Let us know if dropping the MTU on your devices (rather than just on the router) helps.

Kelly Dorset
Ex-Broadband Service Manager
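
For anyone wanting to try the MTU suggestion on a Linux device, here is a hedged sketch. The helper works out the largest ping payload that fits in a given MTU, assuming a 20-byte IPv4 header with no options and an 8-byte ICMP header. The interface name `eth0` and the MTU value 1400 are placeholders, not recommendations from the ticket.

```shell
#!/bin/sh
# icmp_payload MTU
# Prints the largest ICMP echo payload that fits in one IPv4 packet of
# size MTU (20-byte IPv4 header without options + 8-byte ICMP header).
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Probe with the don't-fragment bit set (Linux iputils ping):
#   ping -M do -c 3 -s "$(icmp_payload 1500)" rubygems.org
# Temporarily lower the MTU on one device (needs root; eth0 is assumed):
#   ip link set dev eth0 mtu 1400
```

If the probe fails at one size but works at a smaller one, that gives a starting point for the device MTU to test with.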