content.json
1 lines (1 loc) · 838 KB
{"meta":{"title":"i博客","subtitle":"仰望星空前,还需脚踏实地。","description":null,"author":"vcpu.me","url":"http://vcpu.me"},"pages":[{"title":"","date":"2018-03-06T03:33:59.000Z","updated":"2018-03-06T03:33:59.000Z","comments":true,"path":"index.html","permalink":"http://vcpu.me/index.html","excerpt":"","text":"hello world"},{"title":"","date":"2017-06-22T09:22:34.000Z","updated":"2018-03-31T07:13:21.000Z","comments":true,"path":"about/index.html","permalink":"http://vcpu.me/about/index.html","excerpt":"","text":"小程一枚,坐标上海浦东, 有工匠之心,偶尔文人气质 很喜欢写博客这种分享方式,因水平有限,文章如有疏漏,请多多包涵 本博客用于分享个人技术总结和心得 搞过3年的负载均衡产品,同时也接触过审计,FW,DPI等安全设备,也开发过webrtc直播系统 做过2年的Openstack SDN网络,喜欢研究架构实现和底层技术为团队添砖加瓦,解决业务场景难题等 对于tcp/ip,socket,协议分析,协议栈、webrtc、二三层转发、linux系统、高性能、虚拟网络、ovs、bridge、vhostnet、SDN Openstack网络架构等都有些自己的看法 常年用服务端c语言开发,也会shell,python ,接触过c++,java,go等 最喜欢阅读开源代码,从中体会实现思想、提升自己,有时候读了好代码会惊起一身鸡皮疙瘩 感到遗憾的是自己还没有成为过开源软件的开发者,现在也在努力中 不常在如有事,请email 1037365462@qq.com"},{"title":"标签","date":"2017-06-05T04:50:47.000Z","updated":"2017-06-05T04:51:50.000Z","comments":false,"path":"tags/index.html","permalink":"http://vcpu.me/tags/index.html","excerpt":"","text":""},{"title":"categories","date":"2017-06-05T04:59:19.000Z","updated":"2017-06-05T05:00:31.000Z","comments":false,"path":"categories/index.html","permalink":"http://vcpu.me/categories/index.html","excerpt":"","text":""},{"title":"","date":"2017-06-22T14:58:14.000Z","updated":"2017-06-22T14:58:14.000Z","comments":true,"path":"top/index.html","permalink":"http://vcpu.me/top/index.html","excerpt":"","text":"AV.initialize(\"l5u4v8AcoPCp9wlPGVJTG6Ny-gzGzoHsz\", \"RQdYIOd2snvd1muAW65nsn25\"); var time=0 var title=\"\" var url=\"\" var query = new AV.Query('Counter');//表名 query.notEqualTo('id',0); //id不为0的结果 query.descending('time'); //结果按阅读次数降序排序 query.limit(20); //最终只返回10条结果 query.find().then(function (todo) { for (var 
i=0;i"}],"posts":[{"title":"最精简的vxlan封装解封装过程","slug":"vxlan_openvswitch_simple","date":"2018-06-19T10:00:00.000Z","updated":"2020-12-31T14:17:44.142Z","comments":true,"path":"vxlan_openvswitch_simple/","link":"","permalink":"http://vcpu.me/vxlan_openvswitch_simple/","excerpt":"vxlan使用背景如下图","text":"vxlan使用背景如下图 vxlan 环境实现脚本12345678910111213141516171819#vbox虚拟机1ip netns add leftip link add name veth1 type veth peer name sw1-p1ip link set dev veth1 netns leftip netns exec left ifconfig veth1 10.0.0.1/24 upovs-vsctl add-br sw1ovs-vsctl add-port sw1 sw1-p1ip link set sw1-p1 up#vbox虚拟机2ip netns add rightip link add name veth1 type veth peer name sw2-p1ip link set dev veth1 netns rightip netns exec right ifconfig veth1 10.0.0.2/24 upovs-vsctl add-br sw2ovs-vsctl add-port sw2 sw2-p1ip link set sw2-p1 up vxlan实现过程粗略过程说明 默认端口4789 123456789101112131415[root@localhost ~]# netstat -nulpActive Internet connections (only servers)Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program nameudp 0 0 0.0.0.0:14803 0.0.0.0:* 2250/dhclientudp 0 0 0.0.0.0:12221 0.0.0.0:* 723/dhclientudp 0 0 0.0.0.0:68 0.0.0.0:* 2250/dhclientudp 0 0 0.0.0.0:68 0.0.0.0:* 723/dhclientudp 0 0 127.0.0.1:323 0.0.0.0:* 674/chronydudp 0 0 0.0.0.0:4789 0.0.0.0:* -udp6 0 0 :::48083 :::* 723/dhclientudp6 0 0 ::1:323 :::* 674/chronydudp6 0 0 :::4789 :::* -udp6 0 0 :::42780 :::* 2250/dhclient[root@localhost ~]#[root@localhost ~]# 系统建立的vxlan_sys_4789接口用来目的端口为4789的内核udp socket监听 1234567891011121314151617181920212223242526272829[root@localhost ~]# ip addr1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 08:00:27:69:6e:c9 brd ff:ff:ff:ff:ff:ff inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic 
enp0s3 valid_lft 86079sec preferred_lft 86079sec inet6 fe80::236e:8cc4:b25d:f30b/64 scope link valid_lft forever preferred_lft forever3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 08:00:27:87:25:16 brd ff:ff:ff:ff:ff:ff inet 192.168.56.102/24 brd 192.168.56.255 scope global dynamic enp0s8 valid_lft 706sec preferred_lft 706sec inet6 fe80::8a01:cdf3:b4e3:5db6/64 scope link valid_lft forever preferred_lft forever4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000 link/ether a2:b4:8e:70:bb:cd brd ff:ff:ff:ff:ff:ff5: sw1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 56:78:07:db:b7:49 brd ff:ff:ff:ff:ff:ff6: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000 link/ether 42:b4:6c:96:43:9c brd ff:ff:ff:ff:ff:ff7: sw1-p1@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP qlen 1000 link/ether 6e:f1:2a:cb:97:e4 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet6 fe80::6cf1:2aff:fecb:97e4/64 scope link valid_lft forever preferred_lft forever 物理口可以抓取到vxlan流量 12345678[root@localhost ~]# tcpdump -i enp0s8 -ne udptcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on enp0s8, link-type EN10MB (Ethernet), capture size 262144 bytes01:34:59.020741 08:00:27:92:8d:41 > 08:00:27:87:25:16, ethertype IPv4 (0x0800), length 148: 192.168.56.103.54252 > 192.168.56.102.4789: VXLAN, flags [I] (0x08), vni 03a:72:db:83:a4:1f > 12:d8:2b:1f:c1:52, ethertype IPv4 (0x0800), length 98: 10.0.0.2 > 10.0.0.1: ICMP echo request, id 2426, seq 40, length 6401:34:59.020889 08:00:27:87:25:16 > 08:00:27:92:8d:41, ethertype IPv4 (0x0800), length 148: 192.168.56.102.52961 > 192.168.56.103.4789: VXLAN, flags [I] (0x08), vni 012:d8:2b:1f:c1:52 > 3a:72:db:83:a4:1f, ethertype IPv4 (0x0800), length 98: 10.0.0.1 > 10.0.0.2: ICMP echo reply, id 2426, seq 40, length 64^C 
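The environment script earlier only creates the bridges and veth ports; the tunnel ports themselves are not shown. A plausible sketch of their creation — the peer addresses 192.168.56.102/103 are taken from the tcpdump capture above, so adjust them to your topology:

```shell
# VM1 (192.168.56.102): attach a vxlan tunnel port to sw1.
# As a side effect, OVS creates the shared vxlan_sys_4789 device
# and the kernel UDP socket listening on port 4789.
ovs-vsctl add-port sw1 tun0 -- set interface tun0 type=vxlan \
    options:remote_ip=192.168.56.103

# VM2 (192.168.56.103): the mirror-image configuration.
ovs-vsctl add-port sw2 tun0 -- set interface tun0 type=vxlan \
    options:remote_ip=192.168.56.102
```

With both tunnel ports in place, `ip netns exec left ping 10.0.0.2` should produce VXLAN-encapsulated ICMP traffic like that shown in the capture.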
如果想将解封装后流量发给openvswitch需要在openvswitch上建立类型为vxlan的口,如tun0 12345678910111213141516171819202122232425262728293031323334353637[root@localhost ~]# ovs-ofctl show sw1OFPT_FEATURES_REPLY (xid=0x2): dpid:0000567807dbb749n_tables:254, n_buffers:256capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IPactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 1(sw1-p1): addr:6e:f1:2a:cb:97:e4 config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max 3(tun0): addr:c2:58:e0:88:9d:1e config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(sw1): addr:56:78:07:db:b7:49 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps maxOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len= 说明: 在接收到对端发过来的vxlan数据流量时候,因其目的端口是4789,内核收到发给这个端口的数据包时,会对数据流量进行vxlan解封装,因为vxlan接口配置到ovs上,所以会将解封装后的数据流量送给openvswitch转发处理 如果openvswitch数据流量发给了配置的vxlan口,vxlan口会对数据包进行vxlan封装,然后对封装后的数据包进行二三层转发 vxlan收包处理过程openvswitch vxlan收包过程如下 默认情况下发给4789端口的udp数据包,会在内核态呗截取,交给vxlan_rcv处理,vxlan_rcv该函数负责解封装然后将数据包挂入gcells1234567891011121314151617181920212223242526272829303132333435363738390xffffffff8156efa0 : __napi_schedule+0x0/0x50 [kernel] 0xffffffffa045d67b : vxlan_rcv+0x99b/0xb00 [vxlan] 0xffffffff815e2818 : udp_queue_rcv_skb+0x1f8/0x4f0 [kernel] 0xffffffff815e355a : __udp4_lib_rcv+0x54a/0x880 [kernel] 0xffffffff815e3dfa : udp_rcv+0x1a/0x20 [kernel] 0xffffffff815b1584 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1869 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b120a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b1b96 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff81570062 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff815702f8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff81570380 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571498 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00472fc : e1000_clean_rx_irq+0x2ac/0x4f0 [e1000] 0xffffffffa0047d31 : 
e1000_clean+0x281/0x8f0 [e1000] 0xffffffff81570b20 : net_rx_action+0x170/0x380 [kernel] 0xffffffff8108f63f : __do_softirq+0xef/0x280 [kernel] 0xffffffff8169919c : call_softirq+0x1c/0x30 [kernel] 0xffffffff8102d365 : do_softirq+0x65/0xa0 [kernel] 0xffffffff8108f9d5 : irq_exit+0x115/0x120 [kernel] 软中断出发时候net_rx_action 会处理调用gro_cell_poll从gcells中取出skb进行消耗最终调用__netif_receive_skb_core下的ovs_vport_receive将数据包送给openvswitch流程 1234567891011121314151617181920212223242526270xffffffffa043ea40 : ovs_vport_receive+0x0/0xd0 [openvswitch] 0xffffffffa043fc8e : netdev_frame_hook+0xde/0x160 [openvswitch] 0xffffffff8156fcc2 : __netif_receive_skb_core+0x1e2/0x800 [kernel] 0xffffffff815702f8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff81570380 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571498 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa045a30a : gro_cell_poll+0x7a/0xc0 [vxlan] 0xffffffff81570b20 : net_rx_action+0x170/0x380 [kernel] 0xffffffff8108f63f : __do_softirq+0xef/0x280 [kernel] 0xffffffff8169919c : call_softirq+0x1c/0x30 [kernel] 0xffffffff8102d365 : do_softirq+0x65/0xa0 [kernel] 0xffffffff8108f9d5 : irq_exit+0x115/0x120 [kernel] 0xffffffff81699d38 : do_IRQ+0x58/0xf0 [kernel] 0xffffffff8168eded : ret_from_intr+0x0/0x15 [kernel] 数据包送给openvswitch流程在openvswitch内部处理过程和无差别,因为此时数据包已经是解过封装了。所以该数据包会发给namespace left 该数据包会呗放入到CPU队列中等待left namespace协议栈读取消耗1234567891011121314151617181920212223242526272829303132333435363738390xffffffff8156f130 : enqueue_to_backlog+0x0/0x170 [kernel] 0xffffffff8156f2e5 : netif_rx_internal+0x45/0x120 [kernel] 0xffffffff8156f3de : dev_forward_skb+0x1e/0x30 [kernel] 0xffffffffa03a34ba : veth_xmit+0x2a/0x60 [veth] 0xffffffff8156f8a1 : dev_hard_start_xmit+0x171/0x3b0 [kernel] 0xffffffff81572656 : __dev_queue_xmit+0x466/0x570 [kernel] 0xffffffff81572770 : dev_queue_xmit+0x10/0x20 [kernel] 0xffffffffa03881d4 : ovs_vport_send+0x44/0xb0 [openvswitch] 0xffffffffa037a300 : do_output.isra.31+0x40/0x150 [openvswitch] 0xffffffffa037b74d : 
do_execute_actions+0x73d/0x890 [openvswitch] 0xffffffffa037b8e1 : ovs_execute_actions+0x41/0x130 [openvswitch] 0xffffffffa037e929 : ovs_packet_cmd_execute+0x2c9/0x2f0 [openvswitch] 0xffffffff815a6d5a : genl_family_rcv_msg+0x20a/0x430 [kernel] 0xffffffff815a7011 : genl_rcv_msg+0x91/0xd0 [kernel] 0xffffffff815a4f89 : netlink_rcv_skb+0xa9/0xc0 [kernel] 0xffffffff815a54b8 : genl_rcv+0x28/0x40 [kernel] 0xffffffff815a467d : netlink_unicast+0xed/0x1b0 [kernel] 0xffffffff815a4a5e : netlink_sendmsg+0x31e/0x690 [kernel] 0xffffffff81555ef0 : sock_sendmsg+0xb0/0xf0 [kernel] 0xffffffff81556799 : ___sys_sendmsg+0x3a9/0x3c0 [kernel] namespace left协议栈收到该数包发现是发给本机接口的数据包,直接回复icmp reply1234567891011121314151617181920212223242526272829303132333435363738390xffffffff815e8040 : icmp_rcv+0x0/0x380 [kernel] 0xffffffff815b1584 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1869 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b120a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b1b96 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff81570062 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff815702f8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff8157159e : process_backlog+0xae/0x170 [kernel] 0xffffffff81570b20 : net_rx_action+0x170/0x380 [kernel] 0xffffffff8108f63f : __do_softirq+0xef/0x280 [kernel] 0xffffffff8169919c : call_softirq+0x1c/0x30 [kernel] 0xffffffff8102d365 : do_softirq+0x65/0xa0 [kernel] 0xffffffff8108e894 : local_bh_enable+0x94/0xa0 [kernel] 0xffffffffa037e930 : ovs_packet_cmd_execute+0x2d0/0x2f0 [openvswitch] 0xffffffff815a6d5a : genl_family_rcv_msg+0x20a/0x430 [kernel] 0xffffffff815a7011 : genl_rcv_msg+0x91/0xd0 [kernel] 0xffffffff815a4f89 : netlink_rcv_skb+0xa9/0xc0 [kernel] 0xffffffff815a54b8 : genl_rcv+0x28/0x40 [kernel] 0xffffffff815a467d : netlink_unicast+0xed/0x1b0 [kernel] 0xffffffff815a4a5e : netlink_sendmsg+0x31e/0x690 [kernel] vxlan发包过程因为最终数据包从openvswitch侧发给了vxlan口,vxlan口会调用dev_hard_start_xmit将数据包发送出去,因为是vxlan口所以需要对数据包进行封装,很显然封装的过程具体实现细节 
发生在udp_tunnel_xmit_skb 和 iptunnel_xmit函数中,最后调用ip_local_out_sk将封装好的数据包当成本机数据包发出去,当然此时二层、三次转发查找路由的过程,都是借用的本机发包的流程了,这里就不再详细说明了1234567891011121314151617181920212223242526272829303132333435363738390xffffffff815fbfc0 : iptunnel_xmit+0x0/0x1a0 [kernel] 0xffffffffa02b12b3 : udp_tunnel_xmit_skb+0xe3/0x100 [udp_tunnel] 0xffffffffa039a253 : vxlan_xmit_one+0x7e3/0xb60 [vxlan] 0xffffffffa039b81f : vxlan_xmit+0x41f/0xce0 [vxlan] 0xffffffff8156f8a1 : dev_hard_start_xmit+0x171/0x3b0 [kernel] 0xffffffff81572656 : __dev_queue_xmit+0x466/0x570 [kernel] 0xffffffff81572770 : dev_queue_xmit+0x10/0x20 [kernel] 0xffffffffa03881d4 : ovs_vport_send+0x44/0xb0 [openvswitch] 0xffffffffa037a300 : do_output.isra.31+0x40/0x150 [openvswitch] 0xffffffffa037b74d : do_execute_actions+0x73d/0x890 [openvswitch] 0xffffffffa037b8e1 : ovs_execute_actions+0x41/0x130 [openvswitch] 0xffffffffa037f445 : ovs_dp_process_packet+0x85/0x110 [openvswitch] 0xffffffffa0387aac : ovs_vport_receive+0x6c/0xd0 [openvswitch] 0xffffffffa0388c8e : netdev_frame_hook+0xde/0x160 [openvswitch] 0xffffffff8156fcc2 : __netif_receive_skb_core+0x1e2/0x800 [kernel] 0xffffffff815702f8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff8157159e : process_backlog+0xae/0x170 [kernel] 0xffffffff81570b20 : net_rx_action+0x170/0x380 [kernel] 0xffffffff8108f63f : __do_softirq+0xef/0x280 [kernel] 0xffffffff8169919c : call_softirq+0x1c/0x30 [kernel] vlxan数据包UDP端口的选择从代码实现来看,应该是根据vxlan封装前的源目的ip和端口进行hash获取的UDP发送端口,细节后续再研究 12345678910111213141516171819202122232425262728static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb, int min, int max, bool use_eth){ u32 hash; if (min >= max) { /* Use default range */ inet_get_local_port_range(net, &min, &max); } hash = skb_get_hash(skb); if (unlikely(!hash) && use_eth) { /* Can't find a normal hash, caller has indicated an Ethernet * packet so use that to compute a hash. 
*/ hash = jhash(skb->data, 2 * ETH_ALEN, (__force u32) skb->protocol); } /* Since this is being sent on the wire obfuscate hash a bit * to minimize possbility that any useful information to an * attacker is leaked. Only upper 16 bits are relevant in the * computation for 16 bit port value. */ hash ^= hash << 16; return htons((((u64) hash * (max - min)) >> 32) + min);}","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"openvswitch Datapath","slug":"openvswitch-Datapath","permalink":"http://vcpu.me/tags/openvswitch-Datapath/"},{"name":"openvswitch 内核工作原理","slug":"openvswitch-内核工作原理","permalink":"http://vcpu.me/tags/openvswitch-内核工作原理/"},{"name":"vxlan工作过程","slug":"vxlan工作过程","permalink":"http://vcpu.me/tags/vxlan工作过程/"},{"name":"vxlan工作架构","slug":"vxlan工作架构","permalink":"http://vcpu.me/tags/vxlan工作架构/"}]},{"title":"最简单的openvswitch转发过程内核代码级实现","slug":"ovs_simple_forward1","date":"2018-06-18T10:00:00.000Z","updated":"2018-06-20T03:38:37.000Z","comments":true,"path":"ovs_simple_forward1/","link":"","permalink":"http://vcpu.me/ovs_simple_forward1/","excerpt":"笔者写这篇文章出于好奇,对namespace和openvswitch如何工作在内核协议栈中的好奇,也出于对openvswitch了解的渴望,因此这篇文章仅仅是开端,后续会有更多文章揭秘openvswitch具体工作过程,不啰嗦了,下面步入正题。 这篇文章是基于两个namespace经过openvswitch之间进行通信的场景,分析寻找该通信底层发生的事情;所以该篇文章会涉及namespace协议栈处理和openvswitch通信过程两大部分;不过后续文章书写部分并没有按照这个书写逻辑;后文是根据笔者具体实践步骤书写的,这两部分内容会蕴含其中,具体请看后文。 最简单的环境1234567891011121314151617181920 br0 +--------------------------------------+ +--+ +--+ +---+ | tap1 tap2| +---+ | +--+ +--+ | | | | | | +--------------------------------------+ | | | | | | | | |+------------------+ +-------------------+| tap1 | | tap2 ||192.168.1.101/24 | | 192.168.1.102/24 || | | || | | || | | || namespace ns1 | | namespace ns1 || | | |+------------------+ +-------------------+","text":"笔者写这篇文章出于好奇,对namespace和openvswitch如何工作在内核协议栈中的好奇,也出于对openvswitch了解的渴望,因此这篇文章仅仅是开端,后续会有更多文章揭秘openvswitch具体工作过程,不啰嗦了,下面步入正题。 
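The source-port arithmetic in the `udp_flow_src_port()` function quoted above can be re-run outside the kernel. A toy bash sketch — the flow hash value here is made up (in the kernel it comes from `skb_get_hash()`), and 32768–60999 stands in for the local port range:

```shell
#!/bin/bash
# Toy re-run of the udp_flow_src_port() arithmetic.
hash=$((0x9e3779b9))   # hypothetical skb_get_hash() result
min=32768              # stand-in for inet_get_local_port_range()
max=60999

# hash ^= hash << 16, truncated to 32 bits as in the kernel
hash=$(( (hash ^ (hash << 16)) & 0xffffffff ))

# (((u64)hash * (max - min)) >> 32) + min, without the htons byte swap
port=$(( ((hash * (max - min)) >> 32) + min ))
echo "source port: $port"
```

Each distinct inner flow hashes to a different source port in [min, max), which is what lets the underlay spread VXLAN traffic across ECMP paths and NIC receive queues.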
这篇文章是基于两个namespace经过openvswitch之间进行通信的场景,分析寻找该通信底层发生的事情;所以该篇文章会涉及namespace协议栈处理和openvswitch通信过程两大部分;不过后续文章书写部分并没有按照这个书写逻辑;后文是根据笔者具体实践步骤书写的,这两部分内容会蕴含其中,具体请看后文。 最简单的环境1234567891011121314151617181920 br0 +--------------------------------------+ +--+ +--+ +---+ | tap1 tap2| +---+ | +--+ +--+ | | | | | | +--------------------------------------+ | | | | | | | | |+------------------+ +-------------------+| tap1 | | tap2 ||192.168.1.101/24 | | 192.168.1.102/24 || | | || | | || | | || namespace ns1 | | namespace ns1 || | | |+------------------+ +-------------------+ 组网实现脚本123456789101112131415161718ip netns add ns1ip netns add ns2ovs-vsctl add-br br0ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip link set dev tap1 upovs-vsctl add-port br0 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip link set dev tap2 upip netns exec ns1 ip addr add 192.168.1.101/24 dev tap1ip netns exec ns2 ip addr add 192.168.1.102/24 dev tap2ip netns exec ns1 ip link set lo upip netns exec ns2 ip link set lo up 连通性测试1234[root@localhost ~]# ip netns exec ns1 ping 192.168.1.102 -c 100PING 192.168.1.102 (192.168.1.102) 56(84) bytes of data.64 bytes from 192.168.1.102: icmp_seq=1 ttl=64 time=0.172 ms64 bytes from 192.168.1.102: icmp_seq=2 ttl=64 time=0.051 ms 内核处理过程在执行该过程中,在我的Centos7.2环境装了systemtap,同时安装了适用systemtap所需要的对应内核的调试和开发数据包,如下 123456789101112[root@localhost ~]# uname -aLinux localhost.localdomain 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 13:29:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux[root@localhost ~]#[root@localhost ~]# rpm -qa|grep kernkernel-debuginfo-common-x86_64-3.10.0-514.26.2.el7.x86_64kernel-3.10.0-514.26.2.el7.x86_64kernel-debuginfo-3.10.0-514.26.2.el7.x86_64kernel-tools-libs-3.10.0-514.el7.x86_64kernel-3.10.0-514.el7.x86_64kernel-devel-3.10.0-514.26.2.el7.x86_64kernel-tools-3.10.0-514.el7.x86_64kernel-headers-3.10.0-514.26.2.el7.x86_64 
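Besides tracing call stacks, the kernel datapath state can be inspected directly while the ping is running. A quick sketch, assuming the openvswitch userspace tools from the setup above are installed (on newer OVS releases `ovs-appctl dpctl/dump-flows` replaces `ovs-dpctl dump-flows`):

```shell
# Dump the exact-match flows the openvswitch kernel module has
# cached for the ns1 <-> ns2 ICMP traffic. A packet that misses
# here is upcalled to ovs-vswitchd, which installs a new entry.
ovs-dpctl dump-flows

# Per-bridge view of the OpenFlow table in the userspace daemon.
ovs-ofctl dump-flows br0
```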
本节使用systemtap探究第一节中描述场景下通信的内核数据包转发流程;当然在使用此方法时,笔者已经知道数据包大概流程,所以才知道追踪哪个函数,因此后文抓取协议栈的调用栈时候,请不要疑惑为什么抓取此函数 跟踪内核的stp脚本追踪调用栈的脚本如下 12345678[root@localhost ~]# cat bt.stp probe kernel.function(@1){ print(\"----------------START-------------------------\\n\") printf(\"In process [%s]\\n\", execname()) print_regs() print_backtrace() print(\"----------------END-------------------------\\n\") } 调用方法如下 12stap --all-modules bt.stp enqueue_to_backlogstap --all-modules bt.stp icmp_rcv |grep -30 \"ping\" 根据调用栈结果得出如下的流量转发路径 数据包在ns1 中处理过程 ns1 ping ns2地址首先运行用户态程序ping,ping会通过socket最终调用kernel ip_push_pending_frames将该数据包引入到ns1命名空间的协议栈 在ns1中因为该数据包是属于本机向外发出的流量因此最终会经过ip_output流程,最终根据虚拟口tap1 调用dev_hard_start_xmit将该数据包发出 被发出的数据包因为该tap1口连接到openvswitch br0上,最终会经过dev_hard_start_xmit 调用internal_dev_xmit将该数据包注入openvswitch收数据包流程中 该数据包注入到openvswitch流程后会调用ovs_vport_receive来处理 数据包在openvswitch中处理过程 ovs_vport_receive收到该数据包后,就相当于进入了openvswitch的datapath流程 在ovs_vport_receive会根据当前数据包的流量特征查找对应的flow,如果无法查找到会upcall到用户态进行流表查询流程 因为当前测试的流表是normal转发,所以查询结束后,会调用ovs_execute_actions执行具体动作,当前的测试场景下会调用do_output.isra.31处理数据包,因为该动作是要发往tap2,所以调用ovs_vport_send最终发给tap2口 因为tap2属于openvswitch的内部口,所以会调用internal_dev_recv将该数据包发给tap2处理 目前发给tap2处理后,该过程不是直接将数据包送给ns2的协议栈收报入口进行处理,而是将数据包经过netif_rx和netif_rx_internal将数据包enqueue_to_backlog挂入到CPU的队列中 后续ns2软中断时候会从该队列中读取消耗该数据包 数据包在ns2中处理过程 ns2协议栈会经过软中断处理do_softirq后,最终调用process_backlog处理cpu队列上的数据包 process_backlog会调用协议栈的__netif_receive_skb来消耗数据包 该数据包会经过ip层处理ip_rcv,因为是发给本机的数据包所以调用ip_local_deliver最终交给icmp_rcv进行处理回复 总结经过以上流程的详细描述,可以知道openvswitch是负责连接两个ns转发的路径,而ns1和ns2协议栈是独立的,如果想要通信需要借助openvswith这样的虚拟交换机; 且经过虚拟交换机将数据包注入ns2过程并不是直接调用ns2的收包处理函数,而是将数据包挂到CPU队列中,待ns2协议栈中断触发时候自己去处理消耗,其实这也间接的说明了其数据包在大模块间通信时候采用的生产者和消费者模型 而不是直接调用ns2的收包处理函数去处理,其实这样是有好处的,它的好处是将数据包提供模块和数据包消耗模块分开,留给数据包消耗模块更大自主空间去处理消耗该数据包。 当然笔者也想到一个问题,当ns2软中断非常忙时候,无法及时消耗数据包,会出现tap口tx 
drop计数;当出现该问题时当然你可以修改cpu放置的数据包队列大小来稍微更改效果;但是笔者认为更多的经历应该放在为什么ns2处理数据慢?笔者相信通过找到该问题原因,drop丢包问题也会得到解决。","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"openvswitch Datapath","slug":"openvswitch-Datapath","permalink":"http://vcpu.me/tags/openvswitch-Datapath/"},{"name":"openvswitch 内核工作原理","slug":"openvswitch-内核工作原理","permalink":"http://vcpu.me/tags/openvswitch-内核工作原理/"},{"name":"namespace协议栈工作过程","slug":"namespace协议栈工作过程","permalink":"http://vcpu.me/tags/namespace协议栈工作过程/"},{"name":"internal ovs口","slug":"internal-ovs口","permalink":"http://vcpu.me/tags/internal-ovs口/"},{"name":"数据包内核态转发过程","slug":"数据包内核态转发过程","permalink":"http://vcpu.me/tags/数据包内核态转发过程/"}]},{"title":"数据包如何从物理网卡到达云主机的应用程序?","slug":"packet_from_nic_to_user_process","date":"2018-06-12T10:00:00.000Z","updated":"2018-06-12T04:59:36.000Z","comments":true,"path":"packet_from_nic_to_user_process/","link":"","permalink":"http://vcpu.me/packet_from_nic_to_user_process/","excerpt":"笔者工作后从事的行业是linux2.6内核态的负载均衡研发,但是当时采用的mips CPU且整个架构重构了;对于通用的服务器的网络数据包运行流程,有了解,但是缺少大规模的总结和深刻的认识 因为笔者在之前工作中解决了云主机tap口丢包问题,就借此机会详细阐述下数据包处理流程吧,一为总结,二为兴趣,笔者不太喜欢看黑盒的东西。 如上图示,数据包从物理HOST外侧,经过物理网卡,发给ovs bridge,最后经过linux 安全组交给guest云主机1,最后交给guest1云主机应用程序;后文就从接近代码级别阐述此过程的详细步骤;笔者讲述的这个过程只能算是通用流程,可以用来参考,因为不同物理网卡会存在差异性。希望你通过这篇文章了解整个数据包转发流程,更深入了解底层网络实现。","text":"笔者工作后从事的行业是linux2.6内核态的负载均衡研发,但是当时采用的mips CPU且整个架构重构了;对于通用的服务器的网络数据包运行流程,有了解,但是缺少大规模的总结和深刻的认识 因为笔者在之前工作中解决了云主机tap口丢包问题,就借此机会详细阐述下数据包处理流程吧,一为总结,二为兴趣,笔者不太喜欢看黑盒的东西。 如上图示,数据包从物理HOST外侧,经过物理网卡,发给ovs bridge,最后经过linux 安全组交给guest云主机1,最后交给guest1云主机应用程序;后文就从接近代码级别阐述此过程的详细步骤;笔者讲述的这个过程只能算是通用流程,可以用来参考,因为不同物理网卡会存在差异性。希望你通过这篇文章了解整个数据包转发流程,更深入了解底层网络实现。 粗略的转发逻辑其实下图中有linux bridge设备,用来实现安全组功能,现在大部分云计算实现,已经去除linux bridge,采用ovs 流表实现。但是笔者定位tap口丢包问题时候,就是这种转发路径 所以就按照这种转发路径进行流程分析,对于想了解ovs情况读者来说,内容也只多不少吧 如上图,一个数据包从物理网卡到云主机内部粗略转发逻辑可以概括如下四大部分,后文分别对该四大部分进行较详细的解析 数据包经过物理口到达ovs 桥(br-int) 通过虚拟机网络中的veth pair重新传送给 linux bridge linux bridge 将数据包发给tap口,送给云主机 
云主机收到该数据包并最终将数据包交给云主机内部程序 数据包经过物理口到达ovs 桥(br-int) 数据包经过物理服务器物理网卡送给br-int,虽然描述起来很简单,但是其实质过程是相当复杂的;这个过程涉及到物理网卡收数据包,驱动将数据包给内核协议栈,协议栈将数据包发给ovs三大部分; 因为过程较为复杂,我们在此继续将该过程拆分成三大部分进行描述,详细部分如下: 物理网卡处理 中断下半部分软中断处理 将数据包交给内核的ovs bridge处理 物理网卡处理 物理网卡收到数据包的处理流程如上图所示,详细步骤如下: 网卡收到数据包,先将高低电平转换到网卡fifo存储,网卡申请ring buffer的描述,根据描述找到具体的物理地址,从fifo队列物理网卡会使用DMA将数据包写到了该物理地址写到了,其实就是skb_buffer中 这个时候数据包已经被转移到skb_buffer中,因为是DMA写入,内核并没有监控数据包写入情况,这时候NIC触发一个硬中断,每一个硬件中断会对应一个中断号,且指定一个vCPU来处理,如上图vcpu2收到了该硬件中断 硬件中断的中断处理程序,调用驱动程序完成,a.启动软中断 硬中断触发的驱动程序会禁用网卡硬中断,其实这时候意思是告诉NIC,再来数据不用触发硬中断了,把数据DMA拷入系统内存即可 硬中断触发的驱动程序会启动软中断,启用软中断目的是将数据包后续处理流程交给软中断慢慢处理,这个时候退出硬件中断了,但是注意和网络有关的硬中断,要等到后续开启硬中断后,才有机会再次被触发 NAPI触发软中断,触发napi系统 消耗ringbuffer指向的skb_buffer NAPI循环处理ringbuffer数据,处理完成 启动网络硬件中断,有数据来时候就可以继续触发硬件中断,继续通知CPU来消耗数据包 其实上述过程过程简单描述为:网卡收到数据包,DMA到内核内存,中断通知内核数据有了,内核按轮次处理消耗数据包,一轮处理完成后,开启硬中断。其核心就是网卡和内核其实是生产和消费模型,网卡生产,内核负责消费,生产者需要通知消费者消费;如果生产过快会产生丢包,如果消费过慢也会产生问题。也就说在高流量压力情况下,只有生产消费优化后,消费能力够快,此生产消费关系才可以正常维持,所以如果物理接口有丢包计数时候,未必是网卡存在问题,也可能是内核消费的太慢。 在介绍完整体流程后,也许明白很多,其实也迷惑很多,我的疑惑是,网卡数据如何写到内核内存? 如何将网卡收到的数据写入到内核内存?引用:https://tech.meituan.com/Redis_High_Concurrency_Optimization.html NIC在接收到数据包之后,首先需要将数据同步到内核中,这中间的桥梁是rx ring buffer。它是由NIC和驱动程序共享的一片区域,事实上,rx ring buffer存储的并不是实际的packet数据,而是一个描述符,这个描述符指向了它真正的存储地址,具体流程如下: 驱动在内存中分配一片缓冲区用来接收数据包,叫做sk_buffer; 将上述缓冲区的地址和大小(即接收描述符),加入到rx ring buffer。描述符中的缓冲区地址是DMA使用的物理地址; 驱动通知网卡有一个新的描述符; 网卡从rx ring buffer中取出描述符,从而获知缓冲区的地址和大小; 网卡收到新的数据包; 网卡将新数据包通过DMA直接写到sk_buffer中。 当驱动处理速度跟不上网卡收包速度时,驱动来不及分配缓冲区,NIC接收到的数据包无法及时写到sk_buffer,就会产生堆积,当NIC内部缓冲区写满后,就会丢弃部分数据,引起丢包。这部分丢包为rx_fifo_errors,在 /proc/net/dev中体现为fifo字段增长,在ifconfig中体现为overruns指标增长。 在理解完物理网卡处理流程后,后面理解中断下本部分中断处理详细过程。 中断下半部分软中断处理见上图,在上述第7步,主要流程是一个一个的消费数据包,那么具体流程如何? 
7.1 因为igb_ckean_rx_irq会循环消耗数据包,但是需要有个度,否则就会一直在消耗数据包,整个CPU一只停留在处理部分,用户态程序就没有机会真正消费数据包了,所以这循环有次数限制,内核参数是net.core.netdev_budget = 300,其含义是最大300个,无论如何 都应该退出软中断部分,空出CPU时间片,就有机会调度应用程序了。 7.2 取出skb,调用napi_gro_receive,这个函数先做一些GRO包合并动作,然后根据是否开启RPS执行如下流程 7.2.1: 开启了RPS,将会将数据包放到hash到的vCPU队列中,默认大小由net.core.netdev_max_backlog = 1000 控制,数据包挂入这个队列后,本包处理就结束了,继续处理下一个包。也就是数据包如果超过1000个,就会被丢弃,所以用户态消耗还是要跟上才行。 当然此处涉及到到数据包挂到cpu3队列后,后续执行流程是cpu3中断触发时候,会从取出数据调用__netif_receive_skb_core处理消耗该数据包 7.2.2:没有开启RPS,直接使用vcpu2调用__netif_receive_skb_core进行处理消耗该数据包 rps由上述描述可以知道,开启rps后,接收硬件中断vCPU2 和处理下半步中断的vcpu2就有机会把大量的数据包简短处理,直接挂到其它vcpu队列上,这样就能减少该vcpu2压力,vcpu2就能处理更大流量。RPS适合单网卡队列,多vcpu的使用场景 __netif_receive_skb_core__netif_receive_skb_core 是协议栈处理数据包的入口函数,你使用tcpdump抓包就是在此起作用,也就是说如果你tcpdump抓到数据包,代表该数据包已经到达协议栈入口了。 软中断处理结尾是调用__netif_receive_skb_core进行消耗数据包,后文的ovs处理流程的入口也在该函数中调用 将数据包交给内核的ovs bridge处理 上图所示是数据包在ovs内转发流程,其中netdev_frame_hook是ovs入口,被__netif_receive_skb_core调用用来处理收到的数据包;数据流量进入ovs bridge 根据数据包信息查找流表,该数据包主要有如下两种处理情况 流量发给控制器 流量在内核态直接转发 流量发给控制器内核流量通过netlink将数据包传给用户态的ovs-vswitchd进程,该进程会对流量解析,根据解析和预制处理逻辑处理(一般是下发流表) 流量在内核态直接转发以当前分析情况为例说明,数据包转发到linux veth口的详细过程,以及最后如何发给linux bridge的? 
根据流表的动作执行发送动作,最终会调用dev_hard_star_xmit,该函数会调用xmit_one , 最后调用ndo_start_xmit发送数据包(当前采用的linux 虚拟口,所以调用veth_xmit)。 其中tcpdump工作在xmit_one 其中虚拟口发包计数统计工作在veth_xmit 所以抓包是在包个数统计之前的,这一点要注意 linux虚拟机口工作原理描述一边用于收包收到数据包后立刻将该数据包发出,所谓的发出是指: “将数据包收取到后立刻再次注入到协议栈,交给另一个处理对象”,具体处理方法如下: veth_xmit将数据包发出 调用dev_forward_sk,调用协议栈收包入口netif_rx_internal, 我们当前情况很可能netif_rx_internal将数据直接放入CPU队列中,该数据包就被注入了协议栈 另一个处理对象,会经过软中断触发时候,立刻处理该CPU队列的数据包,协议栈会再次收到数据包,相当于连接linux bridge的口收到了数据包,协议栈会将该数据包送给linux bridge linux bridge 将数据包发给tap口 连接该linux bridge的是linux veth pair口,由之前的描述可以知道,数据包会经过linux veth pair被注入到协议栈,协议栈会再次调用_netif_reveive_skb处理该数据包; 而因该函数是veth口收到的数据,该接口连接的是linux bridge,协议栈会调用br_handle_frame处理该数据包。linux bridge处理数据包会经过如下几个情况: 协议栈收到该数据包会先经过NF_BR_PRE_ROUTING钩子,该钩子在处理之前,你可以通过iptabls加PRE ROUTING策略,加的策略会通过该钩子函数工作 经过NF_BR_PRE_ROUTING钩子后,会判断是将数据包当成本机数据包处理,还是继续转发(后文就以转发为例进行说明) 数据包转发时候会经过NF_BR_FORWARD钩子函数,该函数是所有转发的数据包都会经过的,安全组策略就是在此钩子中work的 转发数据包处理完成后,最后将调用br_forward_finish将数据包发出去,NF_BR_POST_ROUTING钩子存在此处,数据包发出前最后经历过的处理 最后会调用dev_hard_start_xmit函数将数据包发出 经过linux bridge转发处理后,linux最后将数据包调用dev_hard_start_xmit发给tap口了,后文描述tap详细处理过程 tap口详细处理过程按照我们本次研究的场景,linux bridge连接的是tap口,linux bridge会将数据包交给tap口;终极目标是将数据包将给用户态进程qume进程的云主机进行消耗; 所以需要将数据包从内核态转发到用户态,且触发中断告知云主机数据包来了,让云主机work起来,处理消耗数据包。很显然tap口自己是完成不了,接口紧紧能处理接收数据包 而数据内核态和用户态穿越以及中断触发,就需要通过另一个层级的内核线程驱动了,在此研究场景下使用的vhost-net技术;也就是说tap口和vhost配合完成将数据包交给云主机。 后文先说明tap口如何处理,以及tap口和vhost如何配合的 tap口如何处理数据包linux bridge最后会调用 dev_hard_start_xmit,因为连接的是tap口,所以会调用tap口的xmit_one将数据包送给tap口入口函数。 如上图,tcpdump抓取数据包点在xmit_one调用,经过抓包点后会最后将数据包调用tun_net_xmit发出去,tun_net_xmit这个函数很重要,承载了tap数据的逻辑,具体点介绍是: 发包统计计数在此统计、drop丢包计数在此统计 tun_net_xmit会将数据包放入socket.sk->sk_receive_queue,socket队列中,且该队列有长度限制,默认长度大小为tap口的tx_queuelen:一般为1000个数据包,如果长度大于1000后数据包会被丢弃,遇到这种情况,说明云主机处理的慢了,你就想办法优化云主机处理速度吧 数据包放入socket队列中,需要唤醒vhost线程工作,具体工作见tap口和vhost线程如何配合 如下:txqueuelen:1000 为tap口socket队列长度,单位是包个数 12345678root@compute-001:~# ifconfig tapc3072bad-18tapc3072bad-18 Link encap:Ethernet HWaddr fe:16:3e:09:6f:46 inet6 addr: 
fe80::fc16:3eff:fe09:6f46/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1 RX packets:3569 errors:0 dropped:0 overruns:0 frame:0 TX packets:372 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:294046 (294.0 KB) TX bytes:26034 (26.0 KB) tap口和vhost线程如何配合其实tap口和vhost配合很简单,tap口接收到数据包后放入挂到socket队列后,调用tun_net_xmit后,唤醒vhost线程让起工作即可, 这个时候vhost线程会将数据包考入云主机的ring buffer中,且触发中断告诉云主机数据包已经来了,让云主机去处理消耗数据包,详细见vhost线程工作过程 vhost线程工作过程vhost线程,是云主机vhost-net后,建立云主机时候一并建立的线程,线程名称组成为vhost-26735,其中26735为qume进程号 说到这里获取设计常用的接口工作技术,按照器性能高低来个排序, vhost-user > vhost >virtio ,在此不多做解释了 如下图: vhost线程是个死循环,它被唤醒后,循环做两件事情 事情1:注入中断触发gust主机活动起来 事情2:循环从tap口的socket队列中取出数据,直接将数据包考入用户态云主机的ringbuffer中,注意ringbuffer是vhost线程内核和用户态共享内存 做完这两件事情,后续的事情就交给kvm触发中断物理主机vcpu如何将cpu让给gust主机的CPU了,这些事情后文讲述 kvm如何中断触发云主机协议栈?如下图vhost线程会调用kvm一些接口触发中断,为了触发中断主要做了两件事情 事情1:判断物理cpu运行中,让cpu运行先退出,为了有机会注入中断 事情2:想目标cpu添加一个请求,这个请求vcpu死循环会去检查,如果发现了这个请求,就真正注入中断 也就是说,这两件事情其实只是让cpu退出,方便cpu注入,和提前写入注入中断标记,其过程还需要vcpu死循环来完成 同样如下图:vcpu死循环主要干了什么? 
vcpu在运行的时候一直处于循环状态,在循环体中,会有环节检查中断注入标记,如果发现该标记就调用kvm_x86_ops->run触发中断 该触发会最终调用中断处理函数vp_interrupt -> vp_ring_interrupt-> 最终触发协议栈的运行,让协议栈调用__netif_receive_core处理数据包 云主机收到该数据包并最终将数据包交给云主机内部程序其处理就是正常linux协议栈收包处理是一致的,这里就不多做说明了 总结到此叙述算是完成了,这还仅仅是针对于特定场景下收包过程,虽然描述完成了,其实我也还是有很多疑问点,因此这就当作你一个总结吧,后续根据不同的环境你可以自己跟踪下运行过程。 在跟踪此流程过程中,借用了网上一些总结、同时也借用systemtap探究了调用栈、并且配合linux kernel源代码,但是其过程介绍的还是相对分散了,我们对这个流程再来一次汇总吧,如下图","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"linux bridge","slug":"linux-bridge","permalink":"http://vcpu.me/tags/linux-bridge/"},{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/tags/linux-kernel/"},{"name":"数据包转发原理","slug":"数据包转发原理","permalink":"http://vcpu.me/tags/数据包转发原理/"},{"name":"openvswitch","slug":"openvswitch","permalink":"http://vcpu.me/tags/openvswitch/"},{"name":"协议栈处理过程","slug":"协议栈处理过程","permalink":"http://vcpu.me/tags/协议栈处理过程/"},{"name":"网卡驱动处理过程","slug":"网卡驱动处理过程","permalink":"http://vcpu.me/tags/网卡驱动处理过程/"},{"name":"处理数据包中断上半部分和下半部分","slug":"处理数据包中断上半部分和下半部分","permalink":"http://vcpu.me/tags/处理数据包中断上半部分和下半部分/"},{"name":"netfilter","slug":"netfilter","permalink":"http://vcpu.me/tags/netfilter/"},{"name":"RPS算法工作原理","slug":"RPS算法工作原理","permalink":"http://vcpu.me/tags/RPS算法工作原理/"},{"name":"vhost-net线程工作原理","slug":"vhost-net线程工作原理","permalink":"http://vcpu.me/tags/vhost-net线程工作原理/"},{"name":"linux veth口工作原理","slug":"linux-veth口工作原理","permalink":"http://vcpu.me/tags/linux-veth口工作原理/"},{"name":"数据包抓包点和数据包统计个数位置","slug":"数据包抓包点和数据包统计个数位置","permalink":"http://vcpu.me/tags/数据包抓包点和数据包统计个数位置/"},{"name":"kvm中断触发云主机中断过程","slug":"kvm中断触发云主机中断过程","permalink":"http://vcpu.me/tags/kvm中断触发云主机中断过程/"}]},{"title":"云主机tap口丢包问题","slug":"tap_drop","date":"2018-06-11T10:00:00.000Z","updated":"2018-06-12T05:08:25.000Z","comments":true,"path":"tap_drop/","link":"","permalink":"http://vcpu.me/tap_drop/","excerpt":"背景某一用户在云平台上,在同一个VPC建立3台云主机,2台centos6.3(16vcpu,8G内存)和一台windows2008 sever(4 
vcpu,4G内存)云主机; 我们编号为A,B,C ,其中A代表压力客户端云主机, B代表应用后台服务器云主机, C代表存储服务器云主机 通过A持续压力B,发现网络流量只有50Mb情况下,在外部ping服务器B出现大幅动的延迟波动,在15ms以上,有时候甚至50ms;而无压力情况下ping 浮动ip延迟在5ms以下; 且此时查看云主机接口发现连接云主机B的tap口丢包很严重,约丢1000个左右。","text":"背景某一用户在云平台上,在同一个VPC建立3台云主机,2台centos6.3(16vcpu,8G内存)和一台windows2008 sever(4 vcpu,4G内存)云主机; 我们编号为A,B,C ,其中A代表压力客户端云主机, B代表应用后台服务器云主机, C代表存储服务器云主机 通过A持续压力B,发现网络流量只有50Mb情况下,在外部ping服务器B出现大幅动的延迟波动,在15ms以上,有时候甚至50ms;而无压力情况下ping 浮动ip延迟在5ms以下; 且此时查看云主机接口发现连接云主机B的tap口丢包很严重,约丢1000个左右。 困惑为什么tap口有丢包计数,使用tcpdump抓取tap口的数据包和云主机内部的icmp数据包,而云主机内部没有收到该数据包,所以云主机明明抓取到该数据包 为什么会有丢包计数?为什么云主机内部收不到?且云主机内部接口没有丢包计数 答案这也是后续分析linux源码找到的答案,tcpdump 抓取tap口数据包在计数统计之前,所以tcpdump在tap口抓取到该数据包并不代表该数据包已经发到云主机内部了,所以 可以推断,该tap丢包问题,应该是云主机处理数据包较慢,无法及时取出tap送给的数据包,导致tap发送数据包的队列满了,丢弃了后面后面收到的数据包; 性能问题易疏导不易调大,所以重点是为什么云主机处理数据包慢?当然这些都是后续解决该问题过程中逐渐形成的认知,写在此处以回答最初的疑惑 问题分析阶段1上述问题是用户在使用云主机发现的性能问题,在高压力情况下才出现,问题点在于压力情况下的云主机网络处理的稳定性和丢包问题;在处理该问题时候我还是一个云计算的刚刚入门不到5个月的小白,对虚拟机网络调优还接触不多。 于是从流量角度对该问题进行分析复现,通过分析问题现场抓取的数据包可以知道,该应用属于小包应用,每秒大概单向7w个数据包,所以自然而然通过pps测试来进行问题的复现,当时就借用了阿里pps测试用例对云主机的pps情况进行测试 工作 https://www.alibabacloud.com/help/zh/faq-detail/55757.htm 测试结果如下: 流测试后,约5分钟pps在25w左右波动,且有0.5%左右丢包率 5分钟后丢包服务的丢包率为0,pps稳定在13.5w,一直维持到20分钟测试结束 测试过程中ping server1 和server2未出现丢包 问题分析阶段2性能优化前:收包PPS(Packet per Second)约25-30万;发包PPS约为12万 性能优化后:收包PPS(Packet per Second)约50-55万,丢包率约0.73%;发包PPS约为40万 包大小和协议:UDP包,大小约64Bytes 虚拟机环境(Server):16vCore/16G 测试工具:iperf3 虚机操作系统:Centos 7.3 阿里云同等规格(16vCore/16G)云主机: 不同测试场景结果差异较大, 8台陪练机,一台测试机场景下,收包pps为40wpps左右,丢包较严重,平均丢包率36% 将陪练机减少至3台时性能表现最好,收包pps为可达到规格介绍中60wpps,丢包率约2.3% 做的优化操作如下 网络方向: 开启虚拟网卡软件层面的RPS和RFS 开启虚拟网卡硬件层面的多队列(RSS) CPU资源方向: 宿主机 QEMU和CPU 做亲和处理 调整虚机网卡中断和CPU关联关系,优化中断处理 问题分析阶段3经过上述的pps测试后,虽然能够提升pps能力,但是也随之发现一个带宽的问题,经过对云主机的测试发现,通过TCP进行带宽测试,每次测试持续5分钟 测试命令和步骤如下 一个客户端,一个服务器端 服务器端启动netperf服务器端进程 客户端启动netperf客户端进程 测试命令如下:netperf -H serverip -t TCP_STREAM -l 300 – -m 1440 测试结果:发现云主机的带宽在5000Mb/s-6000Mb每秒 测试结果分析:通过对比友商的云主机性能,单连接带宽测试维持在8000Mb/s 以上,且波动较少,所以得知,虽然当时云主机可以通过多条连接进行带宽测试,也能将带宽打到8000Mb以上,但是单条连接和多条连接 
是不一样的,单条连接更能说明当前云主机的带宽还需要调优。 问题分析阶段4思路抓住单带宽上不去的问题,对云主机和云主机所在HOST主机进行调优工作,所做的尝试核心指导思想如下 关闭irqbalance 设置计算节点CPU处于高性能模式 虚拟机使用VCPU和nova、neutron等服务使用VCPU分离 限定云主机进程尽量在同一个NUMA node中 云主机内调优: 开启RPS(Receive Package Steering)和RFS(Receive Flow Steering) 总结:该阶段所作的努力是为了减少波动性,关闭HOST主机的irqbalance想法是,云主机的qume进程不会因为该程序的调度而频繁出现缓存丢失情况,导致性能降低 让英特尔CPU处于高性能模式,属于踩的不小的坑才做的发现的调优参数,因为HOST主机CPU如果处于power save模式,会使得云主机出现休眠会间接导致很多问题,且CPU也没有发挥性能优势,导致性能下降 然后就是HOST主机侧也会有处云主机外的其它辅助、云存储进程等,这些进程也会和云主机qume进程争抢CPU,导致互相影响,导致云主机的性能降低 限定云主机进程尽量在同一个NUMA node中,云主机在同一个NUMA node命中L2 L3缓存的概率大大增加,减少了跨NUMA访问带来的高延迟内存访问,提升云主机访存时间提升性能 当时采用的centos7作为调优的云主机系统,采用单队列,虽然云主机的CPU很多,但是所有的网络流量中断只触发CPU 0 ,导致cpu0的si特别高,而且会出现收包忙不过来的情况,所以在单队列情况下,云主机哪部就开启了RPS 开启RPS后,CPU0其实仍主要被中断触发,但是其软中断时间就短了,cpu0只需要将数据包根据hash挂到其它cpu所处理的队列中,触发个信号告诉其它CPU处理即可,其它CPU就把该数据包触发软中断,消耗该数据包。这样就会cpu0减轻了很大的负担,他就能处理更多的网络流量了 开启RFS,和RFS配合使用在哪个CPU收包,就使用哪个CPU处理上层应用程序,这样可以减少缓存的miss,提升性能 taskset绑定应用程序,这个其实属于更细粒度的调优方法了,让应用程序固定在某一个CPU,减少缓存miss,当然绑定CPU应该尽量的空闲,保证出现低争抢,效果会更好。 上述算是调整思想的一个汇总,其实还有一部分没有提到就是云主机CPU绑定方法,所有的做法其实和CPU绑定出于同源,CPU绑定好了,再稍稍调整下参数,单TCP带宽问题不大 CPU绑定方法virsh emulatorpin instance-000000a8 5 –live virsh vcpupin instance-000000a8 0 6 –live virsh vcpupin instance-000000a8 1 7 –live 将和云主机中断模拟的线程交给vcpu5处理,将云主机的cpu交给无力cpu6和7号处理 结果经过上述操作,带带宽稳定在8000Mb以上,达到调优目标了 问题分析阶段5经过上述折腾后,于是和用户申请现网测试环境,开始调试工作,主要做的动作如下 云主机内开启RPS和RFS1234567echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpusecho 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cntecho 4096 > /proc/sys/net/core/rps_sock_flow_entriesecho 0 > /sys/class/net/eth0/queues/rx-0/rps_cpusecho 0 > /sys/class/net/eth0/queues/rx-0/rps_flow_cntecho 0 > /proc/sys/net/core/rps_sock_flow_entries host主机隔离以及绑定做如下操作: 1,分离部分跑的较高的进程到0,2,4,6 vcpu, 2,将云主机进程binding到8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 taskset -cp 0,2,4,6 939 taskset -cp 0,2,4,6 195385 taskset -cp 0,2,4,6 117891 taskset -cp 0,2,4,6 4575 taskset -cp 0,2,4,6 117972 taskset -cp 0,2,4,6 195374 查看是否生效 taskset -cp 939 taskset -cp 195385 taskset -cp 117891 taskset -cp 
4575 taskset -cp 117972 taskset -cp 195374 virsh emulatorpin instance-0000025f 8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 查看是否生效 taskset -cp 55428 经过调整后,用户进行应用测试,发现延迟波动很小,虽然tap有小部分丢包,但是ping fip没有出现丢包了,满足了用户的需求 问题分析阶段6 虽然知道怎么调整了,可是怎样形成可以实施的方案? 且还有一点没有考虑清楚,其实云主机的进程有两种,一种是模拟中断处理网络的线程;一种是云主机工作vcpu;前面主要是针对模拟中断处理网络的线程做绑定操作,而对云主机工作vcpu并没有进行绑定动作 于是vcpupin也需要运用到方案中 解析 方案1 将HOST物理主机CPU使用率超过2%的进程固定到0-3CPU处理,将云主机emulator pin 到4-7,vcpu pin 到剩余的CPU 方案2 将HOST物理主机CPU使用率超过2%的进程固定到0-3CPU处理,将云主机emulator pin 到4-7,vcpu pin到同一个NUMA 方案3 isolcpus/nohz_full/rcu_nocbs隔离出CPU5-31,将云主机emulator pin 到4-7,一对一vcpu pin到同一个NUMA的物理cpu 于是对上述的方案进行验证工作,发现效果:方案1 > 方案2 > 方案3 但是方案3需要重启,对于已经上线的云平台来说,重启物理主机会带来较大的风险,相比较方案2比方案1稍稍麻烦一些,于是方案1 是最佳选择 问题总结 经过上述过程,该tap丢包问题从定位到最终的解决方案算是完成了,具体的技术细节就不在此文描述了,你可以自己google,当然后续文章会对整个收包流程进行分析 并结合当前问题定位过程,进行一轮理论与实践的结合分析,目的是将整个数据包从物理网卡到云主机应用程序之间到底经过了什么步骤表达清楚,后续在此遇到丢包等问题的时候 可以快速地定位问题,给出问题的可能性,举一反三。","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"tap口丢包","slug":"tap口丢包","permalink":"http://vcpu.me/tags/tap口丢包/"},{"name":"云主机带宽测试不稳定","slug":"云主机带宽测试不稳定","permalink":"http://vcpu.me/tags/云主机带宽测试不稳定/"},{"name":"ping云主机高延迟","slug":"ping云主机高延迟","permalink":"http://vcpu.me/tags/ping云主机高延迟/"},{"name":"云主机cpu绑定","slug":"云主机cpu绑定","permalink":"http://vcpu.me/tags/云主机cpu绑定/"},{"name":"高性能云主机方法","slug":"高性能云主机方法","permalink":"http://vcpu.me/tags/高性能云主机方法/"},{"name":"NUMA优化概念","slug":"NUMA优化概念","permalink":"http://vcpu.me/tags/NUMA优化概念/"},{"name":"CPU挤压导致性能下降","slug":"CPU挤压导致性能下降","permalink":"http://vcpu.me/tags/CPU挤压导致性能下降/"},{"name":"使用CPU资源的隔离","slug":"使用CPU资源的隔离","permalink":"http://vcpu.me/tags/使用CPU资源的隔离/"}]},{"title":"linux服务端程序tunning过程简要总结","slug":"tunning_simple","date":"2018-06-11T10:00:00.000Z","updated":"2018-06-12T02:16:01.000Z","comments":true,"path":"tunning_simple/","link":"","permalink":"http://vcpu.me/tunning_simple/","excerpt":"如何定位性能瓶颈问题?步骤1:网络丢包的原因定位 ethtool -S eth0 中的rx_dropped代表丢包发生在linux 
内核缓冲区,丢包发生在docker的虚拟网卡上 cat /proc/net/softnet_stat12345root@xnet-controller1:~# cat /proc/net/softnet_stat0f77ed22 00000000 00000b03 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000101a1151 00000000 000007d9 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000001fa636a0 00000000 0000639d 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000000f9afaf9 00000000 00000f08 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 第一列:中断处理程序接收的帧数 第二列:由于超过cpu队列 netdev_max_backlog 1000 丢弃帧数 第三列:net_rx_action 函数中处理数据包超过300 netdev_budget 或者 运行时间超过2个时间片 内核为每个cpu都存储一个softnet_data对象,如果所有中断都由一个cpu核处理,如果接受数据包量非常大,超过中断处理速度,就会导致超过netdev_max_backlog而丢包","text":"如何定位性能瓶颈问题?步骤1:网络丢包的原因定位ethtool -S eth0rx_dropped代表丢包发生在linux 内核缓冲区,丢包发生在docker的虚拟网卡上 cat /proc/net/softnet_stat12345root@xnet-controller1:~# cat /proc/net/softnet_stat0f77ed22 00000000 00000b03 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000101a1151 00000000 000007d9 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000001fa636a0 00000000 0000639d 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000000f9afaf9 00000000 00000f08 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 第一列:中断处理程序接收的帧数 第二列:由于超过cpu队列 netdev_max_backlog 1000 丢弃帧数 第三列:net_rx_action 函数中处理数据包超过300 netdev_budget 或者 运行时间超过2个时间片 内核为每个cpu都存储一个softnet_data对象,如果所有中断都由一个cpu核处理,如果接受数据包量非常大,超过中断处理速度,就会导致超过netdev_max_backlog而丢包 top查看%si软中断分布情况cpu0比其它要高很多 查看网卡型号确认网卡队列1lspci -vvv | grep Eth 中断分布均匀?方法1: cat /proc/interrupts 方法2: systemtap工具 1234567891011121314151617181920212223242526272829303132333435363738394041global hard, soft, wqprobe irq_handler.entry {hard[irq, dev_name]++;}probe timer.s(1) {println(\"==irq number:dev_name\")foreach( [irq, dev_name] in hard- limit 5) {printf(\"%d,%s->%d\\n\", irq, kernel_string(dev_name), hard[irq, dev_name]); }println(\"==softirq cpu:h:vec:action\")foreach( [c,h,vec,action] in soft- 
limit 5) {printf(\"%d:%x:%x:%s->%d\\n\", c, h, vec, symdata(action), soft[c,h,vec,action]); }println(\"==workqueue wq_thread:work_func\")foreach( [wq_thread,work_func] in wq- limit 5) {printf(\"%x:%x->%d\\n\", wq_thread, work_func, wq[wq_thread, work_func]); }println(\"\\n\")delete harddelete softdelete wq}probe softirq.entry {soft[cpu(), h,vec,action]++;}probe workqueue.execute {wq[wq_thread, work_func]++}probe begin {println(\"~\")} 中断分布非常不均匀,主要集中在CPU0上 步骤2:如何优化首先是数据包超过了netdev_max_backlog造成了丢弃,那么调大netdev_max_backlog? 不靠谱,根据生产消费模型来讲,问题点出在消费的慢,所以应该直面问题本质,问题本质是中断处理过慢,即使不丢包问题也会体现在其它方面,比如服务器响应慢。 配置多个core处理中断 处理中断的CPU和业务的CPU分开,这样不回相互中断影响 考虑NUMA架构因素,将中断和应用程序使用相同NUMA 步骤3:如果在经过了步骤1和步骤2,你对性能目标没有达到,那就说明问题出在应用程序本身使用如下方法去定位吧 perf systap valgrind dropwatch netstat -s ethtool -S netstat -nat ifconfig ftrace valgrind -e -b查看线程调用栈找到性能瓶颈 频繁gdb查看调用栈找到性能瓶颈 查看内核调用栈找到性能瓶颈 非常有意思的点Linux wake affinity特性当两个NUMA节点处理中断时,CPU实例化的softnet_data以及驱动分配的sk_buffer都可能是跨node的,数据接收后对上层应用程序来说,跨node访问的几率也大大提高,并且无法充分利用L2、L3 cache,增加了延时。 同时,由于Linux wake affinity 特性,如果两个进程频繁互动,调度系统会觉得它们很有可能共享同样的数据,把它们放到同一CPU核心或NUMA Node有助于提高缓存和内存的访问性能,所以当一个进程唤醒另一个的时候,被唤醒的进程可能会被放到相同的CPU core或者相同的NUMA节点上。此特性对中断唤醒进程时也起作用,在上一节所述的现象中,所有的网络中断都分配给CPU 0去处理,当中断处理完成时,由于wakeup affinity特性的作用,所唤醒的用户进程也被安排给CPU 0或其所在的numa节点上其他core。而当两个NUMA node处理中断时,这种调度特性有可能导致应用进程在CPU core之间频繁迁移,造成性能损失。 参考https://tech.meituan.com/Redis_High_Concurrency_Optimization.html https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"linux高性能tunning","slug":"linux高性能tunning","permalink":"http://vcpu.me/tags/linux高性能tunning/"},{"name":"服务端程序性能","slug":"服务端程序性能","permalink":"http://vcpu.me/tags/服务端程序性能/"},{"name":"网络丢包","slug":"网络丢包","permalink":"http://vcpu.me/tags/网络丢包/"},{"name":"性能优化方法","slug":"性能优化方法","permalink":"http://vcpu.me/tags/性能优化方法/"},{"name":"Linux wake 
affinity","slug":"Linux-wake-affinity","permalink":"http://vcpu.me/tags/Linux-wake-affinity/"}]},{"title":"网卡rx dropped问题","slug":"rx_dropped","date":"2018-06-06T10:00:00.000Z","updated":"2018-06-12T02:20:15.000Z","comments":true,"path":"rx_dropped/","link":"","permalink":"http://vcpu.me/rx_dropped/","excerpt":"","text":"问题有一些设备出现rx dropped,如下图示,或者通过ifconfig 查看 详细定位error分布 我遇到了物理网卡的dropped丢包问题,详细查找丢包原因可以通过ethtool -S ens2f0 查看具体丢包分布,如上图,所有的丢包点在rx_missed_errors, 那么rx_missed_errors 代表什么? 我又能做些什么?rx_fifo_errors,rx_over_errors丢包参数含义是什么?后文主要围绕这三点进行说明 rx_missed_errors 代表什么?物理网卡接受数据buffer已经满了,无法接收数据计数 rx_missed_errors代表着DMA传送完成数据,发送硬件中断前,网卡的FIFO缓冲已经满了,导致数据丢掉的个数 Counts the number of missed packets. Packets are missed when the receive FIFO has insufficient spaceto store the incoming packet. This can be caused because of too few buffers allocated, or because thereis insufficient bandwidth on the PCI bus. Events setting this counter causes ICR.Rx Miss, the ReceiverOverrun Interrupt, to be set. This register does not increment if receives are not enabled.These packets are also counted in the Total Packets Received register as well as in Total OctetsReceived. 
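要判断 rx_missed_errors 这类计数是否还在持续增长,可以周期性采样 ethtool -S 的输出并做差值对比。下面是一个示意性的 Python 小例子,其中的采样文本和计数值均为虚构的示例数据,只演示解析与对比的思路:

```python
# 对比两次 ethtool -S 风格的采样,找出仍在增长的计数器
# 注意:sample_t0 / sample_t1 中的计数值均为虚构示例
def parse_stats(text):
    """把 'name: value' 形式的统计输出解析成 dict"""
    stats = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition(":")
        if value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats

def grown_counters(before, after):
    """返回两次采样之间有增长的 (计数器名, 增量) 列表"""
    return [(k, after[k] - before[k])
            for k in after if k in before and after[k] > before[k]]

sample_t0 = """
rx_packets: 104353
rx_missed_errors: 12
rx_fifo_errors: 0
"""
sample_t1 = """
rx_packets: 204353
rx_missed_errors: 12
rx_fifo_errors: 0
"""

# rx_missed_errors 没有增长,多半是上线初期的突发,可以暂不处理
print(grown_counters(parse_stats(sample_t0), parse_stats(sample_t1)))
# 输出: [('rx_packets', 100000)]
```

实际使用时,把两个样本换成间隔若干分钟、通过 subprocess 执行 ethtool -S 网卡名 得到的真实输出即可;若 rx_missed_errors 持续出现在增长列表里,再去排查网卡或链路。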
rx_no_buffer网卡通过DMA将设备FIFO中的数据传送给ring buffer时候,已经无内存可以放数据了,也就是说ringbuffer不足了,往往是系统处理数据较慢导致 rx_over_errors系统ring buffer 满了,数据放不进去了,物理网卡无法通过DMA传送数据,该数据包被丢弃,尝尝发生在系统无法及时处理流量时候。 总结:其实上面计数信息,不同网卡计数信息含义不同,在此不做解释了,具体遇到问题,再查找具体含义吧 不管怎么说出现rx_missed_errors代表还有有流量突发,导致网卡fifo满了,个人认为应该观察rx_missed_errors 后续增长情况,如果长时间不增长,可以不用做调整 如果持续增长,你就要看看你的网卡是不是次品了;如果很少切少量的增长,其实最难定位,像我遇到的这种情况,只有rx_missed_errors 没有rx_fifo_errors和rx_over_errors计数,网卡可以通过DMA将数据送给 系统内存中,这就很奇怪了,所以我怀疑,有两种可能性导致该问题: 可能性较大:在机器刚上线时候,网卡速率和交换机速率没有调整好导致,该计数后续不会增长了 可能性较小:网卡异常","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"rx dropped","slug":"rx-dropped","permalink":"http://vcpu.me/tags/rx-dropped/"},{"name":"rx_missed_errors","slug":"rx-missed-errors","permalink":"http://vcpu.me/tags/rx-missed-errors/"},{"name":"rx_over_errors","slug":"rx-over-errors","permalink":"http://vcpu.me/tags/rx-over-errors/"}]},{"title":"二层广播和三层广播","slug":"network_brocast","date":"2018-06-03T10:00:00.000Z","updated":"2018-06-12T02:27:01.000Z","comments":true,"path":"network_brocast/","link":"","permalink":"http://vcpu.me/network_brocast/","excerpt":"","text":"广播只有2层的。跨网段访问是寻址。 三层交换机等同于路由器。也就是说源ip和目的ip在数据包里一直不变(不经过nat的情况),mac地址将时刻改变。广播只存在二层。三层情况下是路由寻址。 二层广播:全F的二层广播帧(交换机一般所发送的广播帧) 三层广播:192.168.1.255&255.255.255.255(区别是一个网段内的广播&全网络广播,一般是路由传播。但默认情况下路由器不转发广播包) 广播地址是一种特殊的IP地址形式,一种是直接广播地址,一种是有限广播地址。 直接广播地址包含一个有效的网络号和一个全”1”的主机号,如你说的202.163.30.255,255就是一个主机号,202则是C类的IP地址,C类IP地址就是我们常接触到的。 受限广播地址是32位全1的IP地址(255.255.255.255)。该地址用于主机配置过程中IP数据报的目的地址,此时,主机可能还不知道它所在网络的网络掩码,甚至连它的IP地址也不知道。在任何情况下,路由器都不转发目的地址为受限的广播地址的数据报,这样的数据报仅出现在本地网络中。 受限广播可以用在计算机不知道自己IP地址的时候,比如向DHCP服务器索要地址时、PPPOE拨号时等. 直接广播可用于本地网络,也可以跨网段广播,比如主机192.168.1.1/30可以发送广播包到192.168.1.7,使主机192.168.1.5/30也可以接收到该数据包,前提是之间的路由器要开启定向广播功能. 
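上文的直接广播地址(网络号+全1主机号)可以直接计算出来,下面用 Python 标准库 ipaddress 验证文中的两个例子:

```python
import ipaddress

# 直接广播地址 = 网络号 + 全1主机号
net = ipaddress.ip_network("192.168.1.4/30")   # 包含主机 192.168.1.5、192.168.1.6
print(net.broadcast_address)                   # 192.168.1.7,即上文定向广播的目的地址

# C 类网段 202.163.30.0/24 的直接广播地址
print(ipaddress.ip_network("202.163.30.0/24").broadcast_address)  # 202.163.30.255

# 受限广播地址固定为全1,路由器不会转发
print(ipaddress.ip_address("255.255.255.255"))
```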
另外,无论是何种广播,它的第二层目的MAC地址都是FF-FF-FF-FF-FF-FF,这样交换机也就可以往外flood广播包。区别于多播,多播的MAC帧头是01-00-5E+组播组地址的低23位 eg: 因为访问的目的地址是直接广播地址,所以云主机并没有学习其arp,而是广播发出;如果不是广播地址,云主机会首先发arp学习其目的mac,然后将目的mac填入icmp数据包,将icmp数据包发出去 DHCP 请求目的地址是255.255.255.255三层受限广播,且目的MAC是全F","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"二层广播","slug":"二层广播","permalink":"http://vcpu.me/tags/二层广播/"},{"name":"三层广播","slug":"三层广播","permalink":"http://vcpu.me/tags/三层广播/"}]},{"title":"openstack and Dragonflow DNS","slug":"openstackdns","date":"2018-05-28T10:00:00.000Z","updated":"2018-06-08T12:28:32.000Z","comments":true,"path":"openstackdns/","link":"","permalink":"http://vcpu.me/openstackdns/","excerpt":"为什么会有这篇文章,这篇文章总结了openstack和dragonflow有关DNS的技术点,search 和 domain 经常出现在dns配置当中,于是就借此机会来个大总结。详细见后文 默认情况下的三层云主机的DNS云主机内部的DNS配置,如下图所示 针对于linux系统,DNS配置位于/etc/resolv.conf 默认openstack配置建立的云主机DNS指向DHCP服务器的地址,云主机内所有DNS请求流量会发给网络节点的DHCP namespace,该DHCP namespace中含有一个监听53端口的dnsmasq服务,接收到该DNS请求后 会向已经配置的DNS服务器转发DNS请求,如果配置的外网DNS服务器不可达,则该53端口的DNS代理服务向云主机回复refused,告知云主机无法获取DNS","text":"为什么会有这篇文章,这篇文章总结了openstack和dragonflow有关DNS的技术点,search 和 domain 经常出现在dns配置当中,于是就借此机会来个大总结。详细见后文 默认情况下的三层云主机的DNS云主机内部的DNS配置,如下图所示 针对于linux系统,DNS配置位于/etc/resolv.conf 默认openstack配置建立的云主机DNS指向DHCP服务器的地址,云主机内所有DNS请求流量会发给网络节点的DHCP namespace,该DHCP namespace中含有一个监听53端口的dnsmasq服务,接收到该DNS请求后 会向已经配置的DNS服务器转发DNS请求,如果配置的外网DNS服务器不可达,则该53端口的DNS代理服务向云主机回复refused,告知云主机无法获取DNS 默认情况下的二层云主机的DNS 在创建二层云主机子网的时候,往往指定DNS服务器地址,这样通过该DHCP获取ip的云主机DNS会设置成配置的DNS服务器地址 Dragonflow DNS设置通过研究openstack DNS设置情况可知,云主机内部的DNS解析地址是根据DHCP服务器获取的,如果没有直接指定就使用DHCP服务器地址作为DNS解析地址。 Dragonflow 支持分布式DHCP,其默认DNS发生了些变化,dragonflow核心思想是为了去除集中式,拥抱分布式,所以DHCP服务器分布到不同HOST节点上,还去除了dnsmasq服务 它的默认配置可以在dragonflow配置中[df_dhcp_app]指定,当然如果此时建立子网时候也配置了DNS服务器地址,按照子网配置为准。在子网没有配置DNS服务器时,就采用dragonflow配置的df_dns_servers服务器 如果没有做dragonflow配置,就按照dragonflow代码中指定的默认配置,默认是8.8.8.8 和8.8.4.4;当然在中国建议改成114.114.114.114 DNS疑惑知识点 关于DNS 云主机常用的配置会在下面进行解释,该解释主要通过实验方式;且会将实验数据也进行记录,并分析,详见下文 search openstacklocal 
是什么?背景在/etc/resolv.conf 配置如下 123# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTENsearch openstacklocalnameserver 114.114.114.114 试验1:12stack@p-controller:~/dragonflow/dragonflow/controller/apps$ ping abping: unknown host ab 12345listening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes19:39:53.287070 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 77: 192.168.56.155.38054 > 114.114.114.114.53: 16466+ A? ab.openstacklocal. (35)19:39:53.295123 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 152: 114.114.114.114.53 > 192.168.56.155.38054: 16466 NXDomain 0/1/0 (110)19:39:53.295365 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 62: 192.168.56.155.42830 > 114.114.114.114.53: 57622+ A? ab. (20)19:39:53.302207 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 137: 114.114.114.114.53 > 192.168.56.155.42830: 57622 NXDomain 0/1/0 (95) 说明:域名ab 没有点,认为其不是域名的可能性大些,因为配置了openstacklocal,所以先添加后缀,进行查找,查找失败后以域名的方式再次进行查找 试验2:12stack@p-controller:~/dragonflow/dragonflow/controller/apps$ ping ab.cping: unknown host ab.c 123419:47:07.151911 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 64: 192.168.56.155.46028 > 114.114.114.114.53: 14292+ A? ab.c. (22)19:47:07.159660 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 139: 114.114.114.114.53 > 192.168.56.155.46028: 14292 NXDomain 0/1/0 (97)19:47:07.159833 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 79: 192.168.56.155.57417 > 114.114.114.114.53: 10459+ A? ab.c.openstacklocal. (37)19:47:07.167700 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 154: 114.114.114.114.53 > 192.168.56.155.57417: 10459 NXDomain 0/1/0 (112) 说明:ab.c被认为是域名,所以先进行查找,如果无法查找到ip后,加上openstacklocal继续进行查找 试验312stack@p-controller:~/dragonflow/dragonflow/controller/apps$ ping ab.ping: unknown host ab. 
1219:48:21.188557 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 62: 192.168.56.155.45154 > 114.114.114.114.53: 27691+ A? ab. (20)19:48:21.196942 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 137: 114.114.114.114.53 > 192.168.56.155.45154: 27691 NXDomain 0/1/0 (95) 说明: ab. 以. 结尾,则被认为是域名,不需要继续在域openstacklocal 查找,所以不会继续添加.openstacklocal. 后继续dns查询 试验4:12345stack@p-controller:~/dragonflow/dragonflow/controller/apps$ ping www.qq.comPING www.qq.com (180.163.26.39) 56(84) bytes of data.64 bytes from 180.163.26.39: icmp_seq=1 ttl=50 time=7.38 ms64 bytes from 180.163.26.39: icmp_seq=2 ttl=50 time=5.52 ms64 bytes from 180.163.26.39: icmp_seq=3 ttl=50 time=5.23 ms 123419:49:18.360814 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 70: 192.168.56.155.46482 > 114.114.114.114.53: 35785+ A? www.qq.com. (28)19:49:18.368714 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 86: 114.114.114.114.53 > 192.168.56.155.46482: 35785 1/0/0 A 180.163.26.39 (44)19:49:18.377071 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 86: 192.168.56.155.40497 > 114.114.114.114.53: 55908+ PTR? 39.26.163.180.in-addr.arpa. (44)19:49:18.383946 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 145: 114.114.114.114.53 > 192.168.56.155.40497: 55908 NXDomain 0/1/0 (103) 说明:www.qq.com 被认为是域名,先进行查找,结果获取到了ip,所以就没有继续添加.openstacklocal. 后继续dns查询 总结:如果域名无法查询到ip,配置search 后,会继续添加配置的serach 域再次查找 如果是域名称(没有.),则先加上serach 后缀进行查找,如果无法查找到结果后,直接将该域最为终极域名进行查找 如果查找到域名,就不再加上serach 后缀进行查找了 domain openstacklocal是什么?背景在/etc/resolv.conf 配置如下 进行过如下试验,对域名 ab ab. 
.ab ab.c 进行试验 实验1: ab### 解析ab 12stack@p-controller:~/dragonflow/dragonflow/controller/apps$ ping abping: unknown host ab 抓包结果 1234567stack@p-controller:~$ sudo tcpdump -i br-ex -ne -a udptcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes19:22:27.959705 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 77: 192.168.56.155.39978 > 114.114.114.114.53: 58424+ A? ab.openstacklocal. (35)19:22:27.969578 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 152: 114.114.114.114.53 > 192.168.56.155.39978: 58424 NXDomain 0/1/0 (110)19:22:27.969777 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 62: 192.168.56.155.42798 > 114.114.114.114.53: 30556+ A? ab. (20)19:22:28.096722 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 137: 114.114.114.114.53 > 192.168.56.155.42798: 30556 NXDomain 0/1/0 (95) 抓包结果解析说明 如上,抓取的数据包所示,配置了domain openstacklocal , 因为该查询的名称最后没有小数字点,就自动补充 .openstacklocal. 到ab 的末尾 实验2: 解析ab. 解析ab. 12stack@p-controller:~/dragonflow/dragonflow/controller/apps$ ping ab.ping: unknown host ab. 抓包结果 12319:31:43.677121 08:00:27:12:3f:5b > d6:aa:3e:a1:1c:bb, ethertype IPv4 (0x0800), length 62: 192.168.56.155.39224 > 114.114.114.114.53: 17912+ A? ab. (20)19:31:43.678045 66:2f:fa:40:17:6e > 52:54:00:12:35:00, ethertype IPv4 (0x0800), length 62: 1.1.1.14.39224 > 114.114.114.114.53: 17912+ A? ab. (20)19:31:43.685143 d6:aa:3e:a1:1c:bb > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 137: 114.114.114.114.53 > 192.168.56.155.39224: 17912 NXDomain 0/1/0 (95) 抓包结果解析说明 如上,抓取的数据包所示,虽然配置了domain openstacklocal , 但是解析的域名 ab. 末尾含有 “.” ,因此不会给其末尾加上 .openstacklocal. 当然也尝试过 .ab ab.c 这两种情况,一种情况 “.” 在首部 ,一种情况 ‘.’ 在中间, “.”放在首部,域名无法解析 在中间情况和试验1 相同 总结“domain”指定本地的域名,如果查詢時的名称沒有末尾包含小數點,自动补充domain 带到末尾 如果多个nameserver 执行顺序如何,主DNS 异常,多久能切换到备用DNS ? 
使用cirros系统进行测试工作,这种情况下,cirros云主机会先使用第一个nameserver地址,如果无法联通,尝试约5s后,使用第二个nameserver地址,所以应该使得nameserver中所有的dns列表均是可用的dns 所以nameserver中配置的域名一定要是高可用的,且不能存在废弃的域名解析,假设,你在DNS首位配置了一个无法解析的DNS服务器地址,虽然第二个DNS服务器可用,针对于linux 那么你每次DNS解析 要花费5s+ 的时间,这就给用户带来了非常差的上网体验","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"dragonflow dns相关","slug":"dragonflow-dns相关","permalink":"http://vcpu.me/tags/dragonflow-dns相关/"},{"name":"云主机dns解析","slug":"云主机dns解析","permalink":"http://vcpu.me/tags/云主机dns解析/"},{"name":"openstack 云主机stack","slug":"openstack-云主机stack","permalink":"http://vcpu.me/tags/openstack-云主机stack/"},{"name":"DNS试验","slug":"DNS试验","permalink":"http://vcpu.me/tags/DNS试验/"},{"name":"search","slug":"search","permalink":"http://vcpu.me/tags/search/"},{"name":"domain","slug":"domain","permalink":"http://vcpu.me/tags/domain/"}]},{"title":"Dragonflow安全组实现三(详细)","slug":"dragonflow_security_group3","date":"2018-05-21T10:00:00.000Z","updated":"2018-06-12T06:39:18.000Z","comments":true,"path":"dragonflow_security_group3/","link":"","permalink":"http://vcpu.me/dragonflow_security_group3/","excerpt":"流量从其它云主机发起,然后该云主机被动回复的流量场景下,此情况的安全组工作情况入方向安全组入方向安全组是指控制数据流量进入云主机的安全组,下文是针对于这种安全组的详细分析。分析方式有多种,下文是按照流量时刻进行分析的,流量时刻包含未放通入方向安全组、放通安全组第一个数据包、放通安全组建立会话后;根据这些流量时刻可以观察安全组起作用的整个过程,方便理解。","text":"流量从其它云主机发起,然后该云主机被动回复的流量场景下,此情况的安全组工作情况入方向安全组入方向安全组是指控制数据流量进入云主机的安全组,下文是针对于这种安全组的详细分析。分析方式有多种,下文是按照流量时刻进行分析的,流量时刻包含未放通入方向安全组、放通安全组第一个数据包、放通安全组建立会话后;根据这些流量时刻可以观察安全组起作用的整个过程,方便理解。 未放通入方向安全组12cookie=0x0, duration=2362.677s, table=105, n_packets=1563, n_bytes=153174, idle_age=0, priority=100,ip,reg7=0xa3 actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=4987.907s, table=110, n_packets=662, n_bytes=71525, idle_age=334, priority=1 actions=drop 数据根据流量流入主机的标记0xa3判定数据包需要进入安全组流程 将流量通过ct注入到table110处理,安全组建立的会话被网段信息分割(不同网段建立CT不会冲突,因为不同域) 因为没有方通安全组智能命中默认drop策略,该数据包被丢弃 放通安全组第一个数据包1234cookie=0x0, duration=2362.677s, table=105, 
n_packets=1563, n_bytes=153174, idle_age=0, priority=100,ip,reg7=0xa3 actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=2093.711s, table=110, n_packets=4, n_bytes=392, idle_age=51, priority=11,conj_id=9,ip actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=2204.825s, table=110, n_packets=0, n_bytes=0, idle_age=2204, priority=11,ct_state=+new-est-rel-inv+trk,reg7=0xa3 actions=conjunction(9,1/2)cookie=0xb00000000, duration=1291.328s, table=110, n_packets=0, n_bytes=0, idle_age=1291, priority=11,icmp actions=conjunction(9,2/2) 此时安全组被放通了 数据根据流量流入主机的标记0xa3判定数据包需要进入安全组流程 在安全组判断流程table110中,有一种特殊的判定条件就是conj_id=9,这种条件的意思是匹配条件仍在该table中,查找流量会继续查找匹配conjunction(9,1/2) conjunction(9,2/2),如果能够同时命中这两个条件,该会话会commit记录到conntrack中,将该流量交给table115发出处理 conjunction(9,1/2) +new-est-rel-inv+trk 数据包会话未建立完成(一正一反数据包建立称为est)、未关联会话、无异常状态(ovs有问题或者数据包来的状态检测不通过)、数据包是进入ct状态后复制的数据包不是从其它地方误闯进来的 conjunction(9,2/2) 安全组协议条件判断 conjunction(9,1/2) 以会话为角度的状态条件判断、conjunction(9,2/2)是协议白名单条件判断 conj_id=9 的意思是必须要同时满足conjunction(9,1/2)、conjunction(9,2/2) 才算满足判断条件才能执行后续的actions 经过过这个流程后,会话状态建立了,但是处于初始状态+new ; 如果条件不满足不会执行commit动作,也就不会建立conntrack表项 stack@p-controller:~/dragonflow$ sudo conntrack -E -e ALL|grep 2.2.2.6 [NEW] icmp 1 30 src=2.2.2.10 dst=2.2.2.6 type=8 code=0 id=43521 [UNREPLIED] src=2.2.2.6 dst=2.2.2.10 type=0 code=0 id=43521 zone=11 那么该安全组产生的会话什么时候更新状态 ? 
在此就先简单讲解吧,作为小标题,因为它并不符合入方向的流量时刻,会话的更新需要该云主机收到流量后,根据协议回复该流量,然后再次进入CT流程会触发如下更新: [UPDATE] icmp 1 30 src=2.2.2.10 dst=2.2.2.6 type=8 code=0 id=43521 src=2.2.2.6 dst=2.2.2.10 type=0 code=0 id=43521 zone=11 放通安全组建立会话12cookie=0x0, duration=2362.677s, table=105, n_packets=1563, n_bytes=153174, idle_age=0, priority=100,ip,reg7=0xa3 actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=4947.447s, table=110, n_packets=1470, n_bytes=144060, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) 方通安全组后续流量 数据根据流量流入主机的标记0xa3判定数据包需要进入安全组流程 在安全组条件table110中,-new+est-rel-inv+trk 会话建立完成了,不用再查找条件了,可以根据已有的会话直接放入到云主机了 出方向安全组流量到达云主机后,云主机回复此协议流量,会再次经过出方向安全组流程,下文详细分析。分析方式有多种,下文是按照流量时刻进行分析的;但是在被动回复的场景下,流量时刻分为第一个回复的数据包、和后续回复的流量;其实这两种流量场景从宏观流表上不存在差异性,但其底层具体的动作是有差异的,第一个回复的数据包时刻,会触发conntrack会话从new变更为est状态; 和后续回复的流量会一直是est状态,不存在状态变化过程。 第一个回复的数据包时刻12cookie=0x0, duration=9890.264s, table=10, n_packets=8389, n_bytes=822122, idle_age=0, priority=100,ip,reg6=0xa3 actions=ct(table=15,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=12447.715s, table=15, n_packets=8658, n_bytes=848484, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) 第一个数据包会根据标记0xa3,查找ct更新ct状态为est,并复制产生一个数据包,将该数据包交给table出安全组规则进行判别 table15 出方向安全组规则查找地,这类规则可分为两种: 一种是状态判断 一种是安全组条件的判断 ,但是在被动回复的场景下我们只关注状态判断就好了;安全组条件判断一般发生在主动出发场景。 状态判断是 -new+est-rel-inv+trk : 只要是est状态数据包、数据包是进入ct状态后复制的数据包不是从其它地方误闯进来的 就算状态检测通过,直接退出出安全组判断,将该流量交给后续流程(table20)处理 后续回复的流量cookie=0x0, duration=9890.264s, table=10, n_packets=8389, n_bytes=822122, idle_age=0, priority=100,ip,reg6=0xa3 actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=12447.715s, table=15, n_packets=8658, n_bytes=848484, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) 因为这个时候会话的状态已经维持在est状态,所以直接可以过table15的状态检查,直接退出安全组,将该流量交给table20处理 流量当前云主机发起,然后该云主机会收到回复的流量,此场景下情况的安全组工作情况这种流量场景会先经过云主机出方向安全组,后经过云主机入方向安全组,和被动回复场景正好相反 出方向安全组默认情况下出方向的安全组都是放通的,所以在此将流量时刻分析直接简化为: 放通第一个数据包,放通安全组建立会话 
放通第一个数据包1234cookie=0x0, duration=11539.315s, table=10, n_packets=8580, n_bytes=840840, idle_age=5, priority=100,ip,reg6=0xa3 actions=ct(table=15,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=11264.259s, table=15, n_packets=258, n_bytes=25284, idle_age=5, priority=11,conj_id=9,ip actions=ct(commit,table=20,zone=NXM_NX_CT_ZONE[])cookie=0xa00000000, duration=11366.388s, table=15, n_packets=0, n_bytes=0, idle_age=11366, priority=11,ip actions=conjunction(9,2/2)cookie=0x0, duration=11406.839s, table=15, n_packets=0, n_bytes=0, idle_age=11406, priority=11,ct_state=+new-est-rel-inv+trk,reg6=0xa3 actions=conjunction(9,1/2) conjunction(9,2/2) 安全组协议条件判断 放通所有的ip数据包 ,加入的安全组策略全部体现在此处 conjunction(9,1/2) 以会话为角度的判断,主要是建立状态后续就不需要多次查询安全组协议条件了 放通安全组建立会话12cookie=0x0, duration=11539.315s, table=10, n_packets=8580, n_bytes=840840, idle_age=5, priority=100,ip,reg6=0xa3 actions=ct(table=15,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=22154.590s, table=15, n_packets=8869, n_bytes=869162, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) 云主机发出的协议流量被回复后,会话状态转为est,后续流量直接查看会话状态 入方向安全组12cookie=0x0, duration=31.769s, table=105, n_packets=13, n_bytes=1274, idle_age=1, priority=100,ip,reg7=0xa3 actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=55.210s, table=110, n_packets=34, n_bytes=3332, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie值: 在table15和table110中,有一种协议放通的策略,该策略和conj_id相同,应该是额外的功能吧,你可以使用该功能很方便的过滤出你想看的安全组流表 默认行为: 出方向全部放通,入方向不放通,如果此云主机作为服务端使用,需要ssh、http等服务,需要额外开启 安全组功能是针对于ip的层的判别策略 安全组的流表优先级不同? 优先级肯定不同,安全组是按照建立的顺序从低到高优先级逐渐延展,这样查找时候会按照优先级高的安全组开始查找,运气差的话要循环查找一遍 不同安全组是如何区分的? 由上文我们知道,conjunction(9,2/1) 是以会话为角度的判断,其实还有一个条件reg6=0x9c,根据这个条件呢;不同云主机接口可以查找不同的安全组策略;这个条件也是区分安全组的关键 安全组的etcd长什么样? 
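上文提到的 conj_id 匹配语义(必须同时命中 conjunction(9,1/2) 和 conjunction(9,2/2) 两个维度,conj_id=9 的流表项才会执行动作)可以用一个极简的 Python 模型来理解;以下代码只是帮助理解的示意,与 OVS 的真实实现无关,数据包字段也是为演示虚构的:

```python
# conjunction(id, k/n) 的语义模型:conj_id 流表项命中的前提是
# 该 id 的全部 n 个维度都至少各命中一条 conjunction 流表项
def conj_matches(packet, dims):
    """dims: {维度编号: [该维度的匹配函数, ...]};所有维度都命中才算命中"""
    return all(any(m(packet) for m in matchers) for matchers in dims.values())

# 对应上文流表:维度1 是"新建会话 + 入口寄存器 reg7=0xa3",维度2 是协议白名单
dims_conj9 = {
    1: [lambda p: p["ct_state"] == "new" and p["reg7"] == 0xA3],
    2: [lambda p: p["proto"] == "icmp"],
}

pkt_ok = {"ct_state": "new", "reg7": 0xA3, "proto": "icmp"}
print(conj_matches(pkt_ok, dims_conj9))   # True:两个维度都命中,执行 ct(commit,...)

pkt_no = {"ct_state": "new", "reg7": 0xA3, "proto": "tcp"}
print(conj_matches(pkt_no, dims_conj9))   # False:协议维度未命中,落到默认 drop
```

每个维度内部是"或"的关系(任意一条命中即可),维度之间是"与"的关系,这正是安全组把"会话状态/端口条件"和"协议白名单"拆成两个维度、避免流表条目笛卡尔积爆炸的原因。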
123456789101112131415[('{\"name\": \"default\", \"unique_key\": 4, \"rules\": [{\"direction\": \"ingress\", \"protocol\": 6, \"ethertype\": \"IPv4\", \"port_range_max\": 65535, \"security_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"topic\": \"5e347d7b2f7541198031e12916acaa0b\", \"version\": 0, \"port_range_min\": 1, \"remote_ip_prefix\": \"0.0.0.0/0\", \"id\": \"19bf513a-89a0-4f2e-91a8-53e7e11c0a7c\"}, {\"direction\": \"egress\", \"ethertype\": \"IPv4\", \"security_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"topic\": \"5e347d7b2f7541198031e12916acaa0b\", \"version\": 0, \"id\": \"6384d6b0-684d-4102-a2f0-44ae2abff45a\"}, {\"direction\": \"egress\", \"ethertype\": \"IPv6\", \"security_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"topic\": \"5e347d7b2f7541198031e12916acaa0b\", \"version\": 0, \"id\": \"63902a8d-375b-485e-ba44-d5679d16128a\"}, {\"remote_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"direction\": \"ingress\", \"ethertype\": \"IPv4\", \"security_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"topic\": \"5e347d7b2f7541198031e12916acaa0b\", \"version\": 0, \"id\": \"79c9bca5-ec8f-4e8c-bb6b-059cac18490c\"}, {\"direction\": \"ingress\", \"protocol\": 1, \"ethertype\": \"IPv4\", \"security_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"topic\": \"5e347d7b2f7541198031e12916acaa0b\", \"version\": 0, \"remote_ip_prefix\": \"0.0.0.0/0\", \"id\": \"d3116bc4-614f-4e3e-889b-562261714e24\"}, {\"remote_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"direction\": \"ingress\", \"ethertype\": \"IPv6\", \"security_group_id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\", \"topic\": \"5e347d7b2f7541198031e12916acaa0b\", \"version\": 0, \"id\": \"f95d6c8e-b4f9-4de0-b16f-87a9bdb4513b\"}], \"topic\": \"5e347d7b2f7541198031e12916acaa0b\", \"version\": 6, \"id\": \"253df9b8-3e41-4153-ae43-6c4990dcd74f\"}’, {u'mod_revision': u'44933', u'create_revision': u'621', u'version': u'3', u'key': 
'/secgroup/253df9b8-3e41-4153-ae43-6c4990dcd74f’})] 远程安全组怎么实现的?研究远程安全组的场景是,位于同一个vpc的两台云主机,拥有不同的安全组策略(安全组default和安全组xxxx),这两个安全组策略均放通了对外主动访问的流量;只有安全组default 的云主机放通了default安全组可以访问的远程安全组访问策略 那么其是如何实现的? 我们研究入方向安全组的变化吧 在远程安全组下发之前: 12345678910111213141516cookie=0x0, duration=422.928s, table=105, n_packets=375, n_bytes=36750, idle_age=13, priority=100,ip,reg7=0xaa actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=422.928s, table=105, n_packets=0, n_bytes=0, idle_age=422, priority=100,ipv6,reg7=0xaa actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=422.828s, table=105, n_packets=262, n_bytes=25676, idle_age=0, priority=100,ip,reg7=0xad actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=422.828s, table=105, n_packets=0, n_bytes=0, idle_age=422, priority=100,ipv6,reg7=0xad actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=425.459s, table=110, n_packets=495, n_bytes=48510, idle_age=13, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=425.459s, table=110, n_packets=0, n_bytes=0, idle_age=425, priority=65534,ct_state=-new+rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=425.459s, table=110, n_packets=0, n_bytes=0, idle_age=425, priority=65534,ct_state=+new+rel-inv+trk,ip actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=425.459s, table=110, n_packets=0, n_bytes=0, idle_age=425, priority=65534,ct_state=+new+rel-inv+trk,ipv6 actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=425.459s, table=110, n_packets=0, n_bytes=0, idle_age=425, priority=65534,ct_state=+inv+trk actions=dropcookie=0x0, duration=422.928s, table=110, n_packets=0, n_bytes=0, idle_age=422, priority=6,conj_id=4,ip actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=422.928s, table=110, n_packets=0, n_bytes=0, idle_age=422, priority=6,conj_id=4,ipv6 actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=422.828s, 
table=110, n_packets=3, n_bytes=294, idle_age=78, priority=13,conj_id=11,ip actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=422.828s, table=110, n_packets=0, n_bytes=0, idle_age=422, priority=13,conj_id=11,ipv6 actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=422.928s, table=110, n_packets=0, n_bytes=0, idle_age=422, priority=6,ct_state=+new-est-rel-inv+trk,reg7=0xaa actions=conjunction(4,1/2)cookie=0x0, duration=422.828s, table=110, n_packets=0, n_bytes=0, idle_age=422, priority=13,ct_state=+new-est-rel-inv+trk,reg7=0xad actions=conjunction(11,1/2)cookie=0x0, duration=425.459s, table=110, n_packets=139, n_bytes=13622, idle_age=0, priority=1 actions=drop 远程安全组下发之后 123456789101112131415161718cookie=0x0, duration=361.237s, table=105, n_packets=326, n_bytes=31948, idle_age=1, priority=100,ip,reg7=0xaa actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=361.237s, table=105, n_packets=0, n_bytes=0, idle_age=361, priority=100,ipv6,reg7=0xaa actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=361.137s, table=105, n_packets=200, n_bytes=19600, idle_age=1, priority=100,ip,reg7=0xad actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=361.137s, table=105, n_packets=0, n_bytes=0, idle_age=361, priority=100,ipv6,reg7=0xad actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=363.768s, table=110, n_packets=397, n_bytes=38906, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=363.768s, table=110, n_packets=0, n_bytes=0, idle_age=363, priority=65534,ct_state=-new+rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=363.768s, table=110, n_packets=0, n_bytes=0, idle_age=363, priority=65534,ct_state=+new+rel-inv+trk,ip actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=363.768s, table=110, n_packets=0, n_bytes=0, idle_age=363, priority=65534,ct_state=+new+rel-inv+trk,ipv6 
actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=363.768s, table=110, n_packets=0, n_bytes=0, idle_age=363, priority=65534,ct_state=+inv+trk actions=dropcookie=0x0, duration=361.237s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=6,conj_id=4,ip actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=361.237s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=6,conj_id=4,ipv6 actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=361.137s, table=110, n_packets=3, n_bytes=294, idle_age=17, priority=13,conj_id=11,ip actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x0, duration=361.137s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=13,conj_id=11,ipv6 actions=ct(commit,table=115,zone=NXM_NX_CT_ZONE[])cookie=0x800000000, duration=361.137s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=13,icmp,nw_src=192.168.58.107 actions=conjunction(11,2/2)cookie=0x800000000, duration=361.137s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=13,icmp,nw_src=192.168.58.109 actions=conjunction(11,2/2)cookie=0x0, duration=361.237s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=6,ct_state=+new-est-rel-inv+trk,reg7=0xaa actions=conjunction(4,1/2)cookie=0x0, duration=361.137s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=13,ct_state=+new-est-rel-inv+trk,reg7=0xad actions=conjunction(11,1/2)cookie=0x0, duration=363.768s, table=110, n_packets=126, n_bytes=12348, idle_age=22, priority=1 actions=drop 通过对比发现: cookie=0x800000000, duration=361.137s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=13,icmp,nw_src=192.168.58.107 actions=conjunction(11,2/2) cookie=0x800000000, duration=361.137s, table=110, n_packets=0, n_bytes=0, idle_age=361, priority=13,icmp,nw_src=192.168.58.109 actions=conjunction(11,2/2) 如果访问安全组xxxx的云主机,源地址是192.168.58.107 192.168.58.109可以访问;其实这两个云主机的安全组都属于安全组default,也就相当于放通了安全组default的访问策略;总结一句话就是远程安全组通关源地址条件实现。通过研究发现dragonflow 
还解决了源ip地址很多的问题,它会通过在一个网段的地址通过mask 方式减少策略条目。 减少条目方法eg: 192.168.58.100 192.168.58.111 换成条件为192.168.58.110/31 cookie=0xa00000000, duration=149.523s, table=110, n_packets=0, n_bytes=0, idle_age=149, priority=13,icmp,nw_src=192.168.58.110/31 actions=conjunction(11,2/2) 远程安全组是同一个网段才下发?远程安全组和是不是同一个网段没有关联,其不关注使用该安全组的云主机是否处于同一个网段,其只根据源ip进行策略添加判断;因此有时候会出现虽然放通了该安全组,但是从来不会被命中情况;其实想想也基本能理解了,作为安全组策略,它们无法智能或者武断的分辨出不同云主机之间连通关系,所以只能一到切实现该功能。 安全组的代码都在什么位置? sg.py 遇到的问题12345678910112018-04-19 06:52:09.803 DEBUG dragonflow.db.api_nb [-] Could not get object 487e697b-d4e7-4471-b958-22b1367c1bd8 from table lport from (pid=28510) get /opt/stack/dragonflow/dragonflow/db/api_nb.py:2962018-04-19 06:52:09.803 DEBUG dragonflow.db.api_nb [-] ('Traceback (most recent call last):n File \"/opt/stack/dragonflow/dragonflow/db/api_nb.py\", line 290, in getn _get_topic(lean_obj),n File \"/opt/stack/dragonflow/dragonflow/db/drivers/etcd_db_driver.py\", line 117, in get_keyn return self._get_key(self._make_key(table, key), key)n File \"/opt/stack/dragonflow/dragonflow/db/drivers/etcd_db_driver.py\", line 123, in _get_keyn raise df_exceptions.DBKeyNotFound(key=key)nDBKeyNotFound: DB Key not found, key=487e697b-d4e7-4471-b958-22b1367c1bd8n',) from (pid=28510) get /opt/stack/dragonflow/dragonflow/db/api_nb.py:2972018-04-19 06:52:09.803 WARNING dragonflow.controller.topology [-] No logical port found for ovs port: OvsPort object2018-04-19 06:52:09.807 DEBUG dragonflow.db.api_nb [-] Could not get object bf8c2c33-b61f-4bf1-937d-ce48ec63439c from table lport from (pid=28510) get /opt/stack/dragonflow/dragonflow/db/api_nb.py:2962018-04-19 06:52:09.807 DEBUG dragonflow.db.api_nb [-] ('Traceback (most recent call last):n File \"/opt/stack/dragonflow/dragonflow/db/api_nb.py\", line 290, in getn _get_topic(lean_obj),n File \"/opt/stack/dragonflow/dragonflow/db/drivers/etcd_db_driver.py\", line 117, in get_keyn return self._get_key(self._make_key(table, key), key)n File 
\"/opt/stack/dragonflow/dragonflow/db/drivers/etcd_db_driver.py\", line 123, in _get_keyn raise df_exceptions.DBKeyNotFound(key=key)nDBKeyNotFound: DB Key not found, key=bf8c2c33-b61f-4bf1-937d-ce48ec63439cn',) from (pid=28510) get /opt/stack/dragonflow/dragonflow/db/api_nb.py:2972018-04-19 06:52:09.807 WARNING dragonflow.controller.topology [-] No logical port found for ovs port: OvsPort object2018-04-19 06:52:09.817 DEBUG dragonflow.db.api_nb [-] Could not get object 71c111ac-3eac-4dba-91c6-6195b50091a4 from table lport from (pid=28510) get /opt/stack/dragonflow/dragonflow/db/api_nb.py:2962018-04-19 06:52:09.817 DEBUG dragonflow.db.api_nb [-] ('Traceback (most recent call last):n File \"/opt/stack/dragonflow/dragonflow/db/api_nb.py\", line 290, in getn _get_topic(lean_obj),n File \"/opt/stack/dragonflow/dragonflow/db/drivers/etcd_db_driver.py\", line 117, in get_keyn return self._get_key(self._make_key(table, key), key)n File \"/opt/stack/dragonflow/dragonflow/db/drivers/etcd_db_driver.py\", line 123, in _get_keyn raise df_exceptions.DBKeyNotFound(key=key)nDBKeyNotFound: DB Key not found, key=71c111ac-3eac-4dba-91c6-6195b50091a4n',) from (pid=28510) get /opt/stack/dragonflow/dragonflow/db/api_nb.py:2972018-04-19 06:52:09.817 WARNING dragonflow.controller.topology [-] No logical port found for ovs port: OvsPort object 1234567891011121314151617181920212223Pdb) bt /usr/local/bin/df-local-controller(10)<module>()-> sys.exit(main()) /opt/stack/dragonflow/dragonflow/cmd/eventlet/df_local_controller.py(17)main()-> df_local_controller.main() /opt/stack/dragonflow/dragonflow/controller/df_local_controller.py(352)main()-> controller.run() /opt/stack/dragonflow/dragonflow/controller/df_local_controller.py(121)run()-> self.nb_api.process_changes() /opt/stack/dragonflow/dragonflow/db/api_nb.py(207)process_changes()-> self._notification_cb(next_update) /opt/stack/dragonflow/dragonflow/controller/df_local_controller.py(292)_handle_update()-> self._handle_db_change(update) 
/opt/stack/dragonflow/dragonflow/controller/df_local_controller.py(306)_handle_db_change()-> self.sync() /opt/stack/dragonflow/dragonflow/controller/df_local_controller.py(137)sync()-> self.topology.check_topology_info() /opt/stack/dragonflow/dragonflow/controller/topology.py(294)check_topology_info()-> lport = self._get_lport(ovs_port) /opt/stack/dragonflow/dragonflow/controller/topology.py(278)_get_lport()-> lport = self.nb_api.get(ovs_port.lport)> /opt/stack/dragonflow/dragonflow/db/api_nb.py(296)get()-> LOG.debug( (Pdb) ovs_port OvsPort(attached_mac=EUI(‘fa:16:3e:f3:c7:0a’), id=u’db4b9d2d-1bb2-4648-9308-0e5aa32719bf’, lport=LogicalPortProxy(id=487e697b-d4e7-4471-b958-22b1367c1bd8), name=u’tap487e697b-d4’, ofport=-1, type=u’compute’) 这个问题已经定位了,是因为在check_topology_info的时候,会根据ovs口,来查找etcd,如果存在下面情况(在ovsdb存在,但实际口已经没有了),会导致etcd中没有数据,因此会出现上述的告警;这个告警属于正常的 ;根本原因是ovs口的No such device问题。","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"openvswitch安全组实现细节","slug":"openvswitch安全组实现细节","permalink":"http://vcpu.me/tags/openvswitch安全组实现细节/"},{"name":"dragonflow安全组实现偏向于安全组作用原理介绍","slug":"dragonflow安全组实现偏向于安全组作用原理介绍","permalink":"http://vcpu.me/tags/dragonflow安全组实现偏向于安全组作用原理介绍/"},{"name":"按照流量时刻分析安全组起作用过程","slug":"按照流量时刻分析安全组起作用过程","permalink":"http://vcpu.me/tags/按照流量时刻分析安全组起作用过程/"},{"name":"linux conntrack","slug":"linux-conntrack","permalink":"http://vcpu.me/tags/linux-conntrack/"}]},{"title":"openvswitch bridge br-ex三层转发","slug":"ovs_l3_forward","date":"2018-05-16T10:00:00.000Z","updated":"2018-06-12T05:28:37.000Z","comments":true,"path":"ovs_l3_forward/","link":"","permalink":"http://vcpu.me/ovs_l3_forward/","excerpt":"ovs bridge br-ex三层转发概念openvswitch bridge三层转发,和linux bridge三层转发类似,只有给br-ex桥配置上ip地址后才存在三层转发的概念。 三层转发的概念和传统的三层转发有一定差异性,给bridge配置地址这种模式是,赋予bridge一种能力,该能力能让访问bridge地址的数据流进入协议栈中的具体应用中;eg:配置bridge一个地址后 我们可以ssh连接这个地址控制这台服务器;这种能把流量送给上层应用的能力叫做三层转发能力。当然这种能力只有给ovs 
bridge配置地址且开启主机的forward功能后才具备。","text":"ovs bridge br-ex三层转发概念openvswitch bridge三层转发,和linux bridge三层转发类似,只有给br-ex桥配置上ip地址后才存在三层转发的概念。 三层转发的概念和传统的三层转发有一定差异性,给bridge配置地址这种模式是,赋予bridge一种能力,该能力能让访问bridge地址的数据流进入协议栈中的具体应用中;eg:配置bridge一个地址后 我们可以ssh连接这个地址控制这台服务器;这种能把流量送给上层应用的能力叫做三层转发能力。当然这种能力只有给ovs bridge配置地址且开启主机的forward功能后才具备。 在br-ex抓包过程发生在哪里?笔者认为,在br-ex上抓取数据包,说明该数据包已经从br-ex上送到上层协议栈处理了。br-ex抓包点在ip协议栈收包过程中。 什么类型流量能在br-ex上抓取到?根据流量模型来讲,广播,组播和单播流量都存在。 我们先从单播ip流量来说明ovs bridge normal情况下三层转发的概念: 前提条件是你给ovs bridge(br-ex)配置ip地址,当数据包到达bridge后的目的mac是网桥的mac(网桥mac也是物理口的mac)时候,这种类型流量会经过br-ex送给协议栈上层处理,你可以在tcpdump -i br-ex抓取到这种流量,这种类型流量会存在两种处理方式 方式1:应用层接受该流量,eg ssh 方式2:无应用层接受该流量,流表是normal情况下,该流量会根据目的地址查路由将该数据包三层转发走 当然还有一部分单播流量也会渗透到br-ex的应用层,这部分流量是属于洪泛流量(也就是ovs-appctl fdb/show br-ex看不到目的mac);这部分流量会流到上层协议栈,而被tcpdump -i br-ex抓取到 那么针对于广播流量和组播流量,这些流量也会传递给应用层一份,这部分流量你可以使用tcpdump -i br-ex抓取到 什么类型流量无法在br-ex上抓取到,只能在物理口抓取到?normal情况下二层转发且不泛洪流量(ovs-appctl fdb/show br-ex 查询已经生成转发表项)和ovs bridge精细流表转发情况下,这种类型流量不会上传到用户态协议栈处理,也就无法在br-ex上抓取到流量","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"openvswitch三层转发","slug":"openvswitch三层转发","permalink":"http://vcpu.me/tags/openvswitch三层转发/"},{"name":"openvswitch配置IP后转发流程","slug":"openvswitch配置IP后转发流程","permalink":"http://vcpu.me/tags/openvswitch配置IP后转发流程/"}]},{"title":"linux策略路由的添加","slug":"policy_router","date":"2018-05-13T10:00:00.000Z","updated":"2018-06-12T05:33:34.000Z","comments":true,"path":"policy_router/","link":"","permalink":"http://vcpu.me/policy_router/","excerpt":"","text":"添加记录1234567891011121314151617181920212223stack@pdf-compute:~$ ip rule show0: from all lookup local32765: from 169.254.169.254 lookup 232766: from all lookup main32767: from all lookup defaultstack@pdf-compute:~$stack@pdf-compute:~$ sudo ip rule add from 192.168.56.0/24 table 10stack@pdf-compute:~$ sudo ip route add default via 192.168.56.10 table 10RTNETLINK answers: File existsstack@pdf-compute:~$ ip route show table 
10 default via 192.168.56.10 dev br-ex stack@pdf-compute:~$ stack@pdf-compute:~$ sudo ip route delete default via 192.168.56.10 table 10 stack@pdf-compute:~$ ip route show table 10 stack@pdf-compute:~$ stack@pdf-compute:~$ sudo ip route add default via 192.168.56.10 table 10 stack@pdf-compute:~$ stack@pdf-compute:~$ ip route show table 10 default via 192.168.56.10 dev br-ex stack@pdf-compute:~$ sudo ip route add 114.114.114.114 via 192.168.56.10 table 10 stack@pdf-compute:~$ ip route show table 10 default via 192.168.56.10 dev br-ex 114.114.114.114 via 192.168.56.10 dev br-ex 步骤总结添加 ip rule add from 192.168.56.0/24 table 10 ip route add default via 192.168.56.10 table 10 ip route add 114.114.114.114 via 192.168.56.10 table 10 删除 ip route delete 114.114.114.114 via 192.168.56.10 table 10 sudo ip route delete default via 192.168.56.10 table 10 sudo ip rule delete from 192.168.56.0/24","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"linux主机策略路由","slug":"linux主机策略路由","permalink":"http://vcpu.me/tags/linux主机策略路由/"}]},{"title":"CONJUNCTIVE MATCH FIELDS原理","slug":"conjuctive","date":"2018-05-08T10:00:00.000Z","updated":"2018-06-12T05:34:17.000Z","comments":true,"path":"conjuctive/","link":"","permalink":"http://vcpu.me/conjuctive/","excerpt":"","text":"问题?一条单独的openflow流表的字段只能匹配一个条件,但是总会有需要匹配多个条件的使用场景。 为了匹配一组字段,可以通过多加对应个数的流表来实现,当然这是最简单高效的方法。 eg: 把源地址是a,b,c,d的发给控制器处理,你可以通过如下多条流表实现: ip,ip_src=a actions=controller ip,ip_src=b actions=controller ip,ip_src=c actions=controller ip,ip_src=d actions=controller 当然与其相似的场景如下,把目的地址是e,f,g,h的发给控制器处理 ip,ip_dst=e actions=controller ip,ip_dst=f actions=controller ip,ip_dst=g actions=controller ip,ip_dst=h actions=controller 但是问题来了,你把上述两种情况的流表都放到一个table中,就会出现问题: 当前情况是,源ip是a,b,c,d,或者目的ip是e,f,g,h,两个条件只要满足一个,数据包就被发给controller处理 很显然,上述流表可能命中多个,一般来讲出现这种情况需要用优先级来区分查找到精确的流表,否则在优先级全部相同的情况下,命中结果就是不确定的,也不知道会命中哪一条流表。 假定,如果想要联合匹配,即源地址和目的地址必须分别在abcd和efgh中选取,那么共需4×4=16条流表来实现此需求,当然在我们举例的场景下这种方法还不算别扭,但是如果情况更复杂呢?
综上所述,conjunction就诞生了(ovs2.4+支持) 如何实现的?conjunction使用一个32-bit id,该32位的id在openflow table中具有唯一性 匹配时首先根据conjunction ID找到第一条流表,然后查找对应的conjunction匹配集合;用conjunction(id, k/n)来表示,其中id为conjunction ID,k是dimension的编号,n是dimension总数 那么怎么样算是满足条件?n个dimension集合中,每一个集合必须有一个满足;这样必须有n个条件满足才算这个conjunction满足,才会执行相应的动作。 eg: conj_id=1234 actions=controller ip,ip_src=10.0.0.1 actions=conjunction(1234, 1/2) ip,ip_src=10.0.0.4 actions=conjunction(1234, 1/2) ip,ip_src=10.0.0.6 actions=conjunction(1234, 1/2) ip,ip_src=10.0.0.7 actions=conjunction(1234, 1/2) ip,ip_dst=10.0.0.2 actions=conjunction(1234, 2/2) ip,ip_dst=10.0.0.5 actions=conjunction(1234, 2/2) ip,ip_dst=10.0.0.7 actions=conjunction(1234, 2/2) ip,ip_dst=10.0.0.8 actions=conjunction(1234, 2/2) 这是两个条件,源IP(dimension 1 of 2)和目的IP(dimension 2 of 2)同时满足的写法;假设有如下流量模型: a. src ip 10.0.0.1 dst ip 114.114.114.114 b. src ip 10.0.0.7 dst ip 10.0.0.8 c. src ip 10.0.0.8 dst ip 10.0.0.1 d. src ip 1.1.1.1 dst ip 2.2.2.2 上述abcd四个选项,只有b是满足条件的,可以执行actions=controller(将数据包发给控制器)","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"openvswitch CONJUNCTIVE","slug":"openvswitch-CONJUNCTIVE","permalink":"http://vcpu.me/tags/openvswitch-CONJUNCTIVE/"},{"name":"openvswitch conjunctive 原理解析","slug":"openvswitch-conjunctive-原理解析","permalink":"http://vcpu.me/tags/openvswitch-conjunctive-原理解析/"}]},{"title":"openvswitch手动编译方法","slug":"build_ovs","date":"2018-04-28T10:00:00.000Z","updated":"2018-06-12T06:06:29.000Z","comments":true,"path":"build_ovs/","link":"","permalink":"http://vcpu.me/build_ovs/","excerpt":"","text":"手动编译编译环境:Linux ubuntu 4.4.0-87-generic 编译安装 apt-get update; apt-get install -y dh-autoreconf libssl-dev libcap-ng-dev openssl python python-pip; pip install six wget http://openvswitch.org/releases/openvswitch-2.7.0.tar.gz; tar zxvf openvswitch-2.7.0.tar.gz && cd openvswitch-2.7.0 ./boot.sh; ./configure --with-linux=/lib/modules/$(uname -r)/build; make -j; make install; make modules_install
加载启动服务 /sbin/modprobe openvswitch; mkdir -p /usr/local/etc/openvswitch; ovsdb-tool create /usr/local/etc/openvswitch/conf.db vswitchd/vswitch.ovsschema; mkdir -p /usr/local/var/run/openvswitch; mkdir -p /usr/local/var/log/openvswitch; ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach --log-file; ovs-vsctl --no-wait init; ovs-vswitchd --pidfile --detach --log-file; export PATH=$PATH:/usr/local/share/openvswitch/scripts; ovs-ctl start 停止服务 kill `cd /usr/local/var/run/openvswitch && cat ovsdb-server.pid ovs-vswitchd.pid`; export PATH=$PATH:/usr/local/share/openvswitch/scripts; ovs-ctl stop 开启服务 ovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach --log-file; ovs-vswitchd --pidfile --detach --log-file; export PATH=$PATH:/usr/local/share/openvswitch/scripts; ovs-ctl start 编译deb数据包使用此方法在4.4.0-87-generic系统下编译openvswitch-2.6.1、openvswitch-2.7.0和openvswitch-2.8.0 apt-get install build-essential fakeroot; dpkg-checkbuilddeps (# 已经编译过,需要首先clean: fakeroot debian/rules clean); DEB_BUILD_OPTIONS='parallel=8 nocheck' fakeroot debian/rules binary","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"openvswitch编译运行过程","slug":"openvswitch编译运行过程","permalink":"http://vcpu.me/tags/openvswitch编译运行过程/"},{"name":"openvswitch 打包deb","slug":"openvswitch-打包deb","permalink":"http://vcpu.me/tags/openvswitch-打包deb/"}]},{"title":"两台云主机处于同一个物理主机的fip和安全组会话冲突","slug":"ovs_connection_crach","date":"2018-04-23T10:00:00.000Z","updated":"2018-06-12T06:15:34.000Z","comments":true,"path":"ovs_connection_crach/","link":"","permalink":"http://vcpu.me/ovs_connection_crach/","excerpt":"","text":"问题描述:在开启安全组情况下,同物理机器上同一个br-int两个绑定fip的云主机icmp通信 时通时不通,TCP协议通信始终无法连通
模拟的脚本1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677#Creating ns1 with IP 10.100.5.8(FIP 192.168.56.31, Mac fa:16:3e:00:eb:c0)sudo ip netns add ns1sudo ovs-vsctl add-br test0sudo ip link add vns1 type veth peer name vpeerns1sudo ip link set vpeerns1 netns ns1sudo ip link set vns1 upsudo ip netns exec ns1 ip link set vpeerns1 address fa:16:3e:1d:3d:01sudo ip netns exec ns1 ip addr add dev vpeerns1 10.100.5.8/24sudo ip netns exec ns1 ip link set vpeerns1 upsudo ip netns exec ns1 ip route add default via 10.100.5.1sudo ovs-vsctl add-port test0 vns1#Creating ns2 with IP 10.100.5.9 Mac fa:16:3e:13:85:be(FIP 192.168.56.32, Mac fa:16:3e:71:b6:6e)sudo ip netns add ns2sudo ip link add vns2 type veth peer name vpeerns2sudo ip link set vpeerns2 netns ns2sudo ip link set vns2 upsudo ip netns exec ns2 ip link set vpeerns2 address fa:16:3e:13:85:besudo ip netns exec ns2 ip addr add dev vpeerns2 10.100.5.9/24sudo ip netns exec ns2 ip link set vpeerns2 upsudo ip netns exec ns2 ip route add default via 10.100.5.1sudo ovs-vsctl add-port test0 vns2#Dispatcher Tablesudo ovs-ofctl -v -O OpenFlow13 add-flow test0 \"table=0,priority=10,in_port=1,ip,actions=goto_table=40\"sudo ovs-ofctl -v -O OpenFlow13 add-flow test0 \"table=0,priority=10,in_port=2,ip,actions=goto_table=40\"sudo ovs-ofctl -v -O OpenFlow13 add-flow test0 \"table=0,priority=42,arp,actions=goto_table=81\"#L3 and Floating Ip Conversion(Router IP 10.100.5.1 Mac fa:16:3e:57:f2:9a)sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=21,priority=42,ip,nw_dst=10.100.5.9 actions=resubmit(,251)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=21,priority=42,ip,nw_dst=10.100.5.8 actions=resubmit(,251)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=21,priority=42,ip,nw_dst=192.168.56.31 actions=set_field:fa:16:3e:00:eb:c0->eth_dst,goto_table:25\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 
\"table=21,priority=42,ip,nw_dst=192.168.56.32 actions=set_field:fa:16:3e:71:b6:6e->eth_dst,goto_table:25\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=25,priority=10,ip,dl_dst=fa:16:3e:00:eb:c0,nw_dst=192.168.56.31 actions=set_field:10.100.5.8->ip_dst,set_field:fa:16:3e:57:f2:9a->eth_src,set_field:fa:16:3e:1d:3d:01->eth_dst,goto_table:26\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=25,priority=10,ip,dl_dst=fa:16:3e:71:b6:6e,nw_dst=192.168.56.32 actions=set_field:10.100.5.9->ip_dst,set_field:fa:16:3e:57:f2:9a->eth_src,set_field:fa:16:3e:13:85:be->eth_dst,goto_table:26\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=26,priority=10,ip,nw_src=10.100.5.8 actions=set_field:192.168.56.31->ip_src,goto_table:28\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=26,priority=10,ip,nw_src=10.100.5.9 actions=set_field:192.168.56.32->ip_src,goto_table:28\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=27,priority=10,ip,nw_dst=10.100.5.8 actions=resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=27,priority=10,ip,nw_dst=10.100.5.9 actions=resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=28,priority=10,ip,nw_src=192.168.56.31 actions=set_field:fa:16:3e:00:eb:c0->eth_src,resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=28,priority=10,ip,nw_src=192.168.56.32 actions=set_field:fa:16:3e:71:b6:6e->eth_src,resubmit(,21)\"#ACL ruless egress from vmsudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=40,priority=61010,ip,dl_src=fa:16:3e:1d:3d:01,nw_src=10.100.5.8,actions=ct(table=41,zone=5002)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=40,priority=61010,ip,dl_src=fa:16:3e:13:85:be,nw_src=10.100.5.9,actions=ct(table=41,zone=5002)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=40,priority=0,actions=drop\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=62020,ct_state=-new+est-rel-inv+trk,actions=resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 
\"table=41,priority=62020,ct_state=-new-est+rel-inv+trk,actions=resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=62020,ct_state=+inv+trk,actions=drop\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=1000,ct_state=+new+trk,ip,dl_src=fa:16:3e:1d:3d:01,nw_src=10.100.5.8,actions=ct(commit,zone=5002),resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=1000,ct_state=+new+trk,ip,dl_src=fa:16:3e:13:85:be,nw_src=10.100.5.9,actions=ct(commit,zone=5002),resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=50,ct_state=+new+trk,actions=drop\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=0,actions=drop\"#ARP respondersudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=81,priority=100,arp,arp_tpa=192.168.56.31,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:fa:16:3e:00:eb:c0->eth_src,load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e00ebc0->NXM_NX_ARP_SHA[],load:0xc0a8381f->NXM_OF_ARP_SPA[],load:0->NXM_OF_IN_PORT[],load:0x500->NXM_NX_REG6[],write_metadata:0/0x1,goto_table:220\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=81,priority=100,arp,arp_tpa=192.168.56.32,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:fa:16:3e:71:b6:6e->eth_src,load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e71b66e->NXM_NX_ARP_SHA[],load:0xc0a83820->NXM_OF_ARP_SPA[],load:0->NXM_OF_IN_PORT[],load:0x500->NXM_NX_REG6[],write_metadata:0/0x1,goto_table:220\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=81,arp,arp_tpa=10.100.5.1,arp_op=1 
actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],set_field:fa:16:3e:57:f2:9a->eth_src,load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e7c661a->NXM_NX_ARP_SHA[],load:0xa640501->NXM_OF_ARP_SPA[],load:0->NXM_OF_IN_PORT[],load:0x100->NXM_NX_REG6[],resubmit(,220)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=81,priority=10,arp,actions=NORMAL\"#Dispatcher Tablesudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=220,priority=62020,ip,dl_dst=fa:16:3e:1d:3d:01,nw_dst=10.100.5.8,actions=output:1\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=220,priority=62020,ip,dl_dst=fa:16:3e:13:85:be,nw_dst=10.100.5.9,actions=output:2\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=220,priority=62020,arp,dl_dst=fa:16:3e:1d:3d:01,actions=output:1\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=220,priority=62020,arp,dl_dst=fa:16:3e:13:85:be,actions=output:2\"#ACL ruless ingress to vmsudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=251,priority=61010,ip,dl_dst=fa:16:3e:1d:3d:01,nw_dst=10.100.5.8,actions=ct(table=252,zone=5002)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=251,priority=61010,ip,dl_dst=fa:16:3e:13:85:be,nw_dst=10.100.5.9,actions=ct(table=252,zone=5002)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=251,priority=0,actions=drop\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=252,priority=62020,ct_state=-new+est-rel-inv+trk,actions=resubmit(,220)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=252,priority=62020,ct_state=-new-est+rel-inv+trk,actions=resubmit(,220)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=252,priority=62020,ct_state=+inv+trk,actions=drop\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=252,priority=50,ct_state=+new+trk,actions=drop\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=252,priority=1000,ct_state=+new+trk,ip,dl_dst=fa:16:3e:1d:3d:01,nw_dst=10.100.5.8,actions=ct(commit,zone=5002),resubmit(,220)\"sudo ovs-ofctl -O 
OpenFlow13 add-flow test0 \"table=252,priority=1000,ct_state=+new+trk,ip,dl_dst=fa:16:3e:13:85:be,nw_dst=10.100.5.9,actions=ct(commit,zone=5002),resubmit(,220)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=252,priority=0,actions=drop\" 分析通过观察流量路径发现,数据包在table15被丢弃,table15是出方向安全组。详细说明如下 eg: 云主机A fip为FIPA 和云主机B fip为FIPB icmp通信 云主机A icmp request数据包可以到达云主机B 云主机B回复该数据包,经过B的出方向安全组模块时候被丢弃,计数如下流表所示 cookie=0x0, duration=1154.456s, table=15, n_packets=0, n_bytes=0, idle_age=18019, priority=65534,ct_state=+inv+trk actions=drop 云主机B回复的数据包被当成异常状态,该数据包被丢弃 另从linux conntrack 会话角度分析问题如下: conntrack -L查看会话 icmp 1 29 src=7.7.7.11 dst=192.168.210.17 type=8 code=0 id=31233 [UNREPLIED] src=192.168.210.17 dst=7.7.7.11 type=0 code=0 id=31233 mark=0 zone=7 use=1 只有正向会话,缺少源地址和目的地址转换后的会话,所以云主机B的reply数据包因查不到会话而被认为是异常状态,reply被丢弃 那么问题可以归纳为:在同一个物理主机上,两个绑定同网段fip的云主机通信,且用流表实现安全组功能时,fip转换后的会话建立异常(icmp有时候能建立,TCP一直建立不了),导致云主机连通访问失败 解决办法:将这两个会话放到不同的ct zone中问题可解决,具体操作是: sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=40,priority=61011,ip,dl_src=fa:16:3e:1d:3d:01,nw_src=10.100.5.8,actions=ct(table=41,zone=5001)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=40,priority=61011,ip,dl_src=fa:16:3e:13:85:be,nw_src=10.100.5.9,actions=ct(table=41,zone=5002)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=1001,ct_state=+new+trk,ip,dl_src=fa:16:3e:1d:3d:01,nw_src=10.100.5.8,actions=ct(commit,zone=5001),resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=41,priority=1001,ct_state=+new+trk,ip,dl_src=fa:16:3e:13:85:be,nw_src=10.100.5.9,actions=ct(commit,zone=5002),resubmit(,21)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=251,priority=61011,ip,dl_dst=fa:16:3e:1d:3d:01,nw_dst=10.100.5.8,actions=ct(table=252,zone=5001)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=251,priority=62011,ip,dl_dst=fa:16:3e:13:85:be,nw_dst=10.100.5.9,actions=ct(table=252,zone=5002)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0
\"table=252,priority=1001,ct_state=+new+trk,ip,dl_dst=fa:16:3e:1d:3d:01,nw_dst=10.100.5.8,actions=ct(commit,zone=5001),resubmit(,220)\"sudo ovs-ofctl -O OpenFlow13 add-flow test0 \"table=252,priority=1001,ct_state=+new+trk,ip,dl_dst=fa:16:3e:13:85:be,nw_dst=10.100.5.9,actions=ct(commit,zone=5002),resubmit(,220)\"","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"openvswitch实现安全组和FIP icmp偶尔通","slug":"openvswitch实现安全组和FIP-icmp偶尔通","permalink":"http://vcpu.me/tags/openvswitch实现安全组和FIP-icmp偶尔通/"},{"name":"openvswitch实现安全组和FIP tcp一直不通","slug":"openvswitch实现安全组和FIP-tcp一直不通","permalink":"http://vcpu.me/tags/openvswitch实现安全组和FIP-tcp一直不通/"},{"name":"fip修改地址后导致ct状态判断inv","slug":"fip修改地址后导致ct状态判断inv","permalink":"http://vcpu.me/tags/fip修改地址后导致ct状态判断inv/"},{"name":"openvswitch实现安全组和FIP会遇到的问题","slug":"openvswitch实现安全组和FIP会遇到的问题","permalink":"http://vcpu.me/tags/openvswitch实现安全组和FIP会遇到的问题/"}]},{"title":"dragonflow2018多外网转发方案","slug":"dragonflow_muti_external_network","date":"2018-04-22T10:00:00.000Z","updated":"2018-06-12T07:02:39.000Z","comments":true,"path":"dragonflow_muti_external_network/","link":"","permalink":"http://vcpu.me/dragonflow_muti_external_network/","excerpt":"","text":"笔者曾经在draonglow开发社区launchpad询问过此问题https://answers.launchpad.net/dragonflow/+question/668047问题描述见我的这篇博文:http://vcpu.me/dnat/但是没有得到答案,下文为我尝试出来的一个可行方案,与大家共享 dragonflow2018 fip转发基本过程dragonflow2018版本distributed dnat(分布式浮动ip)会需要一个放置外网口的ovs桥(后续称呼为br-ex),dragonflow2018的分布式floating ip会将数据包发给该bridge,由该bridge l3 forwarding,将floating ip流量送出外网;在这个过程中外网的网桥需要起到三层网关的作用 多外网fip支持条件条件1:支持多外网fip需要申请多个vlan外网 条件2:多个vlan外网需要在br-ex能够三层转发 条件3:br-ex和上联交换机是三层互通 多个vlan外网需要在br-ex能够三层转发br-ex想要完成vlan流量的三层转发,需要在br-ex上配置fake bridge (br-ex内部口), 给fake bridge配置上三层地址并加上tag属性;且该fake bridge能够和交换机的网关互通。 vlan fip 外向流量的目的mac需要是相同tag接口的mac; 这样vlan fip外出流量在到达br-ex时候会根据路由找到网关mac地址进行三层转发。 那就意味着n个不同vlan网络需要n个fake bridge口才能进行转发。 br-ex和上联交换机是三层互通使用br-ex上对应vlan的fake 
bridge口能够和交换机互联,fake bridge的网关配置在物理交换机上,和网关互通。 两种方案方案1: 互联的地址使用fip网段的地址,fip网关直接落到物理交换机上,在br-ex根据外网vlan号新增两个br-ex三层网关口 eg: 192.168.57.10 属于vlan207 是vlan207 外网fip的网关 192.168.58.10 属于vlan208 是vlan208 外网fip 的网关 vlan 207 外部网络需要在br-ex新增如下配置: sudo ovs-vsctl add-br vlan207 br-ex 207 sudo ifconfig vlan207 192.168.57.155/24 up vlan 208 外部网络需要在br-ex新增如下配置: sudo ovs-vsctl add-br vlan208 br-ex 208 sudo ifconfig vlan208 192.168.58.155/24 up 配置策略路由,不同fip出去的网关不同: sudo ip rule add from 192.168.57.0/24 table 10 sudo ip route add default via 192.168.57.10 table 10 sudo ip rule add from 192.168.58.0/24 table 11 sudo ip route add default via 192.168.58.10 table 11 !!!如果你不想修改mac地址转发逻辑可以直接将新增的fake bridge mac修改和br-ex默认mac地址一样 sudo ifconfig vlan208 hw ether 08:00:27:12:3f:5b 优点:fip网关直接一步落到物理设备上 缺点:每个计算节点占用了一个外网地址,浪费fip 方案2:使用私有网地址完成fake bridge和物理网络的互联","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"dragonflow多外网方案","slug":"dragonflow多外网方案","permalink":"http://vcpu.me/tags/dragonflow多外网方案/"}]},{"title":"Using vlan external netowrk fip,dragonflow dirtributed dnat does not work","slug":"dnat","date":"2018-04-18T04:00:00.000Z","updated":"2018-04-18T03:29:49.000Z","comments":true,"path":"dnat/","link":"","permalink":"http://vcpu.me/dnat/","excerpt":"","text":"My problem is : Dragonflow distributed dnat can work ok when external network type is vlan ? if ok ,dragonflow distributed dnat support two or more external vlan network ? 
Details below: Using flat external netowork fip , dragonflow distributed dnat work fine123456789101112131415161718192021222324252627282930313233+------------------------------------------------------+ +----------------------------------+| | | MARK: C || br-ex enp0s3 |XXXXXXXXXXXXXXXXXXX| 192.168.56.10 || 08:00:27:12:3f:5b | | external network gateway || 192.168.56.155 | | |+------------------------------------------------------+ +----------------------------------+ | | | | | | | +-----------------------------------------------------------+ flows: A -> br-int(dnat) -> br-ex(l3 forward) -> C | | | | | br-int | | | | | +-----------------------------------------------------------+ | | | fa:16:3e:d4:60:2d | flat floating ip 192.168.56.55 | dragonflow distribute dnat | +--------------------+ | 1.1.1.6 | | Mark: A | | | | | +--------------------+ Instructions: 192.168.56.0/24 is my external flat network subnet. using dragonflow dnat A->C icmp flows: A -> br-int(dnat) -> br-ex(l3 forward) -> C 1.1.1.6 with fip ping gateway.( A->C connected). When icmp packet arrived at br-ex , layer3 forwarding is noraml. Catch packets in br-ex interface. 
123456tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes21:40:50.340646 fa:16:3e:d4:60:2d > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 98: 192.168.56.55 > 114.114.114.114: ICMP echo request, id 22785, seq 188, length 6421:40:50.340672 08:00:27:12:3f:5b > be:e5:f2:da:42:46, ethertype IPv4 (0x0800), length 98: 192.168.56.55 > 114.114.114.114: ICMP echo request, id 22785, seq 188, length 6421:40:51.341668 fa:16:3e:d4:60:2d > 08:00:27:12:3f:5b, ethertype IPv4 (0x0800), length 98: 192.168.56.55 > 114.114.114.114: ICMP echo request, id 22785, seq 189, length 6421:40:51.341696 08:00:27:12:3f:5b > be:e5:f2:da:42:46, ethertype IPv4 (0x0800), length 98: 192.168.56.55 > 114.114.114.114: ICMP echo request, id 22785, seq 189, length 64 tcpdump catch packets in enp0s3 12345listening on enp0s3, link-type EN10MB (Ethernet), capture size 262144 bytes21:43:01.601739 08:00:27:12:3f:5b > be:e5:f2:da:42:46, ethertype IPv4 (0x0800), length 98: 192.168.56.55 > 114.114.114.114: ICMP echo request, id 22785, seq 319, length 6421:43:01.609858 be:e5:f2:da:42:46 > fa:16:3e:d4:60:2d, ethertype IPv4 (0x0800), length 98: 114.114.114.114 > 192.168.56.55: ICMP echo reply, id 22785, seq 319, length 6421:43:02.602343 08:00:27:12:3f:5b > be:e5:f2:da:42:46, ethertype IPv4 (0x0800), length 98: 192.168.56.55 > 114.114.114.114: ICMP echo request, id 22785, seq 320, length 6421:43:02.611243 be:e5:f2:da:42:46 > fa:16:3e:d4:60:2d, ethertype IPv4 (0x0800), length 98: 114.114.114.114 > 192.168.56.55: ICMP echo reply, id 22785, seq 320, length 64 Using vlan external netowrk fip,dragonflow dirtributed dnat does not work123456789101112131415161718192021222324252627282930313233+------------------------------------------------------+ +----------------------------------+| | | MARK: C || br-ex enp0s3 |XXXXXXXXXXXXXXXXXXX| 192.168.57.10 || 08:00:27:12:3f:5b | | external vlan network gateway || 192.168.56.155 | | 
|+------------------------------------------------------+ +----------------------------------+ | | + | | | | | | | | | v +-----------------------------------------------------------+ flows: A -> br-int(dnat) -> br-ex(l3 forward) -> C | | | | | br-int | work err here | | | | +-----------------------------------------------------------+ | | | fa:16:3e:39:63:c2 | vlan external network fip : 192.168.57.159 | | +--------------------+ | 1.1.1.6 | | Mark: A | | | | | +--------------------+ Catch packets in br-ex interface. and I find that icmp packet can not be l3 forwarded in br-ex . A -> c disconnect. 123456stack@p-controller:~/devstack$ sudo tcpdump -i br-ex -netcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes21:58:36.278770 fa:16:3e:39:63:c2 > 08:00:27:12:3f:5b, ethertype 802.1Q (0x8100), length 102: vlan 207, p 0, ethertype IPv4, 192.168.57.159 > 114.114.114.114: ICMP echo request, id 23041, seq 15, length 6421:58:37.279352 fa:16:3e:39:63:c2 > 08:00:27:12:3f:5b, ethertype 802.1Q (0x8100), length 102: vlan 207, p 0, ethertype IPv4, 192.168.57.159 > 114.114.114.114: ICMP echo request, id 23041, seq 16, length 6421:58:38.279781 fa:16:3e:39:63:c2 > 08:00:27:12:3f:5b, ethertype 802.1Q (0x8100), length 102: vlan 207, p 0, ethertype IPv4, 192.168.57.159 > 114.114.114.114: ICMP echo request, id 23041, seq 17, length 64 My problem is : Dragonflow distributed dnat can work ok when external network type is vlan ? 
If ok, does dragonflow distributed dnat support two or more external vlan networks?","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/tags/dragonflow/"},{"name":"dirtributed dnat","slug":"dirtributed-dnat","permalink":"http://vcpu.me/tags/dirtributed-dnat/"},{"name":"vlan external netowrk","slug":"vlan-external-netowrk","permalink":"http://vcpu.me/tags/vlan-external-netowrk/"},{"name":"floating ip","slug":"floating-ip","permalink":"http://vcpu.me/tags/floating-ip/"}]},{"title":"Dragonflow openvswitch流表实现原理","slug":"dragonflow_flows","date":"2018-04-17T10:00:00.000Z","updated":"2018-06-12T07:23:25.000Z","comments":true,"path":"dragonflow_flows/","link":"","permalink":"http://vcpu.me/dragonflow_flows/","excerpt":"","text":"独立唯一的标记dragonflow 2018 所有的云主机接口会被分配到一个全局独立唯一的id;这个标记可以理解成云主机网络路径的一种抽象,每个云主机的网络路径是不同的,二三层通信过程实际是在选择对应通信目标的网络路径,然后进行数据包转发,所以这些id在云主机之间通信起到至关重要的作用。 云主机之间的通信云主机A和云主机B,位于同一个subnet;它们的通信过程是:云主机A发出的数据包会跑到table55,根据目的mac选择网络路径(id设置到reg7); 本机table115会根据选择的网络路径将数据包转发,其它节点则会根据网络路径将数据包发给vxlan隧道 同一个路由,跨子网情况下,table55会根据目的网关的mac匹配,因为是发给网关的流量,设置reg5=1,交给table60匹配目的ip网段,确定子网信息(设置metadata标识),最后根据目的ip和子网信息,匹配table65找寻最后的网络路径。 网络节点qr和云主机通信l2 table根据网关mac判断,设置reg5=1 代表三层流量,l3 table根据mac(也有根据网关的ip) 选择网络路径 DNATdnat属于三层流量,会经过table55 设置reg5=1交给三层table60处理;table60根据源地址发现该流量需要dnat,交给table77;最新的方案中的目的mac是最后决定的,所以需要将流量重定向到table55,table55会帮助引导,最后重新设置mac地址后发出。 回来的流量直接到table55,根据目的mac判断是fip流量后,外网fip流量交给table75分流后进行nat还原;还原后重新回到table55寻找具体网络路径。 SNAT访问外网地址114.114.114.114,云主机发出的流量经过table55,因为目的mac是子网网关mac,所以标记为三层流量交给table60,再交给table70去做snat,最后交给table71将流量发出","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"Dragonflow openvswitch流表实现原理","slug":"Dragonflow-openvswitch流表实现原理","permalink":"http://vcpu.me/tags/Dragonflow-openvswitch流表实现原理/"},{"name":"Dragonflow
流表采用的标记","slug":"Dragonflow-流表采用的标记","permalink":"http://vcpu.me/tags/Dragonflow-流表采用的标记/"}]},{"title":"dragonflow self-service 和 provider通信流量走向图示","slug":"self-service_and_provider_flows","date":"2018-04-11T10:00:00.000Z","updated":"2018-06-12T07:10:23.000Z","comments":true,"path":"self-service_and_provider_flows/","link":"","permalink":"http://vcpu.me/self-service_and_provider_flows/","excerpt":"","text":"vm1 ping the same compute provider vm (qg in br-ex)123456789101112131415['1.1.1.12'] ['fa:16:3e:9a:cd:f1'] 0x82['1.1.1.2'] ['fa:16:3e:21:35:2a'] 0x1e['192.168.56.58'] ['fa:16:3e:8d:ce:3d'] 0x23['192.168.56.59'] ['fa:16:3e:11:d7:3e'] 0x7c['192.168.56.67'] ['fa:16:3e:9f:0b:ef'] 0x85['192.168.56.53'] ['fa:16:3e:21:be:e8'] 0x7d['192.168.56.62'] ['fa:16:3e:26:36:07'] 0x7f['1.1.1.8'] ['fa:16:3e:83:38:2c'] 0x83['192.168.56.50'] ['fa:16:3e:a4:37:0b'] 0x1b['1.1.1.5'] ['fa:16:3e:6c:c0:49'] 0x81['1.1.1.4'] ['fa:16:3e:18:7c:d6'] 0x80['192.168.56.60'] ['fa:16:3e:41:ef:d4'] 0x25['192.168.56.52'] ['fa:16:3e:c6:23:7e'] 0x7e['192.168.56.55'] ['fa:16:3e:d4:60:2d'] 0x84['1.1.1.1'] ['fa:16:3e:07:f8:41'] 0x20 vm1 -> qr123456789101112131415161718192021 cookie=0x0, duration=460.782s, table=0, n_packets=308, n_bytes=29680, idle_age=0, priority=100,in_port=25 actions=load:0x83->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,)cookie=0x0, duration=545.398s, table=0, n_packets=1195, n_bytes=114344, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=588.146s, table=5, n_packets=426, n_bytes=41748, idle_age=0, priority=200,ip,reg6=0x83,dl_src=fa:16:3e:83:38:2c,nw_src=1.1.1.8 actions=resubmit(,10) cookie=0x0, duration=610.596s, table=10, n_packets=448, n_bytes=43904, idle_age=0, priority=100,ip,reg6=0x83 actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=632.912s, table=15, n_packets=933, n_bytes=91434, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) cookie=0x0, duration=640.105s, table=20, 
n_packets=1422, n_bytes=139356, idle_age=1, priority=1 actions=resubmit(,55) cookie=0x0, duration=667.682s, table=55, n_packets=505, n_bytes=49490, idle_age=0, priority=200,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x2->NXM_NX_REG5[],resubmit(,60) cookie=0x0, duration=681.907s, table=60, n_packets=519, n_bytes=50862, idle_age=0, priority=20,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x20->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=702.994s, table=75, n_packets=540, n_bytes=52920, idle_age=1, priority=100,reg7=0x20 actions=resubmit(,105) cookie=0x0, duration=755.881s, table=105, n_packets=751, n_bytes=67174, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=761.733s, table=115, n_packets=616, n_bytes=59416, idle_age=1, priority=100,reg7=0x20 actions=output:4 qg -> provider vm212345678910111213cookie=0x0, duration=779.369s, table=0, n_packets=900, n_bytes=78729, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2)cookie=0x0, duration=787.445s, table=2, n_packets=908, n_bytes=79399, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=815.975s, table=55, n_packets=671, n_bytes=64750, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:26:36:07 actions=load:0x7f->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=834.281s, table=75, n_packets=783, n_bytes=73781, idle_age=0, priority=100,reg7=0x7f actions=resubmit(,105)cookie=0x0, duration=864.773s, table=105, n_packets=706, n_bytes=70633, idle_age=0, priority=100,ip,reg7=0x7f actions=ct(table=110,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=894.181s, table=110, n_packets=1455, n_bytes=142590, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=897.977s, table=115, n_packets=877, n_bytes=79776, idle_age=1, priority=100,reg7=0x7f actions=output:22 provider vm2 -> qg1234567891011121314151617cookie=0x0, 
duration=906.113s, table=0, n_packets=788, n_bytes=74760, idle_age=0, priority=100,in_port=22 actions=load:0x7f->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,)cookie=0x0, duration=915.570s, table=0, n_packets=2345, n_bytes=224804, idle_age=0, priority=1 actions=resubmit(,5)cookie=0x0, duration=938.984s, table=5, n_packets=776, n_bytes=76048, idle_age=0, priority=200,ip,reg6=0x7f,dl_src=fa:16:3e:26:36:07,nw_src=192.168.56.62 actions=resubmit(,10)cookie=0x0, duration=951.183s, table=10, n_packets=788, n_bytes=77224, idle_age=0, priority=100,ip,reg6=0x7f actions=ct(table=15,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=972.483s, table=15, n_packets=1611, n_bytes=157878, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20)cookie=0x0, duration=984.694s, table=20, n_packets=2454, n_bytes=240492, idle_age=1, priority=1 actions=resubmit(,55)cookie=0x0, duration=1028.389s, table=55, n_packets=889, n_bytes=85778, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:41:ef:d4 actions=load:0x25->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=1041.584s, table=75, n_packets=904, n_bytes=87192, idle_age=0, priority=100,reg7=0x25 actions=resubmit(,80)cookie=0x0, duration=1071.120s, table=80, n_packets=934, n_bytes=90076, idle_age=0, priority=100,metadata=0x7 actions=output:1 qr -> vm112345678910111213141516171819 cookie=0x0, duration=1098.192s, table=0, n_packets=963, n_bytes=92806, idle_age=0, priority=100,in_port=4 actions=load:0x20->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=1124.903s, table=0, n_packets=2996, n_bytes=287090, idle_age=1, priority=1 actions=resubmit(,5)cookie=0x0, duration=1180.867s, table=5, n_packets=1048, n_bytes=100968, idle_age=1, priority=200,reg6=0x20 actions=resubmit(,10)cookie=0x0, duration=1207.420s, table=10, n_packets=1073, n_bytes=103362, idle_age=0, priority=1 actions=resubmit(,20) cookie=0x0, duration=1220.656s, table=20, 
n_packets=3162, n_bytes=309876, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=1247.127s, table=55, n_packets=1083, n_bytes=106134, idle_age=1, priority=100,metadata=0x8,dl_dst=fa:16:3e:83:38:2c actions=load:0x83->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=1277.609s, table=75, n_packets=1113, n_bytes=109074, idle_age=1, priority=100,reg7=0x83 actions=resubmit(,105) cookie=0x0, duration=1316.182s, table=105, n_packets=1152, n_bytes=112896, idle_age=0, priority=100,ip,reg7=0x83 actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=1331.477s, table=110, n_packets=2329, n_bytes=228242, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie=0x0, duration=1347.498s, table=115, n_packets=1216, n_bytes=117376, idle_age=0, priority=100,reg7=0x83 actions=output:25 vm1 ping qg (qg in br-int)[‘1.1.1.2’] [‘fa:16:3e:21:35:2a’] 0x1e [‘192.168.56.58’] [‘fa:16:3e:8d:ce:3d’] 0x23 [‘1.1.1.4’] [‘fa:16:3e:b2:81:08’] 0x8a [‘192.168.56.67’] [‘fa:16:3e:9f:0b:ef’] 0x85 [‘192.168.56.52’] [‘fa:16:3e:69:94:3e’] 0x87 [‘192.168.56.50’] [‘fa:16:3e:a4:37:0b’] 0x1b [‘192.168.56.61’] [‘fa:16:3e:9b:c9:e2’] 0x86 [‘192.168.56.51’] [‘fa:16:3e:e3:27:bd’] 0x88 [‘1.1.1.7’] [‘fa:16:3e:ad:73:37’] 0x89 [‘192.168.56.55’] [‘fa:16:3e:d4:60:2d’] 0x84 [‘1.1.1.1’] [‘fa:16:3e:07:f8:41’] 0x20 123456789101112131415161718192021 cookie=0x0, duration=241.057s, table=0, n_packets=247, n_bytes=23870, idle_age=0, priority=100,in_port=28 actions=load:0x8a->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=264.039s, table=0, n_packets=546, n_bytes=52200, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=270.916s, table=5, n_packets=270, n_bytes=26460, idle_age=0, priority=200,ip,reg6=0x8a,dl_src=fa:16:3e:b2:81:08,nw_src=1.1.1.4 actions=resubmit(,10)cookie=0x0, duration=312.605s, table=10, n_packets=312, n_bytes=30576, idle_age=0, priority=100,ip,reg6=0x8a 
actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=332.121s, table=15, n_packets=329, n_bytes=32242, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) cookie=0x0, duration=346.371s, table=20, n_packets=686, n_bytes=67228, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=368.640s, table=55, n_packets=367, n_bytes=35966, idle_age=0, priority=200,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x2->NXM_NX_REG5[],resubmit(,60) cookie=0x0, duration=377.877s, table=60, n_packets=167, n_bytes=16366, idle_age=0, priority=20,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x20->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=393.758s, table=75, n_packets=183, n_bytes=17934, idle_age=0, priority=100,reg7=0x20 actions=resubmit(,105) cookie=0x0, duration=417.581s, table=105, n_packets=607, n_bytes=52082, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=429.101s, table=115, n_packets=229, n_bytes=21826, idle_age=0, priority=100,reg7=0x20 actions=output:4 12345678910111213141516171819 cookie=0x0, duration=528.682s, table=0, n_packets=331, n_bytes=31598, idle_age=1, priority=100,in_port=4 actions=load:0x20->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,)cookie=0x0, duration=546.593s, table=0, n_packets=1128, n_bytes=108228, idle_age=0, priority=1 actions=resubmit(,5)cookie=0x0, duration=761.598s, table=5, n_packets=568, n_bytes=54600, idle_age=0, priority=200,reg6=0x20 actions=resubmit(,10) cookie=0x0, duration=772.279s, table=10, n_packets=799, n_bytes=76566, idle_age=0, priority=1 actions=resubmit(,20) cookie=0x0, duration=782.474s, table=20, n_packets=1556, n_bytes=152488, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=801.558s, table=55, n_packets=807, n_bytes=79338, idle_age=0, priority=100,metadata=0x8,dl_dst=fa:16:3e:b2:81:08 actions=load:0x8a->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=823.892s, table=75, n_packets=830, 
n_bytes=81592, idle_age=0, priority=100,reg7=0x8a actions=resubmit(,105)cookie=0x0, duration=836.139s, table=105, n_packets=842, n_bytes=82768, idle_age=0, priority=100,ip,reg7=0x8a actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=846.463s, table=110, n_packets=842, n_bytes=82516, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=854.035s, table=115, n_packets=871, n_bytes=84294, idle_age=0, priority=100,reg7=0x8a actions=output:28 vm1 ping provider vm2 (qg in br-int) vm1 -> qr123456789101112131415161718192021cookie=0x0, duration=241.057s, table=0, n_packets=247, n_bytes=23870, idle_age=0, priority=100,in_port=28 actions=load:0x8a->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=264.039s, table=0, n_packets=546, n_bytes=52200, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=270.916s, table=5, n_packets=270, n_bytes=26460, idle_age=0, priority=200,ip,reg6=0x8a,dl_src=fa:16:3e:b2:81:08,nw_src=1.1.1.4 actions=resubmit(,10)cookie=0x0, duration=312.605s, table=10, n_packets=312, n_bytes=30576, idle_age=0, priority=100,ip,reg6=0x8a actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=332.121s, table=15, n_packets=329, n_bytes=32242, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) cookie=0x0, duration=346.371s, table=20, n_packets=686, n_bytes=67228, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=368.640s, table=55, n_packets=367, n_bytes=35966, idle_age=0, priority=200,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x2->NXM_NX_REG5[],resubmit(,60) cookie=0x0, duration=377.877s, table=60, n_packets=167, n_bytes=16366, idle_age=0, priority=20,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x20->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=393.758s, table=75, n_packets=183, n_bytes=17934, idle_age=0, priority=100,reg7=0x20 actions=resubmit(,105) cookie=0x0, 
duration=417.581s, table=105, n_packets=607, n_bytes=52082, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=429.101s, table=115, n_packets=229, n_bytes=21826, idle_age=0, priority=100,reg7=0x20 actions=output:4 qg -> vm21234567891011121314151617 cookie=0x0, duration=2374.818s, table=0, n_packets=425, n_bytes=40754, idle_age=0, priority=100,in_port=26 actions=load:0x86->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=2424.908s, table=5, n_packets=476, n_bytes=45696, idle_age=0, priority=200,reg6=0x86 actions=resubmit(,10) cookie=0x0, duration=2456.728s, table=10, n_packets=2806, n_bytes=270228, idle_age=0, priority=1 actions=resubmit(,20) cookie=0x0, duration=2463.825s, table=20, n_packets=5470, n_bytes=536060, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=2485.994s, table=55, n_packets=309, n_bytes=30282, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:69:94:3e actions=load:0x87->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=2509.368s, table=75, n_packets=679, n_bytes=57177, idle_age=0, priority=100,reg7=0x87 actions=resubmit(,105) cookie=0x0, duration=2520.571s, table=105, n_packets=364, n_bytes=38581, idle_age=0, priority=100,ip,reg7=0x87 actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=2529.962s, table=110, n_packets=2866, n_bytes=280868, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie=0x0, duration=2564.873s, table=115, n_packets=735, n_bytes=58610, idle_age=0, priority=100,reg7=0x87 actions=output:27 vm2 ->qg12345678910111213141516171819 cookie=0x0, duration=2602.476s, table=0, n_packets=439, n_bytes=42238, idle_age=0, priority=100,in_port=27 actions=load:0x87->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=2625.452s, table=0, n_packets=6302, n_bytes=607104, idle_age=0, priority=1 actions=resubmit(,5)cookie=0x0, duration=2635.327s, table=5, 
n_packets=458, n_bytes=44884, idle_age=0, priority=200,ip,reg6=0x87,dl_src=fa:16:3e:69:94:3e,nw_src=192.168.56.52 actions=resubmit(,10) cookie=0x0, duration=2646.544s, table=10, n_packets=469, n_bytes=45962, idle_age=0, priority=100,ip,reg6=0x87 actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=2662.014s, table=15, n_packets=3130, n_bytes=306740, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) cookie=0x0, duration=2670.162s, table=20, n_packets=6294, n_bytes=616812, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=2695.409s, table=55, n_packets=735, n_bytes=71694, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:9b:c9:e2 actions=load:0x86->NXM_NX_REG7[],resubmit(,75 cookie=0x0, duration=2712.679s, table=75, n_packets=1134, n_bytes=100913, idle_age=0, priority=100,reg7=0x86 actions=resubmit(,105) cookie=0x0, duration=2732.390s, table=105, n_packets=4017, n_bytes=369881, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=2738.911s, table=115, n_packets=1181, n_bytes=104361, idle_age=0, priority=100,reg7=0x86 actions=output:26 qr->vm112345678910111213141516171819 cookie=0x0, duration=2756.115s, table=0, n_packets=2613, n_bytes=251650, idle_age=1, priority=100,in_port=4 actions=load:0x20->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=2779.140s, table=0, n_packets=6936, n_bytes=668228, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=2831.453s, table=5, n_packets=2690, n_bytes=259084, idle_age=1, priority=200,reg6=0x20 actions=resubmit(,10) cookie=0x0, duration=2855.438s, table=10, n_packets=3628, n_bytes=349328, idle_age=0, priority=1 actions=resubmit(,20) cookie=0x0, duration=2866.720s, table=20, n_packets=7078, n_bytes=693644, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=2892.956s, table=55, n_packets=2889, n_bytes=283374, idle_age=1, priority=100,metadata=0x8,dl_dst=fa:16:3e:b2:81:08 
actions=load:0x8a->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=2907.217s, table=75, n_packets=2904, n_bytes=284844, idle_age=0, priority=100,reg7=0x8a actions=resubmit(,105) cookie=0x0, duration=2915.352s, table=105, n_packets=2912, n_bytes=285628, idle_age=0, priority=100,ip,reg7=0x8a actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=2924.665s, table=110, n_packets=3654, n_bytes=358092, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=2933.173s, table=115, n_packets=3002, n_bytes=289716, idle_age=0, priority=100,reg7=0x8a actions=output:28 vm1 ping fip (qg in br-int) vm1 -> qr123456789101112131415161718192021cookie=0x0, duration=241.057s, table=0, n_packets=247, n_bytes=23870, idle_age=0, priority=100,in_port=28 actions=load:0x8a->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=264.039s, table=0, n_packets=546, n_bytes=52200, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=270.916s, table=5, n_packets=270, n_bytes=26460, idle_age=0, priority=200,ip,reg6=0x8a,dl_src=fa:16:3e:b2:81:08,nw_src=1.1.1.4 actions=resubmit(,10)cookie=0x0, duration=312.605s, table=10, n_packets=312, n_bytes=30576, idle_age=0, priority=100,ip,reg6=0x8a actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=332.121s, table=15, n_packets=329, n_bytes=32242, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) cookie=0x0, duration=346.371s, table=20, n_packets=686, n_bytes=67228, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=368.640s, table=55, n_packets=367, n_bytes=35966, idle_age=0, priority=200,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x2->NXM_NX_REG5[],resubmit(,60) cookie=0x0, duration=377.877s, table=60, n_packets=167, n_bytes=16366, idle_age=0, priority=20,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x20->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=393.758s, 
table=75, n_packets=183, n_bytes=17934, idle_age=0, priority=100,reg7=0x20 actions=resubmit(,105) cookie=0x0, duration=417.581s, table=105, n_packets=607, n_bytes=52082, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=429.101s, table=115, n_packets=229, n_bytes=21826, idle_age=0, priority=100,reg7=0x20 actions=output:4 qg -> vm21234567891011121314151617181920212223 cookie=0x0, duration=2374.818s, table=0, n_packets=425, n_bytes=40754, idle_age=0, priority=100,in_port=26 actions=load:0x86->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=2424.908s, table=5, n_packets=476, n_bytes=45696, idle_age=0, priority=200,reg6=0x86 actions=resubmit(,10) cookie=0x0, duration=2456.728s, table=10, n_packets=2806, n_bytes=270228, idle_age=0, priority=1 actions=resubmit(,20) cookie=0x0, duration=2463.825s, table=20, n_packets=5470, n_bytes=536060, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=366.590s, table=55, n_packets=366, n_bytes=35868, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:8d:ce:3d actions=load:0x23->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=390.011s, table=75, n_packets=393, n_bytes=39384, idle_age=0, priority=200,reg7=0x23 actions=resubmit(,76) cookie=0x0, duration=411.413s, table=76, n_packets=413, n_bytes=41400, idle_age=1, priority=100,ip,reg7=0x23 actions=dec_ttl,mod_dl_src:fa:16:3e:07:f8:41,mod_dl_dst:fa:16:3e:39:1d:38,mod_nw_dst:1.1.1.13,load:0x8b->NXM_NX_REG7[],load:0x8->OXM_OF_METADATA[],resubmit(,55) cookie=0x0, duration=432.829s, table=55, n_packets=435, n_bytes=43556, idle_age=0, priority=100,metadata=0x8,dl_dst=fa:16:3e:39:1d:38 actions=load:0x8b->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=442.983s, table=75, n_packets=445, n_bytes=44536, idle_age=0, priority=100,reg7=0x8b actions=resubmit(,105) cookie=0x0, duration=517.155s, table=105, n_packets=519, n_bytes=51788, idle_age=0, priority=100,ip,reg7=0x8b 
actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=551.607s, table=110, n_packets=1094, n_bytes=107212, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=536.182s, table=115, n_packets=549, n_bytes=53018, idle_age=0, priority=100,reg7=0x8b actions=output:29 An extra note: traffic from vm1 -> fip is forwarded entirely inside br-int, but the return traffic passes through br-ex and comes back via the external bridge; in other words the forward and return paths are not the same.","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"draonflow 流表结构","slug":"draonflow-流表结构","permalink":"http://vcpu.me/tags/draonflow-流表结构/"}]},{"title":"dragonflow provider流量走向图示","slug":"dragonflow_provider_flow","date":"2018-04-09T10:00:00.000Z","updated":"2018-06-12T06:47:31.000Z","comments":true,"path":"dragonflow_provider_flow/","link":"","permalink":"http://vcpu.me/dragonflow_provider_flow/","excerpt":"","text":"ARP: dragonflow handles provider ARP in two modes, proxied and non-proxied. Proxied means the controller answers the request; non-proxied means the VM answers it directly. The ARP proxy is implemented by the controller-installed flow entries: every ARP request sent by a VM is intercepted and answered; dragonflow requires that all VM ARP requests be intercepted and answered by itself. A request that is not intercepted is judged to target a non-VM address and is sent to the external bridge. For ARP traffic coming in from the external network, matching is done on the destination MAC: if it belongs to a VM, the traffic is forwarded to that VM; if the MAC is not recognized, it is dropped in table 75. vm1 arping gw, broadcast: cookie=0x0, duration=607.261s, table=0, n_packets=538, n_bytes=22596, idle_age=0, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=647.825s, table=0, n_packets=1102, n_bytes=47508, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=654.478s, table=5, n_packets=583, n_bytes=24486, idle_age=0, priority=200,arp,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,arp_spa=192.168.56.53,arp_sha=fa:16:3e:21:be:e8 actions=resubmit(,20) cookie=0x0, duration=665.119s, table=20, n_packets=589, n_bytes=24738, idle_age=0, priority=100,arp actions=resubmit(,25) cookie=0x0, duration=672.231s, table=25, n_packets=596, n_bytes=25032, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=689.054s,
table=55, n_packets=728, n_bytes=34097, idle_age=0, hard_age=688, priority=100,metadata=0x7,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=load:0x7f->NXM_NX_REG7[],resubmit(,75),load:0x7d->NXM_NX_REG7[],resubmit(,75),load:0->NXM_NX_REG7[],resubmit(,75) Broadcast traffic paths: vm1 (drop), vm2, default: cookie=0x0, duration=770.378s, table=75, n_packets=698, n_bytes=29316, idle_age=1, priority=200,reg6=0x7d,reg7=0x7d,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=drop cookie=0x0, duration=777.680s, table=75, n_packets=818, n_bytes=37945, idle_age=1, priority=100,reg7=0x7f actions=resubmit(,105) cookie=0x0, duration=733.759s, table=75, n_packets=662, n_bytes=27804, idle_age=0, priority=50,metadata=0x7 actions=resubmit(,80) cookie=0x0, duration=742.917s, table=80, n_packets=671, n_bytes=28182, idle_age=0, priority=100,metadata=0x7 actions=output:1 cookie=0x0, duration=923.834s, table=0, n_packets=1026, n_bytes=63133, idle_age=1, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=944.640s, table=2, n_packets=1043, n_bytes=64153, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=965.539s, table=55, n_packets=894, n_bytes=53640, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:21:be:e8 actions=load:0x7d->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=1023.530s, table=75, n_packets=1081, n_bytes=66433, idle_age=0, priority=100,reg7=0x7d actions=resubmit(,105) cookie=0x0, duration=1064.648s, table=105, n_packets=2236, n_bytes=116376, idle_age=0, priority=1 actions=resubmit(,115)cookie=0x0, duration=1101.283s, table=115, n_packets=1170, n_bytes=70200, idle_age=0, priority=100,reg7=0x7d actions=output:16 ARP broadcast traffic sent by a VM is handed to table 75; because it is broadcast, the packet is replicated to the other VMs in the same segment and to the patch port connected to br-ex. For a non-flat external network, how is broadcast handled?
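The table=55 broadcast entry shown above does not flood a port: it replicates the packet by loading each local port's network-path id into reg7 and resubmitting to table 75 once per destination. A minimal Python sketch of that action-list pattern (illustrative only, not Dragonflow source; the ids 0x7f, 0x7d and the default 0 are taken from the dump, and the function name is made up):

```python
# Illustrative sketch, not Dragonflow code: model the broadcast action list
# load:0x7f->REG7,resubmit(,75),load:0x7d->REG7,resubmit(,75),load:0->REG7,resubmit(,75)
def broadcast_to_ports(packet, port_ids):
    # one copy of the packet per destination network-path id (reg7);
    # each copy would then be resubmitted to table 75 independently
    copies = []
    for reg7 in port_ids:
        copies.append({'reg7': reg7, 'packet': packet})
    return copies

# ids taken from the table=55 flow above; 0 selects the default/br-ex path
copies = broadcast_to_ports('arp-request', [0x7f, 0x7d, 0x0])
assert [c['reg7'] for c in copies] == [0x7f, 0x7d, 0x0]
```

This is why the per-port table 75 rules can treat each copy separately: one copy is dropped back at the sender (reg6=reg7), the others are delivered or handed to the br-ex patch port.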
单播123456789101112131415cookie=0x0, duration=1601.401s, table=0, n_packets=1528, n_bytes=64176, idle_age=0, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=1623.738s, table=0, n_packets=2073, n_bytes=88290, idle_age=1, priority=1 actions=resubmit(,5) cookie=0x0, duration=1631.400s, table=5, n_packets=1555, n_bytes=65310, idle_age=1, priority=200,arp,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,arp_spa=192.168.56.53,arp_sha=fa:16:3e:21:be:e8 actions=resubmit(,20) cookie=0x0, duration=1644.069s, table=20, n_packets=1564, n_bytes=65688, idle_age=0, priority=100,arp actions=resubmit(,25)cookie=0x0, duration=1665.411s, table=25, n_packets=1585, n_bytes=66570, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=1700.542s, table=55, n_packets=639, n_bytes=30510, idle_age=1, priority=70,metadata=0x7,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,75) cookie=0x0, duration=1765.619s, table=75, n_packets=1690, n_bytes=70980, idle_age=0, priority=50,metadata=0x7 actions=resubmit(,80) cookie=0x0, duration=1803.244s, table=80, n_packets=1727, n_bytes=72534, idle_age=0, priority=100,metadata=0x7 actions=output:1 1234567891011cookie=0x0, duration=1817.453s, table=0, n_packets=2198, n_bytes=135047, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=1830.150s, table=2, n_packets=2206, n_bytes=135527, idle_age=1, priority=1 actions=resubmit(,55) cookie=0x0, duration=1840.858s, table=55, n_packets=1765, n_bytes=105900, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:21:be:e8 actions=load:0x7d->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=1856.090s, table=75, n_packets=2033, n_bytes=125147, idle_age=0, priority=100,reg7=0x7d actions=resubmit(,105) cookie=0x0, duration=1876.903s, table=105, n_packets=3461, n_bytes=186258, idle_age=0, priority=1 
actions=resubmit(,115) cookie=0x0, duration=1881.039s, table=115, n_packets=2044, n_bytes=122640, idle_age=0, priority=100,reg7=0x7d actions=output:16 vm1 arping qg, broadcast/unicast: cookie=0x0, duration=2056.929s, table=0, n_packets=1966, n_bytes=82572, idle_age=0, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=2114.828s, table=0, n_packets=2547, n_bytes=108198, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=2118.433s, table=5, n_packets=2025, n_bytes=85050, idle_age=0, priority=200,arp,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,arp_spa=192.168.56.53,arp_sha=fa:16:3e:21:be:e8 actions=resubmit(,20) cookie=0x0, duration=2133.137s, table=20, n_packets=2035, n_bytes=85470, idle_age=0, priority=100,arp actions=resubmit(,25) cookie=0x0, duration=2138.258s, table=25, n_packets=161, n_bytes=6762, idle_age=1, priority=100,arp,metadata=0x7,arp_tpa=192.168.56.60,arp_op=1 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:41:ef:d4,load:0xfa163e41efd4->NXM_NX_ARP_SHA[],load:0xc0a8383c->NXM_OF_ARP_SPA[],move:NXM_NX_REG6[]->NXM_NX_REG7[],resubmit(,115) cookie=0x0, duration=2162.722s, table=115, n_packets=2352, n_bytes=137754, idle_age=0, priority=100,reg7=0x7d actions=output:16 vm1 — controller — qg: the VM's ARP request passes through table 25, which builds the reply packet directly and sends it back to the requesting VM. vm1 arping vm2, broadcast: cookie=0x0, duration=2399.594s, table=0, n_packets=2273, n_bytes=95466, idle_age=0, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=2425.953s, table=0, n_packets=2822, n_bytes=119748, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=2431.584s, table=5, n_packets=2302, n_bytes=96684, idle_age=0,
priority=200,arp,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,arp_spa=192.168.56.53,arp_sha=fa:16:3e:21:be:e8 actions=resubmit(,20)cookie=0x0, duration=2449.325s, table=20, n_packets=2315, n_bytes=97230, idle_age=1, priority=100,arp actions=resubmit(,25) cookie=0x0, duration=2452.901s, table=25, n_packets=91, n_bytes=3822, idle_age=1, priority=100,arp,metadata=0x7,arp_tpa=192.168.56.62,arp_op=1 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:26:36:07,load:0xfa163e263607->NXM_NX_ARP_SHA[],load:0xc0a8383e->NXM_OF_ARP_SPA[],move:NXM_NX_REG6[]->NXM_NX_REG7[],resubmit(,115)cookie=0x0, duration=2476.824s, table=115, n_packets=2695, n_bytes=153330, idle_age=0, priority=100,reg7=0x7d actions=output:16 vm1 — controller — vm1: the VM's ARP request passes through table 25, which builds the reply packet directly and sends it back to the requesting VM. gw arping vm1, broadcast: cookie=0x0, duration=2797.647s, table=0, n_packets=2727, n_bytes=168129, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=2828.673s, table=2, n_packets=2753, n_bytes=169689, idle_age=1, priority=1 actions=resubmit(,55) cookie=0x0, duration=2852.608s, table=55, n_packets=1742, n_bytes=87627, idle_age=0, hard_age=2851, priority=100,metadata=0x7,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=load:0x7f->NXM_NX_REG7[],resubmit(,75),load:0x7d->NXM_NX_REG7[],resubmit(,75),load:0->NXM_NX_REG7[],resubmit(,75) vm1, vm2, default: cookie=0x0, duration=3043.548s, table=75, n_packets=2629, n_bytes=162259, idle_age=1, priority=100,reg7=0x7d actions=resubmit(,105) (sent to vm) cookie=0x0, duration=3084.365s, table=75, n_packets=1992, n_bytes=102637, idle_age=0, priority=100,reg7=0x7f actions=resubmit(,105) (sent to vm) cookie=0x0, duration=3110.547s, table=75, n_packets=1186, n_bytes=75679, idle_age=0, priority=51,reg6=0,metadata=0x7 actions=drop (broadcast destined for the external network is dropped here) cookie=0x0,
duration=3150.668s, table=105, n_packets=4789, n_bytes=265938, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=3150.738s, table=115, n_packets=3308, n_bytes=188364, idle_age=0, priority=100,reg7=0x7d actions=output:16 cookie=0x0, duration=3150.726s, table=115, n_packets=2055, n_bytes=101898, idle_age=0, priority=100,reg7=0x7f actions=output:22 123456789101112131415 cookie=0x0, duration=3183.224s, table=0, n_packets=505, n_bytes=21210, idle_age=0, priority=100,in_port=22 actions=load:0x7f->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=3234.066s, table=0, n_packets=3526, n_bytes=149316, idle_age=0, priority=1 actions=resubmit(,5)cookie=0x0, duration=3241.916s, table=5, n_packets=562, n_bytes=23604, idle_age=0, priority=200,arp,reg6=0x7f,dl_src=fa:16:3e:26:36:07,arp_spa=192.168.56.62,arp_sha=fa:16:3e:26:36:07 actions=resubmit(,20) cookie=0x0, duration=3267.624s, table=20, n_packets=3029, n_bytes=127218, idle_age=1, priority=100,arp actions=resubmit(,25) cookie=0x0, duration=3273.763s, table=25, n_packets=2473, n_bytes=103866, idle_age=1, priority=1 actions=resubmit(,55) cookie=0x0, duration=3315.921s, table=55, n_packets=1733, n_bytes=80022, idle_age=0, priority=70,metadata=0x7,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,75) 单播流量发给外网 cookie=0x0, duration=3340.281s, table=75, n_packets=2544, n_bytes=106848, idle_age=1, priority=50,metadata=0x7 actions=resubmit(,80) cookie=0x0, duration=3347.397s, table=80, n_packets=2551, n_bytes=107142, idle_age=1, priority=100,metadata=0x7 actions=output:1 单播1234567891011cookie=0x0, duration=3547.656s, table=0, n_packets=3734, n_bytes=228559, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=3563.403s, table=2, n_packets=3748, n_bytes=229399, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=3606.613s, 
table=55, n_packets=86, n_bytes=5160, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:26:36:07 actions=load:0x7f->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=3625.110s, table=75, n_packets=2641, n_bytes=141577, idle_age=0, priority=100,reg7=0x7f actions=resubmit(,105) cookie=0x0, duration=3650.798s, table=105, n_packets=5861, n_bytes=330258, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=3664.045s, table=115, n_packets=2668, n_bytes=138678, idle_age=0, priority=100,reg7=0x7f actions=output:22 从外网流入内网,根据目的mac判定要发给云主机的reg7,然后再具体output到22口 123456789101112131415 cookie=0x0, duration=3700.659s, table=0, n_packets=1020, n_bytes=42840, idle_age=0, priority=100,in_port=22 actions=load:0x7f->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=3717.888s, table=0, n_packets=4007, n_bytes=169518, idle_age=0, priority=1 actions=resubmit(,5)cookie=0x0, duration=3724.775s, table=5, n_packets=1042, n_bytes=43764, idle_age=1, priority=200,arp,reg6=0x7f,dl_src=fa:16:3e:26:36:07,arp_spa=192.168.56.62,arp_sha=fa:16:3e:26:36:07 actions=resubmit(,20) cookie=0x0, duration=3753.486s, table=20, n_packets=3513, n_bytes=147546, idle_age=0, priority=100,arp actions=resubmit(,25)cookie=0x0, duration=3765.676s, table=25, n_packets=2963, n_bytes=124446, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=3834.332s, table=55, n_packets=2354, n_bytes=108084, idle_age=0, priority=70,metadata=0x7,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,75) cookie=0x0, duration=3855.686s, table=75, n_packets=3052, n_bytes=128184, idle_age=0, priority=50,metadata=0x7 actions=resubmit(,80)cookie=0x0, duration=3875.989s, table=80, n_packets=3072, n_bytes=129024, idle_age=1, priority=100,metadata=0x7 actions=output:1 向外发包单播 qg arping vm1广播外网发给云主机方向流量12345cookie=0x0, duration=4174.765s, table=0, n_packets=4364, n_bytes=266842, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff 
actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=4203.729s, table=2, n_packets=4388, n_bytes=267850, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=4225.600s, table=55, n_packets=2664, n_bytes=142522, idle_age=0, hard_age=4224, priority=100,metadata=0x7,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=load:0x7f->NXM_NX_REG7[],resubmit(,75),load:0x7d->NXM_NX_REG7[],resubmit(,75),load:0->NXM_NX_REG7[],resubmit(,75) vm1 vm2 外网12345678910111213cookie=0x0, duration=4264.216s, table=75, n_packets=3405, n_bytes=207700, idle_age=0, priority=100,reg7=0x7d actions=resubmit(,105) vm1 cookie=0x0, duration=4277.591s, table=75, n_packets=3280, n_bytes=178546, idle_age=1, priority=100,reg7=0x7f actions=resubmit(,105). vm2 cookie=0x0, duration=4313.952s, table=75, n_packets=2112, n_bytes=129220, idle_age=0, priority=51,reg6=0,metadata=0x7 actions=drop 丢弃外网数据 cookie=0x0, duration=4493.285s, table=105, n_packets=7169, n_bytes=395490, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=4497.412s, table=115, n_packets=4230, n_bytes=236844, idle_age=0, priority=100,reg7=0x7d actions=output:16 cookie=0x0, duration=4497.399s, table=115, n_packets=3523, n_bytes=183138, idle_age=0, priority=100,reg7=0x7f actions=output:22 cookie=0x0, duration=4666.052s, table=55, n_packets=547, n_bytes=22974, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:41:ef:d4 actions=load:0x25->NXM_NX_REG7[],resubmit(,75) 123456789101112131415cookie=0x0, duration=4515.693s, table=0, n_packets=1788, n_bytes=75096, idle_age=0, priority=100,in_port=22 actions=load:0x7f->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=4539s, table=0, n_packets=4781, n_bytes=202026, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=4547.893s, table=5, n_packets=1818, n_bytes=76356, idle_age=0, 
priority=200,arp,reg6=0x7f,dl_src=fa:16:3e:26:36:07,arp_spa=192.168.56.62,arp_sha=fa:16:3e:26:36:07 actions=resubmit(,20) cookie=0x0, duration=4559.372s, table=20, n_packets=4271, n_bytes=179382, idle_age=0, priority=100,arp actions=resubmit(,25) cookie=0x0, duration=4569.552s, table=25, n_packets=3719, n_bytes=156198, idle_age=1, priority=1 actions=resubmit(,55) cookie=0x0, duration=4737.192s, table=55, n_packets=618, n_bytes=25956, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:41:ef:d4 actions=load:0x25->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=4601.053s, table=75, n_packets=482, n_bytes=20244, idle_age=0, priority=100,reg7=0x25 actions=resubmit(,80) cookie=0x0, duration=4612.758s, table=80, n_packets=3767, n_bytes=158214, idle_age=0, priority=100,metadata=0x7 actions=output:1 单播1234567891011cookie=0x0, duration=4799.759s, table=0, n_packets=5245, n_bytes=309830, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=4812.432s, table=2, n_packets=5253, n_bytes=310166, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=4835.325s, table=55, n_packets=602, n_bytes=35202, idle_age=1, priority=100,metadata=0x7,dl_dst=fa:16:3e:26:36:07 actions=load:0x7f->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=4848.726s, table=75, n_packets=3955, n_bytes=210128, idle_age=0, priority=100,reg7=0x7f actions=resubmit(,105) cookie=0x0, duration=4869.354s, table=105, n_packets=7946, n_bytes=430068, idle_age=0, priority=1 actions=resubmit(,115) cookie=0x0, duration=4870.397s, table=115, n_packets=3948, n_bytes=201960, idle_age=0, priority=100,reg7=0x7f actions=output:22 123456789101112131415cookie=0x0, duration=4883.625s, table=0, n_packets=2154, n_bytes=90468, idle_age=0, priority=100,in_port=22 actions=load:0x7f->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=4915.097s, table=0, 
n_packets=5155, n_bytes=217734, idle_age=1, priority=1 actions=resubmit(,5) cookie=0x0, duration=4918.884s, table=5, n_packets=2187, n_bytes=91854, idle_age=1, priority=200,arp,reg6=0x7f,dl_src=fa:16:3e:26:36:07,arp_spa=192.168.56.62,arp_sha=fa:16:3e:26:36:07 actions=resubmit(,20) cookie=0x0, duration=4934.423s, table=20, n_packets=4645, n_bytes=195090, idle_age=0, priority=100,arp actions=resubmit(,25)cookie=0x0, duration=4948.657s, table=25, n_packets=4097, n_bytes=172074, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=4964.890s, table=55, n_packets=845, n_bytes=35490, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:41:ef:d4 actions=load:0x25->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=5000.459s, table=75, n_packets=880, n_bytes=36960, idle_age=0, priority=100,reg7=0x25 actions=resubmit(,80) cookie=0x0, duration=5013.163s, table=80, n_packets=4166, n_bytes=174972, idle_age=0, priority=100,metadata=0x7 actions=output:1 IP云主机访问外网,会走默认策略(根据单播)判别,默认是外网网段,发给外网网口 vm1 ping qg 在l2根据目的mac判别,如果是qg地址,就将流量发给外网1234567891011121314151617cookie=0x0, duration=914.122s, table=0, n_packets=139, n_bytes=13174, idle_age=0, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=926.621s, table=0, n_packets=392, n_bytes=33376, idle_age=1, priority=1 actions=resubmit(,5) cookie=0x0, duration=937.802s, table=5, n_packets=154, n_bytes=15092, idle_age=1, priority=200,ip,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,nw_src=192.168.56.53 actions=resubmit(,10) cookie=0x0, duration=952.053s, table=10, n_packets=169, n_bytes=16562, idle_age=0, priority=100,ip,reg6=0x7d actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=970.383s, table=15, n_packets=264, n_bytes=25872, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) cookie=0x0, duration=982.595s, table=20, n_packets=359, n_bytes=35182, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, 
duration=1010.847s, table=55, n_packets=234, n_bytes=22596, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:41:ef:d4 actions=load:0x25->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=1120.621s, table=75, n_packets=343, n_bytes=33110, idle_age=0, priority=100,reg7=0x25 actions=resubmit(,80) cookie=0x0, duration=1129.899s, table=80, n_packets=352, n_bytes=33992, idle_age=1, priority=100,metadata=0x7 actions=output:1 12345678910111213 cookie=0x0, duration=1153.258s, table=0, n_packets=661, n_bytes=55121, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=1163.459s, table=2, n_packets=668, n_bytes=55807, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=1179.678s, table=55, n_packets=404, n_bytes=38976, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:21:be:e8 actions=load:0x7d->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=1196.945s, table=75, n_packets=561, n_bytes=50719, idle_age=0, priority=100,reg7=0x7d actions=resubmit(,105) cookie=0x0, duration=1218.256s, table=105, n_packets=436, n_bytes=44139, idle_age=1, priority=100,ip,reg7=0x7d actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=1231.491s, table=110, n_packets=523, n_bytes=51254, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie=0x0, duration=1260.884s, table=115, n_packets=657, n_bytes=56892, idle_age=0, priority=100,reg7=0x7d actions=output:16 vm1 ping gw 如果被访问的节点属于单播,table55在处理其时候,会将该流量交给table75进行路径选择,该流量最终会交给外网发出(因为没有设置标记该流量肯定不在内网节点中)1234567891011121314151617cookie=0x0, duration=2609.046s, table=0, n_packets=1778, n_bytes=168644, idle_age=0, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=2627.708s, table=0, n_packets=2036, n_bytes=189392, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, 
duration=2634.802s, table=5, n_packets=1703, n_bytes=166894, idle_age=1, priority=200,ip,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,nw_src=192.168.56.53 actions=resubmit(,10) cookie=0x0, duration=2642.923s, table=10, n_packets=1711, n_bytes=167678, idle_age=1, priority=100,ip,reg6=0x7d actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=2665.294s, table=15, n_packets=1809, n_bytes=177282, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20) cookie=0x0, duration=2675.464s, table=20, n_packets=1903, n_bytes=186494, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=2694.717s, table=55, n_packets=600, n_bytes=44410, idle_age=0, priority=70,metadata=0x7,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,75)cookie=0x0, duration=2720.125s, table=75, n_packets=266, n_bytes=25284, idle_age=1, priority=50,metadata=0x7 actions=resubmit(,80) cookie=0x0, duration=2726.210s, table=80, n_packets=1853, n_bytes=178346, idle_age=0, priority=100,metadata=0x7 actions=output:1 12345678910111213 cookie=0x0, duration=2737.383s, table=0, n_packets=2612, n_bytes=227898, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=2747.577s, table=2, n_packets=2625, n_bytes=229756, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=2758.738s, table=55, n_packets=1887, n_bytes=181854, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:21:be:e8 actions=load:0x7d->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=2773.966s, table=75, n_packets=2282, n_bytes=210278, idle_age=0, priority=100,reg7=0x7d actions=resubmit(,105) cookie=0x0, duration=2786.128s, table=105, n_packets=1869, n_bytes=186632, idle_age=1, priority=100,ip,reg7=0x7d actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=2802.404s, table=110, n_packets=1949, n_bytes=191002, idle_age=1, 
priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie=0x0, duration=2816.284s, table=115, n_packets=2355, n_bytes=211292, idle_age=0, priority=100,reg7=0x7d actions=output:16 vm1 ping vm2 其实这部分包含两种情况,这里只展示了一种情况,就是两个provider的云主机在同一个计算节点的br-int上; 另外一种情况是,vm2位于另一个桥上;那么处理有什么差别? 流量流经table55会按照目的mac进行查找,因为远程主机的mac也含有表项目,也可以查找流量路径标记 但是table75并没有针对于该流量路径标记的处理,该流量会走默认处理路径(被交给外网处理) gw ping vm1根据mac查找具体的云主机12345678910111213cookie=0x0, duration=1713.178s, table=0, n_packets=591, n_bytes=46402, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=1726.168s, table=2, n_packets=604, n_bytes=47638, idle_age=0, priority=1 actions=resubmit(,55)cookie=0x0, duration=1746.737s, table=55, n_packets=1232, n_bytes=120356, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:21:be:e8 actions=load:0x7d->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=1772.110s, table=75, n_packets=1434, n_bytes=136768, idle_age=0, priority=100,reg7=0x7d actions=resubmit(,105) cookie=0x0, duration=1790.411s, table=105, n_packets=1278, n_bytes=127978, idle_age=0, priority=100,ip,reg7=0x7d actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=1805.436s, table=110, n_packets=2266, n_bytes=222068, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie=0x0, duration=1816.578s, table=115, n_packets=1511, n_bytes=139016, idle_age=0, priority=100,reg7=0x7d actions=output:16 1234567891011121314151617 cookie=0x0, duration=1834.912s, table=0, n_packets=1361, n_bytes=130466, idle_age=0, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=1846.134s, table=0, n_packets=2771, n_bytes=245630, idle_age=1, priority=1 actions=resubmit(,5)cookie=0x0, duration=1856.520s, table=5, n_packets=1331, n_bytes=130438, idle_age=0, 
priority=200,ip,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,nw_src=192.168.56.53 actions=resubmit(,10) cookie=0x0, duration=1866.695s, table=10, n_packets=1341, n_bytes=131418, idle_age=0, priority=100,ip,reg6=0x7d actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=1878.661s, table=15, n_packets=2340, n_bytes=229320, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20)cookie=0x0, duration=1888.822s, table=20, n_packets=2351, n_bytes=230398, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=1903.248s, table=55, n_packets=652, n_bytes=53594, idle_age=0, priority=70,metadata=0x7,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,75) cookie=0x0, duration=1937.775s, table=75, n_packets=439, n_bytes=42070, idle_age=0, priority=50,metadata=0x7 actions=resubmit(,80) cookie=0x0, duration=1943.869s, table=80, n_packets=445, n_bytes=42658, idle_age=0, priority=100,metadata=0x7 actions=output:1 qg ping vm112345678910111213cookie=0x0, duration=2195s, table=0, n_packets=1201, n_bytes=98278, idle_age=6, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=2202.007s, table=2, n_packets=1204, n_bytes=98458, idle_age=2, priority=1 actions=resubmit(,55) cookie=0x0, duration=2247.938s, table=55, n_packets=1637, n_bytes=159628, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:21:be:e8 actions=load:0x7d->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=2276.419s, table=75, n_packets=1953, n_bytes=183032, idle_age=0, priority=100,reg7=0x7d actions=resubmit(,105) cookie=0x0, duration=2293.667s, table=105, n_packets=1675, n_bytes=166884, idle_age=0, priority=100,ip,reg7=0x7d actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=2310.706s, table=110, n_packets=2665, n_bytes=261170, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie=0x0, duration=2315.762s, 
table=115, n_packets=2025, n_bytes=184700, idle_age=0, priority=100,reg7=0x7d actions=output:16 1234567891011121314151617cookie=0x0, duration=2340.117s, table=0, n_packets=1774, n_bytes=170156, idle_age=1, priority=100,in_port=16 actions=load:0x7d->NXM_NX_REG6[],load:0x7->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,)cookie=0x0, duration=2353.382s, table=0, n_packets=3191, n_bytes=286006, idle_age=0, priority=1 actions=resubmit(,5)cookie=0x0, duration=2363.778s, table=5, n_packets=1732, n_bytes=169736, idle_age=0, priority=200,ip,reg6=0x7d,dl_src=fa:16:3e:21:be:e8,nw_src=192.168.56.53 actions=resubmit(,10)cookie=0x0, duration=2372.921s, table=10, n_packets=1741, n_bytes=170618, idle_age=0, priority=100,ip,reg6=0x7d actions=ct(table=15,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=2383.859s, table=15, n_packets=2742, n_bytes=268716, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20)cookie=0x0, duration=2393.010s, table=20, n_packets=2753, n_bytes=269794, idle_age=1, priority=1 actions=resubmit(,55)cookie=0x0, duration=2461.241s, table=55, n_packets=287, n_bytes=28126, idle_age=1, priority=100,metadata=0x7,dl_dst=fa:16:3e:41:ef:d4 actions=load:0x25->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=2475.472s, table=75, n_packets=302, n_bytes=29596, idle_age=0, priority=100,reg7=0x25 actions=resubmit(,80)cookie=0x0, duration=2430.835s, table=80, n_packets=827, n_bytes=79870, idle_age=0, priority=100,metadata=0x7 actions=output:1","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"draonflow 流表结构","slug":"draonflow-流表结构","permalink":"http://vcpu.me/tags/draonflow-流表结构/"}]},{"title":"dragonflow fip流量走向图示","slug":"dragonflow_fip_flows","date":"2018-04-08T10:00:00.000Z","updated":"2018-06-12T06:46:15.000Z","comments":true,"path":"dragonflow_fip_flows/","link":"","permalink":"http://vcpu.me/dragonflow_fip_flows/","excerpt":"","text":"外网网关请求fip 
arp流量123456789cookie=0x0, duration=86.056s, table=0, n_packets=64, n_bytes=3228, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2)cookie=0x0, duration=185.592s, table=2, n_packets=163, n_bytes=9198, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=195.967s, table=55, n_packets=174, n_bytes=9828, idle_age=0, priority=200,arp,metadata=0x7,arp_tpa=192.168.56.55,arp_op=1 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:d4:60:2d,load:0xfa163ed4602d->NXM_NX_ARP_SHA[],load:0xc0a83837->NXM_OF_ARP_SPA[],move:NXM_NX_REG6[]->NXM_NX_REG7[],load:0x84->NXM_NX_REG6[],resubmit(,75)cookie=0x0, duration=213.471s, table=75, n_packets=191, n_bytes=10848, idle_age=0, priority=50,metadata=0x7 actions=resubmit(,80)cookie=0x0, duration=222.636s, table=80, n_packets=200, n_bytes=11388, idle_age=1, priority=100,metadata=0x7 actions=output:1 在table55反转ARP的时候,把reg7置为0,reg6置为fip接口标记;因为reg7是0,所以最终选择默认路径,又因为是外网网段,将该数据包发给外网接口 vm1 ping self fip1234567891011121314151617181920212223cookie=0x0, duration=822.977s, table=0, n_packets=704, n_bytes=67984, idle_age=0, priority=100,in_port=24 actions=load:0x81->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,)cookie=0x0, duration=840.564s, table=0, n_packets=723, n_bytes=70282, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=849.699s, table=5, n_packets=712, n_bytes=69776, idle_age=1, priority=200,ip,reg6=0x81,dl_src=fa:16:3e:6c:c0:49,nw_src=1.1.1.5 actions=resubmit(,10) cookie=0x0, duration=859.863s, table=10, n_packets=723, n_bytes=70854, idle_age=0, priority=100,ip,reg6=0x81 actions=ct(table=15,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=859.980s, table=15, n_packets=723, n_bytes=70854, idle_age=0, priority=6,conj_id=4,ip
actions=ct(commit,table=20,zone=NXM_NX_CT_ZONE[]) cookie=0x0, duration=877.145s, table=20, n_packets=737, n_bytes=72226, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=959.737s, table=55, n_packets=813, n_bytes=79674, idle_age=1, priority=200,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 actions=load:0x2->NXM_NX_REG5[],resubmit(,60) cookie=0x0, duration=968.577s, table=60, n_packets=822, n_bytes=80556, idle_age=1, priority=70,ip,reg5=0x2,reg6=0x81,metadata=0x8,nw_src=1.1.1.5 actions=resubmit(,77)cookie=0x0, duration=982.806s, table=77, n_packets=837, n_bytes=82026, idle_age=0, priority=100,ip,reg5=0x2,reg6=0x81,metadata=0x8,nw_src=1.1.1.5 actions=dec_ttl,mod_dl_src:fa:16:3e:d4:60:2d,mod_dl_dst:00:00:00:00:00:00,mod_nw_src:192.168.56.55,load:0x7->OXM_OF_METADATA[],load:0x84->NXM_NX_REG6[],resubmit(,55)cookie=0x0, duration=1078.560s, table=55, n_packets=951, n_bytes=92506, idle_age=1, priority=70,metadata=0x7,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,75) cookie=0x0, duration=1175.212s, table=75, n_packets=1058, n_bytes=102060, idle_age=0, priority=50,metadata=0x7 actions=resubmit(,80) cookie=0x0, duration=1182.328s, table=80, n_packets=1036, n_bytes=101528, idle_age=0, priority=200,metadata=0x7,dl_dst=00:00:00:00:00:00 actions=mod_dl_dst:08:00:27:12:3f:5b,output:1 12345678910111213141516171819 cookie=0x0, duration=7436.637s, table=0, n_packets=5183, n_bytes=488891, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=7444.618s, table=2, n_packets=4693, n_bytes=459914, idle_age=0, priority=200,ip,reg6=0,nw_src=192.168.56.55 actions=load:0x81->NXM_NX_REG6[],resubmit(,55) cookie=0x0, duration=7536.097s, table=55, n_packets=4882, n_bytes=481180, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:d4:60:2d actions=load:0x84->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=7586.900s, table=75, n_packets=5168, 
n_bytes=496241, idle_age=0, priority=200,reg7=0x84 actions=resubmit(,76) cookie=0x0, duration=7607.216s, table=76, n_packets=4888, n_bytes=479909, idle_age=0, priority=100,ip,reg7=0x84 actions=dec_ttl,mod_dl_src:fa:16:3e:07:f8:41,mod_dl_dst:fa:16:3e:6c:c0:49,mod_nw_dst:1.1.1.5,load:0x81->NXM_NX_REG7[],load:0x8->OXM_OF_METADATA[],resubmit(,55) cookie=0x0, duration=7664.254s, table=55, n_packets=4945, n_bytes=485495, idle_age=0, priority=100,metadata=0x8,dl_dst=fa:16:3e:6c:c0:49 actions=load:0x81->NXM_NX_REG7[],resubmit(,75)cookie=0x0, duration=7693.739s, table=75, n_packets=4954, n_bytes=485244, idle_age=1, priority=200,reg6=0x81,reg7=0x81 actions=drop 需要将该流表去除,否则ping 自己的fip 不通cookie=0x0, duration=121.067s, table=75, n_packets=133, n_bytes=13034, idle_age=0, priority=100,reg7=0x81 actions=resubmit(,105)cookie=0x0, duration=147.545s, table=110, n_packets=177, n_bytes=17346, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115)cookie=0x0, duration=154.348s, table=115, n_packets=202, n_bytes=19572, idle_age=0, priority=100,reg7=0x81 actions=output:24 gw ping fip 12345678910111213141516171819cookie=0x0, duration=42.589s, table=0, n_packets=42, n_bytes=4116, idle_age=0, priority=50,in_port=1,vlan_tci=0x0000/0x1fff actions=load:0->OXM_OF_IN_PORT[],load:0x7->OXM_OF_METADATA[],load:0x64->NXM_NX_REG8[],resubmit(,2) cookie=0x0, duration=52.497s, table=2, n_packets=51, n_bytes=4998, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=84.105s, table=55, n_packets=83, n_bytes=8134, idle_age=0, priority=100,metadata=0x7,dl_dst=fa:16:3e:d4:60:2d actions=load:0x84->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=120.791s, table=75, n_packets=120, n_bytes=11760, idle_age=0, priority=200,reg7=0x84 actions=resubmit(,76) cookie=0x0, duration=130.945s, table=76, n_packets=129, n_bytes=12642, idle_age=1, priority=100,ip,reg7=0x84 
actions=dec_ttl,mod_dl_src:fa:16:3e:07:f8:41,mod_dl_dst:fa:16:3e:6c:c0:49,mod_nw_dst:1.1.1.5,load:0x81->NXM_NX_REG7[],load:0x8->OXM_OF_METADATA[],resubmit(,55)cookie=0x0, duration=426.456s, table=55, n_packets=422, n_bytes=41356, idle_age=0, priority=100,metadata=0x8,dl_dst=fa:16:3e:6c:c0:49 actions=load:0x81->NXM_NX_REG7[],resubmit(,75) cookie=0x0, duration=441.774s, table=75, n_packets=437, n_bytes=42826, idle_age=0, priority=100,reg7=0x81 actions=resubmit(,105) cookie=0x0, duration=452.940s, table=105, n_packets=448, n_bytes=43904, idle_age=0, priority=100,ip,reg7=0x81 actions=ct(table=110,zone=OXM_OF_METADATA[0..15]) cookie=0x0, duration=462.916s, table=110, n_packets=457, n_bytes=44786, idle_age=0, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,115) cookie=0x0, duration=469.042s, table=115, n_packets=478, n_bytes=46060, idle_age=0, priority=100,reg7=0x81 actions=output:24 1234567891011121314151617181920212223 cookie=0x0, duration=491.468s, table=0, n_packets=501, n_bytes=48258, idle_age=0, priority=100,in_port=24 actions=load:0x81->NXM_NX_REG6[],load:0x8->OXM_OF_METADATA[],load:0->OXM_OF_IN_PORT[],resubmit(,) cookie=0x0, duration=513.855s, table=0, n_packets=523, n_bytes=50414, idle_age=0, priority=1 actions=resubmit(,5) cookie=0x0, duration=526.255s, table=5, n_packets=520, n_bytes=50960, idle_age=0, priority=200,ip,reg6=0x81,dl_src=fa:16:3e:6c:c0:49,nw_src=1.1.1.5 actions=resubmit(,10)cookie=0x0, duration=539.468s, table=10, n_packets=533, n_bytes=52234, idle_age=0, priority=100,ip,reg6=0x81 actions=ct(table=15,zone=OXM_OF_METADATA[0..15])cookie=0x0, duration=550.488s, table=15, n_packets=543, n_bytes=53214, idle_age=1, priority=65534,ct_state=-new+est-rel-inv+trk actions=resubmit(,20)cookie=0x0, duration=561.685s, table=20, n_packets=555, n_bytes=54390, idle_age=0, priority=1 actions=resubmit(,55) cookie=0x0, duration=622.180s, table=55, n_packets=615, n_bytes=60270, idle_age=0, priority=200,metadata=0x8,dl_dst=fa:16:3e:07:f8:41 
actions=load:0x2->NXM_NX_REG5[],resubmit(,60) cookie=0x0, duration=630.094s, table=60, n_packets=623, n_bytes=61054, idle_age=0, priority=70,ip,reg5=0x2,reg6=0x81,metadata=0x8,nw_src=1.1.1.5 actions=resubmit(,77)cookie=0x0, duration=640.276s, table=77, n_packets=633, n_bytes=62034, idle_age=0, priority=100,ip,reg5=0x2,reg6=0x81,metadata=0x8,nw_src=1.1.1.5 actions=dec_ttl,mod_dl_src:fa:16:3e:d4:60:2d,mod_dl_dst:00:00:00:00:00:00,mod_nw_src:192.168.56.55,load:0x7->OXM_OF_METADATA[],load:0x84->NXM_NX_REG6[],resubmit(,55)cookie=0x0, duration=675.033s, table=55, n_packets=756, n_bytes=70706, idle_age=0, priority=70,metadata=0x7,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,75) cookie=0x0, duration=701.542s, table=75, n_packets=694, n_bytes=68012, idle_age=0, priority=50,metadata=0x7 actions=resubmit(,80) cookie=0x0, duration=709.685s, table=80, n_packets=701, n_bytes=68698, idle_age=1, priority=200,metadata=0x7,dl_dst=00:00:00:00:00:00 actions=mod_dl_dst:08:00:27:12:3f:5b,output:1","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"draonflow 流表结构","slug":"draonflow-流表结构","permalink":"http://vcpu.me/tags/draonflow-流表结构/"}]},{"title":"春","slug":"chun","date":"2018-03-31T10:00:00.000Z","updated":"2018-03-31T07:09:16.000Z","comments":true,"path":"chun/","link":"","permalink":"http://vcpu.me/chun/","excerpt":"","text":"","categories":[{"name":"诗和远方","slug":"诗和远方","permalink":"http://vcpu.me/categories/诗和远方/"}],"tags":[{"name":"春","slug":"春","permalink":"http://vcpu.me/tags/春/"}]},{"title":"Dragonflow etcdv3数据读取方法","slug":"read_dragonflow_etcd_data","date":"2018-03-22T10:00:00.000Z","updated":"2018-06-12T07:17:08.000Z","comments":true,"path":"read_dragonflow_etcd_data/","link":"","permalink":"http://vcpu.me/read_dragonflow_etcd_data/","excerpt":"","text":"etcdctl命令行执行前,你需要确认是否etcdctl api版本是v3的 确认方法:123stack@odf-controller:~$ etcdctl --versionetcdctl version: 3.0.15API version: 2 
etcdctl如何修改API使用v3 ?12345stack@odf-controller:~$ export ETCDCTL_API=3stack@odf-controller:~$ etcdctl versionetcdctl version: 3.0.15API version: 3.0stack@odf-controller:~$ 小细节: v2 查看version使用 etcdctl --version v3 查看version使用 etcdctl version etcdctl读取数据的方法方法1: etcdctl --endpoints=172.16.0.31:2379 get \"\" --from-key 方法2: etcdctl --endpoints=172.16.0.31:2379 get \"/\" --prefix etcdctl读取目录是/lport的数据方法etcdctl --endpoints=172.16.0.31:2379 get \"/lport\" --prefix python API读取数据1234567import etcd3gwe3gw = etcd3gw.client('172.16.0.31', 2379, ca_cert=None, cert_cert=None, timeout=None)print e3gw.get_prefix('/lport/')print e3gw.get_prefix('/')print e3gw.get('/chassis/odf-controller') Dragonflow etcd数据key和value12345678910111213141516171819202122232425262728293031323334353637383940414243stack@odf-controller:~/devstack$ etcdctl --endpoints=172.16.0.31:2379 get \"\" --from-key/chassis/odf-controller{\"tunnel_types\": [\"vxlan\", \"geneve\", \"gre\"], \"ip\": \"172.16.0.31\", \"id\": \"odf-controller\"}/chassis/pdf-compute{\"tunnel_types\": [\"vxlan\", \"geneve\", \"gre\"], \"ip\": \"172.16.0.34\", \"id\": \"pdf-compute\"}/lport/145cf8a3-1b0c-47f5-ab98-d791941ff901{\"dhcp_params\": {\"opts\": {}}, \"subnets\": [\"add073aa-bcd2-4232-a1da-40ae4695e117\"], \"binding_vnic_type\": \"normal\", \"name\": \"\", \"allowed_address_pairs\": [], \"lswitch\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"enabled\": true, \"unique_key\": 7, \"binding\": {\"chassis\": \"odf-controller\", \"type\": \"chassis\"}, \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"ips\": [\"192.168.11.1\"], \"device_owner\": \"network:router_interface\", \"version\": 6, \"port_security_enabled\": false, \"macs\": [\"fa:16:3e:02:79:97\"], \"id\": \"145cf8a3-1b0c-47f5-ab98-d791941ff901\", \"security_groups\": [], \"device_id\": \"99e409c7-394c-47b6-b638-41d2e8ade294\"}/lport/77978d74-4f7f-49e9-8015-2788962e500c{\"dhcp_params\": {\"opts\": {}}, \"subnets\": [\"add073aa-bcd2-4232-a1da-40ae4695e117\"], \"binding_vnic_type\": 
\"normal\", \"name\": \"\", \"allowed_address_pairs\": [], \"lswitch\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"enabled\": true, \"unique_key\": 9, \"binding\": {\"chassis\": \"odf-controller\", \"type\": \"chassis\"}, \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"ips\": [\"192.168.11.12\"], \"device_owner\": \"compute:nova\", \"version\": 4, \"port_security_enabled\": true, \"macs\": [\"fa:16:3e:e6:10:7f\"], \"id\": \"77978d74-4f7f-49e9-8015-2788962e500c\", \"security_groups\": [\"6065f7ed-e762-45f3-994f-56cfd6d0ec29\"], \"device_id\": \"77a17a11-c998-4924-9a17-e130a6a59096\"}/lport/818c15f5-fb5e-4971-9a07-237c2e2dcd75{\"dhcp_params\": {\"opts\": {}}, \"subnets\": [\"add073aa-bcd2-4232-a1da-40ae4695e117\"], \"binding_vnic_type\": \"normal\", \"name\": \"\", \"allowed_address_pairs\": [], \"lswitch\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"unique_key\": 6, \"enabled\": true, \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"ips\": [\"192.168.11.2\"], \"device_owner\": \"network:dhcp\", \"version\": 3, \"port_security_enabled\": false, \"macs\": [\"fa:16:3e:ea:40:0a\"], \"id\": \"818c15f5-fb5e-4971-9a07-237c2e2dcd75\", \"security_groups\": [], \"device_id\": \"\"}/lport/de8ad7ca-2635-463c-8c66-abf1cd2a6b8d{\"dhcp_params\": {\"opts\": {}}, \"subnets\": [\"add073aa-bcd2-4232-a1da-40ae4695e117\"], \"binding_vnic_type\": \"normal\", \"name\": \"\", \"allowed_address_pairs\": [], \"lswitch\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"enabled\": true, \"unique_key\": 8, \"binding\": {\"chassis\": \"pdf-compute\", \"type\": \"chassis\"}, \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"ips\": [\"192.168.11.3\"], \"device_owner\": \"compute:nova\", \"version\": 4, \"port_security_enabled\": true, \"macs\": [\"fa:16:3e:67:ab:ae\"], \"id\": \"de8ad7ca-2635-463c-8c66-abf1cd2a6b8d\", \"security_groups\": [\"6065f7ed-e762-45f3-994f-56cfd6d0ec29\"], \"device_id\": \"7abf2de9-e572-4fe7-a618-42374aea35a7\"}/lrouter/99e409c7-394c-47b6-b638-41d2e8ade294{\"name\": 
\"r1\", \"unique_key\": 1, \"id\": \"99e409c7-394c-47b6-b638-41d2e8ade294\", \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 3, \"routes\": [], \"ports\": [{\"network\": \"192.168.11.1/24\", \"unique_key\": 7, \"lswitch\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"mac\": \"fa:16:3e:02:79:97\", \"id\": \"145cf8a3-1b0c-47f5-ab98-d791941ff901\"}]}/lsubnet/add073aa-bcd2-4232-a1da-40ae4695e117{\"name\": \"net1-subnet\", \"enable_dhcp\": true, \"lswitch\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"dns_nameservers\": [], \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 0, \"host_routes\": [], \"gateway_ip\": \"192.168.11.1\", \"cidr\": \"192.168.11.0/24\", \"id\": \"add073aa-bcd2-4232-a1da-40ae4695e117\"}/lswitch/2c1c2b32-3fb0-423c-8910-bf075adbe331{\"name\": \"net1\", \"unique_key\": 2, \"segmentation_id\": 100, \"mtu\": 1450, \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 5, \"id\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"network_type\": \"vxlan\", \"is_external\": false}/lswitch/5382c5f1-4b80-4b66-b7aa-b8348b04504a{\"name\": \"external\", \"unique_key\": 1, \"mtu\": 1500, \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 3, \"physical_network\": \"public\", \"id\": \"5382c5f1-4b80-4b66-b7aa-b8348b04504a\", \"network_type\": \"flat\", \"is_external\": true}/pubsub/D��action�set�table�lport�topic�D�value�{\"dhcp_params\": {\"opts\": {}}, \"subnets\": [\"add073aa-bcd2-4232-a1da-40ae4695e117\"], \"binding_vnic_type\": \"normal\", \"name\": \"\", \"allowed_address_pairs\": [], \"lswitch\": \"2c1c2b32-3fb0-423c-8910-bf075adbe331\", \"enabled\": true, \"unique_key\": 9, \"binding\": {\"chassis\": \"odf-controller\", \"type\": \"chassis\"}, \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"ips\": [\"192.168.11.12\"], \"device_owner\": \"compute:nova\", \"version\": 4, \"port_security_enabled\": true, \"macs\": [\"fa:16:3e:e6:10:7f\"], \"id\": 
\"77978d74-4f7f-49e9-8015-2788962e500c\", \"security_groups\": [\"6065f7ed-e762-45f3-994f-56cfd6d0ec29\"], \"device_id\": \"77a17a11-c998-4924-9a17-e130a6a59096\"}�key�$77978d74-4f7f-49e9-8015-2788962e500c/secgroup/6065f7ed-e762-45f3-994f-56cfd6d0ec29{\"name\": \"default\", \"unique_key\": 1, \"rules\": [{\"direction\": \"egress\", \"ethertype\": \"IPv4\", \"security_group_id\": \"6065f7ed-e762-45f3-994f-56cfd6d0ec29\", \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 0, \"id\": \"2cdf6659-6ccc-4bc2-9f42-b3e184233eb0\"}, {\"remote_group_id\": \"6065f7ed-e762-45f3-994f-56cfd6d0ec29\", \"direction\": \"ingress\", \"ethertype\": \"IPv6\", \"security_group_id\": \"6065f7ed-e762-45f3-994f-56cfd6d0ec29\", \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 0, \"id\": \"3de51a83-3a21-4356-b9b3-7035975060ff\"}, {\"remote_group_id\": \"6065f7ed-e762-45f3-994f-56cfd6d0ec29\", \"direction\": \"ingress\", \"ethertype\": \"IPv4\", \"security_group_id\": \"6065f7ed-e762-45f3-994f-56cfd6d0ec29\", \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 0, \"id\": \"56b98969-7ff2-44c1-a4c8-aa29c1c58681\"}, {\"direction\": \"egress\", \"ethertype\": \"IPv6\", \"security_group_id\": \"6065f7ed-e762-45f3-994f-56cfd6d0ec29\", \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 0, \"id\": \"f9da0837-81ab-49e2-9bb9-c1ea71c89f78\"}], \"topic\": \"a5972cf299374500acd2699c8ea1cc55\", \"version\": 4, \"id\": \"6065f7ed-e762-45f3-994f-56cfd6d0ec29\"}/secgroup/d6017315-fb18-45c6-8b4d-9a38bdcc6e18{\"name\": \"default\", \"unique_key\": 2, \"rules\": [{\"direction\": \"egress\", \"ethertype\": \"IPv6\", \"security_group_id\": \"d6017315-fb18-45c6-8b4d-9a38bdcc6e18\", \"topic\": \"6ea6f279297e465fa64856820fa4d3e2\", \"version\": 0, \"id\": \"0adea178-bf80-411f-9136-c326081879a8\"}, {\"remote_group_id\": \"d6017315-fb18-45c6-8b4d-9a38bdcc6e18\", \"direction\": \"ingress\", \"ethertype\": \"IPv6\", \"security_group_id\": 
\"d6017315-fb18-45c6-8b4d-9a38bdcc6e18\", \"topic\": \"6ea6f279297e465fa64856820fa4d3e2\", \"version\": 0, \"id\": \"513e5100-44d5-4a21-8873-17f44e9c48dd\"}, {\"direction\": \"egress\", \"ethertype\": \"IPv4\", \"security_group_id\": \"d6017315-fb18-45c6-8b4d-9a38bdcc6e18\", \"topic\": \"6ea6f279297e465fa64856820fa4d3e2\", \"version\": 0, \"id\": \"5b23365d-bb46-4985-b1ea-840224980207\"}, {\"remote_group_id\": \"d6017315-fb18-45c6-8b4d-9a38bdcc6e18\", \"direction\": \"ingress\", \"ethertype\": \"IPv4\", \"security_group_id\": \"d6017315-fb18-45c6-8b4d-9a38bdcc6e18\", \"topic\": \"6ea6f279297e465fa64856820fa4d3e2\", \"version\": 0, \"id\": \"e3ade7c3-30b6-4395-b032-164dae738285\"}], \"topic\": \"6ea6f279297e465fa64856820fa4d3e2\", \"version\": 4, \"id\": \"d6017315-fb18-45c6-8b4d-9a38bdcc6e18\"}/service/4a475813-5df5-5a73-b8cd-5dc866fc4f62{\"binary\": \"df-metadata-service\", \"last_seen_up\": 1520408148.664552, \"chassis\": \"pdf-compute\", \"id\": \"4a475813-5df5-5a73-b8cd-5dc866fc4f62\"}/service/b44d242e-19ee-5541-ad33-b37145edddb0{\"binary\": \"df-local-controller\", \"last_seen_up\": 1520408148.002708, \"chassis\": \"pdf-compute\", \"id\": \"b44d242e-19ee-5541-ad33-b37145edddb0\"}/service/b77172b3-095e-56c6-8592-2de542f2094d{\"binary\": \"df-metadata-service\", \"last_seen_up\": 1520408142.636248, \"chassis\": \"odf-controller\", \"id\": \"b77172b3-095e-56c6-8592-2de542f2094d\"}/service/fdbf756f-2387-5cc8-b0bb-f8010422bdbd{\"binary\": \"df-local-controller\", \"last_seen_up\": 1520408124.881775, \"chassis\": \"odf-controller\", \"id\": \"fdbf756f-2387-5cc8-b0bb-f8010422bdbd\"}/unique_key/lport9/unique_key/lrouter1/unique_key/lswitch2/unique_key/secgroup2 df-db tables能够清晰展示etcd数据dragonflow 2018新版本支持df-db命令用来显示分布式存储etcd中的数据,你可以使用这个来观看etcd数据 1234567891011121314151617181920212223242526stack@odf-controller:~/devstack$ df-db tables+-------------------------+| DB Tables |+-------------------------+| activeport || bgp_peer || bgp_speaker || chassis || 
child_port_segmentation || floatingip || listener || lport || lrouter || lsubnet || lswitch || migration || publisher || qospolicy || secgroup || service || sfc_flowclassifier || sfc_portchain || sfc_portpair || sfc_portpairgroup || unique_key |+-------------------------+","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"Dragonflow etcdv3数据读取方法","slug":"Dragonflow-etcdv3数据读取方法","permalink":"http://vcpu.me/tags/Dragonflow-etcdv3数据读取方法/"}]},{"title":"etcdv3使用","slug":"etcdv3","date":"2018-03-21T10:00:00.000Z","updated":"2018-06-12T07:14:53.000Z","comments":true,"path":"etcdv3/","link":"","permalink":"http://vcpu.me/etcdv3/","excerpt":"","text":"安装下载etcdv3包12cd /usr/local/src/ wget https://github.com/coreos/etcd/releases/download/v3.2.15/etcd-v3.2.15-linux-amd64.tar.gz 解压12tar -zxvf etcd-v3.2.15-linux-amd64.tar.gzmv etcd-v3.2.15-linux-amd64 etcd 添加v3环境变量将代码路径加入环境变量中,并使用etcdv3的api接口 123vim /etc/profileexport PATH=/usr/local/src/etcd:$PATHexport ETCDCTL_API=3 生效 source /etc/profile 简易启动local standlone cluster参考:https://coreos.com/etcd/docs/latest/dev-guide/local_cluster.html 切换到:/usr/local/src/etc 目录直接启动etcd服务进程123456789root@ubuntu:/usr/local/src/etcd# ./etcd&[1] 19228root@ubuntu:/usr/local/src/etcd# 2018-03-07 13:40:16.719105 I | etcdmain: etcd Version: 3.2.152018-03-07 13:40:16.719217 I | etcdmain: Git SHA: 1b3ac99e82018-03-07 13:40:16.719235 I | etcdmain: Go Version: go1.8.52018-03-07 13:40:16.719244 I | etcdmain: Go OS/Arch: linux/amd642018-03-07 13:40:16.719254 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 42018-03-07 13:40:16.719274 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd2018-03-07 13:40:16.719422 N | etcdmain: the server is already initialized as 判别是否启动成功查看 2379 和 2380 两个端口服务都已经启动成功 etcdctl操作etcd存储额外说明:etcdctl get “” –from-key 是查看所有数据 12345678910111213141516171819202122232425root@ubuntu:~# etcdctl del 
f1root@ubuntu:~# etcdctl get \"\" --from-keyfoxxxfoobarroot@ubuntu:~# etcdctl put q qqOKroot@ubuntu:~# etcdctl get \"\" --from-keyfoxxxfoobarqqqroot@ubuntu:~# etcdctl put q qqqOKroot@ubuntu:~# etcdctl get \"\" --from-keyfoxxxfoobarqqqq python API命令操作etcd安装api库1pip install etcd3gw 操作方法123456789101112131415161718import etcd3gwe3gw = etcd3gw.client('127.0.0.1', 2379, ca_cert=None, cert_cert=None, timeout=None)# add one key valuee3gw.create('sdn', '4')# delete one key valuee3gw.delete('sdn')# replace one key valuee3gw.replace('sdn', '4', '40')# get one key's valueprint e3gw.get('sdn')# get allprint e3gw.get_all()","categories":[{"name":"dragonflow","slug":"dragonflow","permalink":"http://vcpu.me/categories/dragonflow/"}],"tags":[{"name":"etcdv3使用","slug":"etcdv3使用","permalink":"http://vcpu.me/tags/etcdv3使用/"}]},{"title":"OSX Fusion ubuntu16.04.2 虚拟机安装dpdk ovs","slug":"dpdk-ovs-vhost-user-kvm","date":"2018-02-28T10:00:00.000Z","updated":"2018-03-01T03:27:34.000Z","comments":true,"path":"dpdk-ovs-vhost-user-kvm/","link":"","permalink":"http://vcpu.me/dpdk-ovs-vhost-user-kvm/","excerpt":"为什么会有这篇文章: 笔者想体验下ovs dpdk情况下,云主机使用最高网络io性能的vhostuser方式,可是却没有空闲的物理机器,且在物理机器调试也没有在虚拟机调试方便;于是笔者花了一些时间探究如何在虚拟机中完成dpdk vhostuser运行工作;本文涉及到kvm安装、ovs编译、dpdk编译、虚拟机中以kvm方式启动新的虚拟机、ovs流表等信息,以实践为主,先跑起来是更加理解的基础。因为环境差异,你参考这篇可能会遇到新的问题,请google解决吧。 注意:1.为了虚拟机内也支持硬件虚拟化,笔者选择fusion8 安装ubuntu16.04操作系统(fusion下载要注册,说实话很麻烦),勾选Intel VT-x/EPT2.本人笔记本mac pro16款的,不需要设置,就可以在虚拟机中支持硬件虚拟化;不要拿virtual box跑,因为其建立虚拟机中不支持kvm","text":"为什么会有这篇文章: 笔者想体验下ovs dpdk情况下,云主机使用最高网络io性能的vhostuser方式,可是却没有空闲的物理机器,且在物理机器调试也没有在虚拟机调试方便;于是笔者花了一些时间探究如何在虚拟机中完成dpdk vhostuser运行工作;本文涉及到kvm安装、ovs编译、dpdk编译、虚拟机中以kvm方式启动新的虚拟机、ovs流表等信息,以实践为主,先跑起来是更加理解的基础。因为环境差异,你参考这篇可能会遇到新的问题,请google解决吧。 注意:1.为了虚拟机内也支持硬件虚拟化,笔者选择fusion8 安装ubuntu16.04操作系统(fusion下载要注册,说实话很麻烦),勾选Intel VT-x/EPT2.本人笔记本mac pro16款的,不需要设置,就可以在虚拟机中支持硬件虚拟化;不要拿virtual box跑,因为其建立虚拟机中不支持kvm 安装调试平台介绍 osx操作系统VMware Fusion 8安装Ubuntu 16.04.2,中安装dpdk和ovs 123456root@qinlong:~/dpdk-16.11# lsb_release -aNo 
LSB modules are available.Distributor ID: UbuntuDescription: Ubuntu 16.04.2 LTSRelease: 16.04Codename: xenial 开启4个虚拟机网卡: KVM安装启动虚拟机是否支持硬件计算资源虚拟话?1egrep -c '(svm|vmx)' /proc/cpuinfo 如果得到结果大于0就代表所在host主机是支持虚拟话的 安装kvm相关依赖1apt-get install kvm qemu-kvm libvirt-bin virtinst bridge-utils 安装完成后尝试启动已经制作好的带有操作系统的硬盘/root/ubuntu-16.04-root-1.img 为带有操作系统的硬盘 1qemu-system-x86_64 -m 1024 -smp 2 -cpu host -hda /root/ubuntu-16.04-root-1.img -boot c -enable-kvm -no-reboot -net none -nographic -boot c -vnc :0 DPDK编译安装基础库安装sudo apt-get install m4 bison flex wget http://dpdk.org/browse/dpdk/snapshot/dpdk-16.11.tar.gz 12tar -zxvf dpdk-16.11.tar.gzcd dpdk-16.11/ 修改一处代码:否则后续会报错EAL: Error reading from file descriptor 23: Input/output error vim lib/librte_eal/linuxapp/igb_uio/igb_uio.c 编译安装1234mkdir -p /usr/src/dpdkmake config T=x86_64-native-linuxapp-gccmake install T=x86_64-native-linuxapp-gcc DESTDIR=/usr/src/dpdkmake install T=x86_64-native-linuxapp-gcc DESTDIR=/usr 说明:/usr/src/dpdk /usr 路径安装ovs用到 ovs编译安装下载1wget http://openvswitch.org/releases/openvswitch-2.7.0.tar.gz 编译1234567891011tar -zxvf openvswitch-2.7.0.tar.gzcd openvswitch-2.7.0/ ./boot.sh./configure \\--with-dpdk=/usr/src/dpdk \\--prefix=/usr \\--exec-prefix=/usr \\--sysconfdir=/etc \\--localstatedir=/varmakemake install 运行ovs设置步骤1: /etc/default/grub 添加iommu=pt intel_iommu=on 1GRUB_CMDLINE_LINUX_DEFAULT=\"iommu=pt intel_iommu=on\" 步骤2: 1update-grub 步骤3:重启 步骤4:查看是否配置完成 12root@qinlong:~/openvswitch-2.7.0# cat /proc/cmdlineBOOT_IMAGE=/vmlinuz-4.4.0-62-generic root=/dev/mapper/qinlong--vg-root ro iommu=pt intel_iommu=on 绑定网卡绑定12modprobe uioinsmod dpdk-16.11/x86_64-native-linuxapp-gcc/kmod/igb_uio.ko 查看是否加载成功:123root@qinlong:~# lsmod |grep uioigb_uio 16384 0uio 20480 1 igb_uio 查看当前网卡状态 配置大页修改大页占4G内存12echo 2048 > /proc/sys/vm/nr_hugepagesecho 'vm.nr_hugepages=2048' > /etc/sysctl.d/hugepages.conf 说明:2048*2M =4G,你要注意你系统内存是否有这么大,否则大页内存会分配失败,笔者虚拟机是8G内存 查看当前大页12345root@qinlong:~/dpdk-16.11# grep HugePages_ /proc/meminfoHugePages_Total: 
2048HugePages_Free: 2048HugePages_Rsvd: 0HugePages_Surp: 0 挂载大页1mount -t hugetlbfs none /dev/hugepages 启动ovs进程1234567891011root@qinlong:~/dpdk-16.11# mkdir -p /etc/openvswitchroot@qinlong:~/dpdk-16.11# mkdir -p /var/run/openvswitchroot@qinlong:~/dpdk-16.11# ovsdb-server /etc/openvswitch/conf.db \\-vconsole:emer -vsyslog:err -vfile:info \\--remote=punix:/var/run/openvswitch/db.sock \\--private-key=db:Open_vSwitch,SSL,private_key \\--certificate=db:Open_vSwitch,SSL,certificate \\--bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --no-chdir \\--log-file=/var/log/openvswitch/ovsdb-server.log \\--pidfile=/var/run/openvswitch/ovsdb-server.pid \\--detach --monitor 第一次运行1ovs-vsctl --no-wait init 初始化dpdkovs 启用dpdk1ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true 自定义dpdk的参数1ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=\"1024,0\" 指定dpdk运行的core1ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x03 查看确认dpdk设置123456root@qinlong:~/dpdk-16.11# ovs-vsctl get Open_vSwitch . other_config:dpdk-socket-mem\"1024,0\"root@qinlong:~/dpdk-16.11# ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask\"0x03\"root@qinlong:~/dpdk-16.11# ovs-vsctl get Open_vSwitch . 
other_config:dpdk-init\"true\" 启动vswitchd进程12345ovs-vswitchd unix:/var/run/openvswitch/db.sock \\-vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir \\--log-file=/var/log/openvswitch/ovs-vswitchd.log \\--pidfile=/var/run/openvswitch/ovs-vswitchd.pid \\--detach --monitor 启动过程记录如下:1234567891011121314151617181920root@qinlong:~/dpdk-16.11# ovs-vswitchd unix:/var/run/openvswitch/db.sock \\> -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir \\> --log-file=/var/log/openvswitch/ovs-vswitchd.log \\> --pidfile=/var/run/openvswitch/ovs-vswitchd.pid \\> --detach --monitorEAL: Detected 8 lcore(s)EAL: No free hugepages reported in hugepages-1048576kBEAL: Probing VFIO support...EAL: PCI device 0000:02:01.0 on NUMA socket -1EAL: probe driver: 8086:100f net_e1000_emEAL: PCI device 0000:02:02.0 on NUMA socket -1EAL: probe driver: 8086:100f net_e1000_emEAL: PCI device 0000:02:03.0 on NUMA socket -1EAL: probe driver: 8086:100f net_e1000_emEAL: PCI device 0000:02:04.0 on NUMA socket -1EAL: probe driver: 8086:100f net_e1000_emVHOST_CONFIG: vhost-user server: socket created, fd: 35VHOST_CONFIG: bind to /var/run/openvswitch/vhost-user2VHOST_CONFIG: vhost-user server: socket created, fd: 45VHOST_CONFIG: bind to /var/run/openvswitch/vhost-user1 启动成功后,ovs-vswitchd进程占用率200% ovs使用跑vhost-user并连通虚拟机 建立bridge br0和两个云主机对应的dpdkvhostuser接口123sudo ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdevsudo ovs-vsctl add-port br0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostusersudo ovs-vsctl add-port br0 vhost-user2 -- set Interface vhost-user2 type=dpdkvhostuser 查看是否bridge桥是否建立成功123456789101112root@qinlong:~# ovs-vsctl show280e45c6-9143-4aad-ac4b-2c2305a96d0f Bridge \"br0\" Port \"vhost-user2\" Interface \"vhost-user2\" type: dpdkvhostuser Port \"vhost-user1\" Interface \"vhost-user1\" type: dpdkvhostuser Port \"br0\" Interface \"br0\" type: internal 启动虚拟机1:设置vnc 登录号,设置vhost-user io,启动虚拟机,ubuntu-16.04-root-1.img为装有ubuntu操作系统的硬盘123456qemu-system-x86_64 -m 1024 -smp 
4 -cpu host -hda ~/ubuntu-16.04-root-1.img -boot c -enable-kvm -no-reboot -nographic -net none -vnc :0 \\-chardev socket,id=char1,path=/var/run/openvswitch/vhost-user1 \\-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce \\-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 \\-object memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on \\-numa node,memdev=mem -mem-prealloc 启动虚拟机2:设置vnc 登录号,设置vhost-user io,启动虚拟机,ubuntu-16.04-root-2.img为装有ubuntu操作系统的硬盘123456qemu-system-x86_64 -m 1024 -smp 4 -cpu host -hda /root/ubuntu-16.04-root-2.img -boot c -enable-kvm -no-reboot -nographic -net none -vnc :1 \\-chardev socket,id=char2,path=/var/run/openvswitch/vhost-user2 \\-netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce \\-device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2 \\-object memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on \\-numa node,memdev=mem -mem-prealloc 虚拟机1和虚拟机2连通性测试 通过vnc登录后,给云主机配置上一个网段的地址,就可以互通了。","categories":[{"name":"DPDK","slug":"DPDK","permalink":"http://vcpu.me/categories/DPDK/"}],"tags":[{"name":"osx操作系统","slug":"osx操作系统","permalink":"http://vcpu.me/tags/osx操作系统/"},{"name":"VMware Fusion 8","slug":"VMware-Fusion-8","permalink":"http://vcpu.me/tags/VMware-Fusion-8/"},{"name":"kvm","slug":"kvm","permalink":"http://vcpu.me/tags/kvm/"},{"name":"dpdk","slug":"dpdk","permalink":"http://vcpu.me/tags/dpdk/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"vhost-user","slug":"vhost-user","permalink":"http://vcpu.me/tags/vhost-user/"}]},{"title":"我的数据包丢在那里之一场iptables旅行","slug":"dropwatch","date":"2018-02-07T10:00:00.000Z","updated":"2018-02-11T14:48:59.000Z","comments":true,"path":"dropwatch/","link":"","permalink":"http://vcpu.me/dropwatch/","excerpt":"一般来搭建虚拟网络,或者其他业务情形,linux主机内部出现网络不通时候,有时候思路会断掉,因为内核协议栈对于专注于应用的开发者来说,属于黑盒子,看不到里面具体发生了什么,为什么丢包? 
当然大部分原因是由于我们使用方法和配置有误导致,但是从如果从上层看下层,从下层也反证上层,问题可以提前找到和快解决。 本文重点举例,说明虚拟网络应用场景下出现iptables策略问题,并且通过工具反查和反证问题过程。属于一场iptables稍微深层次点的旅行吧,与君共勉。","text":"一般来搭建虚拟网络,或者其他业务情形,linux主机内部出现网络不通时候,有时候思路会断掉,因为内核协议栈对于专注于应用的开发者来说,属于黑盒子,看不到里面具体发生了什么,为什么丢包? 当然大部分原因是由于我们使用方法和配置有误导致,但是从如果从上层看下层,从下层也反证上层,问题可以提前找到和快解决。 本文重点举例,说明虚拟网络应用场景下出现iptables策略问题,并且通过工具反查和反证问题过程。属于一场iptables稍微深层次点的旅行吧,与君共勉。 系统环境123456root@ubuntu:~# lsb_release -aNo LSB modules are available.Distributor ID: UbuntuDescription: Ubuntu 16.04.3 LTSRelease: 16.04Codename: xenial 系统镜像名称 ubuntu-16.04.3-server-amd64.iso 组网环境: 1.1.1.1和1.1.1.2位namespace net0和net1内部地址 net0和net1经过linux bridge br0连接起来 基本组网组成脚本12345678910111213141516171819202122brctl addbr br0 ifconfig br0 1.1.2.254/24 upip addr add 1.1.1.254/24 dev br0ip link add net0_eth0 type veth peer name tap0ip netns add net0ip link set dev net0_eth0 netns net0ip netns exec net0 ip link set dev net0_eth0 name eth0ip netns exec net0 ip addr add 1.1.1.1/24 dev eth0ip netns exec net0 ip link set dev eth0 upip link set dev tap0 master br0ip link set dev tap0 upip link add net1_eth0 type veth peer name tap1ip netns add net1ip link set dev net1_eth0 netns net1ip netns exec net1 ip link set dev net1_eth0 name eth0ip netns exec net1 ip addr add 1.1.1.2/24 dev eth0ip netns exec net1 ip link set dev eth0 upip link set dev tap1 master br0ip link set dev tap1 up iptable默认策略1234567iptables -A INPUT -p icmp -m physdev --physdev-in tap0 -j LOGiptables -A FORWARD -p icmp -m physdev --physdev-in tap0 --physdev-out tap1 -j LOGiptables -A FORWARD -p icmp -m physdev --physdev-is-outiptables -A FORWARD -p icmp -m physdev --physdev-is-out --physdev-is-in --physdev-is-bridged 环境验证123456789101112root@ubuntu:~# ip netnsnet1 (id: 1)net0 (id: 0)root@ubuntu:~# ip netns exec net0 bashroot@ubuntu:~# ip r1.1.1.0/24 dev eth0 proto kernel scope link src 1.1.1.1root@ubuntu:~# ping 1.1.1.2PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.64 bytes from 1.1.1.2: icmp_seq=8 ttl=64 time=0.061 ms64 bytes from 1.1.1.2: 
icmp_seq=9 ttl=64 time=0.055 ms64 bytes from 1.1.1.2: icmp_seq=10 ttl=64 time=0.057 ms64 bytes from 1.1.1.2: icmp_seq=11 ttl=64 time=0.057 ms 环境搭建完成后,在net0中ping net1 地址,可达 自此基本环境组建完毕,然后do something (其实是在主机侧执行了iptables -P FORWARD DROP)后,发现上述环境无法连通了,what happened!下面开启问题定位。 问题定位过程步骤1: 在net0 中ping net1 ,发现不通,查看arp表项1234root@ubuntu:~# ip netns exec net0 arp -nAddress HWtype HWaddress Flags Mask Iface1.1.1.2 ether 26:4c:48:39:09:21 C eth0root@ubuntu:~# 可学到arp表项,代表arp数据包相互可达,但是ping包仍然不通。 ip协议属于更上层协议,icmp不通可能性有两种: 1:对方不支持ping 2:中间被阻隔了 对于可能性1,很显然可能性不大,因为一开始还是好的,而且还没有听说关闭ping有什么意义 对于可能性2,是有可能发生的 步骤2: 可能性2验证 上述1 、2 、3、4定义为4个数据包抓取点 4点抓包1234tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes^C13:54:32.427376 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.1, length 2813:54:32.427392 26:4c:48:39:09:21 > c2:a3:10:45:e4:65, ethertype ARP (0x0806), length 42: Reply 1.1.1.2 is-at 26:4c:48:39:09:21, length 28 未收到icmp,收到了arp 2点抓包123456789101112oot@ubuntu:~/drop_watch/src# tcpdump -i tap0 -netcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on tap0, link-type EN10MB (Ethernet), capture size 262144 bytes14:05:55.104561 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5357, seq 1, length 6414:05:56.103918 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5357, seq 2, length 6414:05:57.103702 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5357, seq 3, length 6414:05:59.454076 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5358, seq 1, length 6414:06:00.461847 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 
1.1.1.1 > 1.1.1.2: ICMP echo request, id 5358, seq 2, length 6414:06:09.373442 c2:a3:10:45:e4:65 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.1, length 2814:06:09.373461 26:4c:48:39:09:21 > c2:a3:10:45:e4:65, ethertype ARP (0x0806), length 42: Reply 1.1.1.2 is-at 26:4c:48:39:09:21, length 2814:06:09.373463 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5361, seq 1, length 6414:06:10.381635 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype IPv4 (0x0800), length 98: 1.1.1.1 > 1.1.1.2: ICMP echo request, id 5361, seq 2, length 64 收到了icmp和arp 3点抓包12345root@ubuntu:~/drop_watch/src# tcpdump -i tap1 -netcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on tap1, link-type EN10MB (Ethernet), capture size 262144 bytes14:05:25.067625 c2:a3:10:45:e4:65 > 26:4c:48:39:09:21, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.2 tell 1.1.1.1, length 2814:05:25.067636 26:4c:48:39:09:21 > c2:a3:10:45:e4:65, ethertype ARP (0x0806), length 42: Reply 1.1.1.2 is-at 26:4c:48:39:09:21, length 28 未收到icmp,收到了arp 分析结论: 数据包经过br0被丢弃了,可能和转发策略iptables有关系查看br0的转发控制策略iptables -nL -v 发现FORWARD链有丢包,且和ping数据包的行为完全吻合,停止ping包,该丢包计数也不再增长 Chain FORWARD (policy DROP 813 packets, 68292 bytes) policy DROP为关键词规则链的默认策略丢包,通过iptables -S查看123456789root@ubuntu:~/drop_watch/src# iptables -S-P INPUT ACCEPT-P FORWARD DROP-P OUTPUT ACCEPT-A INPUT -p icmp -m physdev --physdev-in tap0 -j LOG-A FORWARD -p icmp -m physdev --physdev-in tap0 --physdev-out tap1 -j LOG-A FORWARD -p icmp -m physdev --physdev-is-out-A FORWARD -p icmp -m physdev --physdev-is-in --physdev-is-out --physdev-is-bridgedroot@ubuntu:~/drop_watch/src# 我们知道,如下-P FORWARD DROP 默认情况下,转发链设置为丢包,也就是说如果不配置通过策略,此数据包会被丢弃,但是这是再懂的问题原因以后才知道的,在未定为问题以前我是这样思考的,我的天哪,明明没有任何丢弃数据包策略,数据包却被丢弃了,而且默认情况下没有策略情况数据包应该被放通才对!外加真实环境比较复杂未免想的更复杂,是不是缺驱动?…等等 本来问题定位到此就要结束了,但是真实环境下我是如下折腾的 难道是少了什么驱动等配置选项? 
看起来并没有其他策略阻隔数据包,难道是内核问题,我做了什么?drop watch查看丢包点,调用栈,确认丢包位置 12345apt-get install -y libnl-3-dev libnl-genl-3-dev binutils-dev libreadline6-devsudo apt-get install -y libnl-3-dev libnl-genl-3-dev binutils-dev libreadline6-dev gccgit clone https://github.com/pavel-odintsov/drop_watchcd drop_watch/srcmake ping数据包继续走起来,使用dropwatch 1234567891011121314151617181920212223root@ubuntu:~/drop_watch/src# ./dropwatch -l kasInitalizing kallsyms dbdropwatch> startEnabling monitoring...Kernel monitoring activated.Issue Ctrl-C to stop monitoring1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)1 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)2 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)2 drops at nf_hook_slow+b0 (0xffffffff8176d2e0)^CGot a stop messagedropwatch> Terminating dropwatch...Shutting down ... 
到此问题基本定位nf_hook_slow查看,属于forward钩子处理部分(如果不确定你可以看内核源码,或者打印出更详细的调用栈),说明数据包被丢在了钩子策略上,是我想多了…,思路转回对iptables的研究上,直到找出默认FORWARD丢包策略-P FORWARD DROP 额外说明:如果你想看到更为详细的丢包调用栈,你可以用下面方法(当然systemtap也可以做到,但是这里不再赘述)perf查看协议栈丢包点位置,并打印出调用栈 安装方法 12apt install linux-tools-common -yapt install linux-tools-4.4.0-87-generic linux-cloud-tools-generic -y 使用方法: 步骤1:抓取: 1sudo perf record -g -a -e skb:kfree_skb 步骤2:分析: 1sudo perf script eg: 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869root@ubuntu:~/drop_watch/src# sudo perf record -g -a -e skb:kfree_skb^C[ perf record: Woken up 1 times to write data ][ perf record: Captured and wrote 0.143 MB perf.data (12 samples) ]root@ubuntu:~/drop_watch/src# sudo perf scriptping 1811 [000] 301.864213: skb:kfree_skb: skbaddr=0xffff8800d59bb400 protocol=2048 location=0xffffffff8176d2e0 921b1a kfree_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 96d2e0 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) f7c br_nf_forward_ip ([br_netfilter]) 96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 3264 __br_forward ([bridge]) 3797 br_forward ([bridge]) 48d0 br_handle_frame_finish ([bridge]) 347 NF_HOOK_THRESH ([br_netfilter]) 1239 br_nf_pre_routing_finish ([br_netfilter]) 1fd1 br_nf_pre_routing ([br_netfilter]) 96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 4d2a br_handle_frame ([bridge]) 936424 __netif_receive_skb_core (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 936b38 __netif_receive_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 937938 process_backlog (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 93707e net_rx_action (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 285e11 __do_softirq (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) a43bcc do_softirq_own_stack 
(/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 285858 do_softirq.part.19 (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 2858dd __local_bh_enable_ip (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 978c89 ip_finish_output2 (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 979c16 ip_finish_output (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 97a61e ip_output (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 979de5 ip_local_out (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 97afe9 ip_send_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 97b043 ip_push_pending_frames (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 9a16b3 raw_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 9b14c5 inet_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 919ad8 sock_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 91a581 ___sys_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 91aed1 __sys_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 91af22 sys_sendmsg (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) a41eb2 entry_SYSCALL_64_fastpath (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 108490 sendmsg (/lib/x86_64-linux-gnu/libc-2.23.so) 0 [unknown] ([unknown])ping 1811 [000] 302.863683: skb:kfree_skb: skbaddr=0xffff8800d59bb400 protocol=2048 location=0xffffffff8176d2e0 921b1a kfree_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 96d2e0 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) f7c br_nf_forward_ip ([br_netfilter]) 96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 3264 __br_forward ([bridge]) 3797 br_forward ([bridge]) 48d0 br_handle_frame_finish ([bridge]) 347 NF_HOOK_THRESH ([br_netfilter]) 1239 br_nf_pre_routing_finish ([br_netfilter]) 1fd1 br_nf_pre_routing ([br_netfilter]) 96d212 nf_iterate (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 96d2a3 nf_hook_slow (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 4d2a br_handle_frame ([bridge]) 936424 __netif_receive_skb_core 
(/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 936b38 __netif_receive_skb (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 937938 process_backlog (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 93707e net_rx_action (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 285e11 __do_softirq (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) a43bcc do_softirq_own_stack (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic) 285858 do_softirq.part.19 (/usr/lib/debug/boot/vmlinux-4.4.0-87-generic)root@ubuntu:~/drop_watch/src#","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"iptables","slug":"iptables","permalink":"http://vcpu.me/tags/iptables/"},{"name":"linuxbridge","slug":"linuxbridge","permalink":"http://vcpu.me/tags/linuxbridge/"},{"name":"dropwatch","slug":"dropwatch","permalink":"http://vcpu.me/tags/dropwatch/"},{"name":"perf","slug":"perf","permalink":"http://vcpu.me/tags/perf/"},{"name":"ubuntun16.04.3","slug":"ubuntun16-04-3","permalink":"http://vcpu.me/tags/ubuntun16-04-3/"}]},{"title":"addr2line如何根据地址找到行号?","slug":"addr2line","date":"2018-01-23T10:00:00.000Z","updated":"2018-02-14T09:50:14.000Z","comments":true,"path":"addr2line/","link":"","permalink":"http://vcpu.me/addr2line/","excerpt":"针对于linux平台 下,c程序使用场景举例:123456789101112131415161718192021222324root@ubuntu:~# cat ./mem_test.c#include <stdio.h>#include <stdlib.h>#include <unistd.h>int main(int argc, char *argv[]){ char *p1; char *p2; char *p3; char *p4; sleep(20);//让程序sleep 20s是因为我们程序先起来之后,等待SystemTap启动设置探测点 p1 = malloc(500); p2 = malloc(200); p3 = malloc(300); p4 = malloc(300);//泄漏 free(p1); free(p2); free(p3); free(p2);//重复释放 printf(\"p1: %p, p2: %p, p3: %p, p4: %p\\n\", p1, p2, p3, p4); return 0;}","text":"针对于linux平台 下,c程序使用场景举例:123456789101112131415161718192021222324root@ubuntu:~# cat ./mem_test.c#include <stdio.h>#include <stdlib.h>#include <unistd.h>int main(int argc, char *argv[]){ char *p1; char *p2; char *p3; char *p4; sleep(20);//让程序sleep 
20s是因为我们程序先起来之后,等待SystemTap启动设置探测点 p1 = malloc(500); p2 = malloc(200); p3 = malloc(300); p4 = malloc(300);//泄漏 free(p1); free(p2); free(p3); free(p2);//重复释放 printf(\"p1: %p, p2: %p, p3: %p, p4: %p\\n\", p1, p2, p3, p4); return 0;} 编译运行12gcc -g mem_test.c -o main./main& 运行结果:1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950root@ubuntu:~# *** Error in `./main': double free or corruption (!prev): 0x0000000000b71210 ***======= Backtrace: =========/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f1cebfee7e5]/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f1cebff737a]/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f1cebffb53c]./main[0x400677]/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f1cebf97830]./main[0x400529]======= Memory map: ========00400000-00401000 r-xp 00000000 fc:00 526710 /root/main00600000-00601000 r--p 00000000 fc:00 526710 /root/main00601000-00602000 rw-p 00001000 fc:00 526710 /root/main00b71000-00b92000 rw-p 00000000 00:00 0 [heap]7f1ce4000000-7f1ce4021000 rw-p 00000000 00:00 07f1ce4021000-7f1ce8000000 ---p 00000000 00:00 07f1cebd61000-7f1cebd77000 r-xp 00000000 fc:00 786953 /lib/x86_64-linux-gnu/libgcc_s.so.17f1cebd77000-7f1cebf76000 ---p 00016000 fc:00 786953 /lib/x86_64-linux-gnu/libgcc_s.so.17f1cebf76000-7f1cebf77000 rw-p 00015000 fc:00 786953 /lib/x86_64-linux-gnu/libgcc_s.so.17f1cebf77000-7f1cec137000 r-xp 00000000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1cec137000-7f1cec337000 ---p 001c0000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1cec337000-7f1cec33b000 r--p 001c0000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1cec33b000-7f1cec33d000 rw-p 001c4000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1cec33d000-7f1cec341000 rw-p 00000000 00:00 07f1cec341000-7f1cec367000 r-xp 00000000 fc:00 791163 /lib/x86_64-linux-gnu/ld-2.23.so7f1cec557000-7f1cec55a000 rw-p 00000000 00:00 07f1cec563000-7f1cec566000 rw-p 00000000 00:00 07f1cec566000-7f1cec567000 r--p 00025000 fc:00 791163 
/lib/x86_64-linux-gnu/ld-2.23.so7f1cec567000-7f1cec568000 rw-p 00026000 fc:00 791163 /lib/x86_64-linux-gnu/ld-2.23.so7f1cec568000-7f1cec569000 rw-p 00000000 00:00 07fff0e000000-7fff0e021000 rw-p 00000000 00:00 0 [stack]7fff0e09b000-7fff0e09d000 r--p 00000000 00:00 0 [vvar]7fff0e09d000-7fff0e09f000 r-xp 00000000 00:00 0 [vdso]ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall][1]+ Aborted (core dumped) ./main 0x400677有问题,那么它对应那一行? 如何查找0x400677对应哪一行123root@ubuntu:~# addr2line -e main 0x400677/root/mem_test.c:21 分析如上我们发现21行是重复释放导致","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"addr2line","slug":"addr2line","permalink":"http://vcpu.me/tags/addr2line/"},{"name":"反汇编","slug":"反汇编","permalink":"http://vcpu.me/tags/反汇编/"}]},{"title":"valgrind打印程序调用栈","slug":"valgrind","date":"2018-01-15T10:00:00.000Z","updated":"2018-02-11T16:33:36.000Z","comments":true,"path":"valgrind/","link":"","permalink":"http://vcpu.me/valgrind/","excerpt":"使用valgrind调试,他功能有很多,但是真正性能调试时候,外加valgrind运行程序时,应用程序本身性能发挥不全,目前对于我来讲还是调用栈打印最实用 通过多次打调用栈,可得知应用程序忙在哪里,确认优化的方向","text":"使用valgrind调试,他功能有很多,但是真正性能调试时候,外加valgrind运行程序时,应用程序本身性能发挥不全,目前对于我来讲还是调用栈打印最实用 通过多次打调用栈,可得知应用程序忙在哪里,确认优化的方向 使用valgrind开启程序12345678[root@localhost ~]# valgrind --tool=callgrind ./server==1789== Callgrind, a call-graph generating cache profiler==1789== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.==1789== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info==1789== Command: ./server==1789====1789== For interactive control, run 'callgrind_control -h'.listening... 
打印调用栈callgrind_control -e -b 12345678910111213141516[root@localhost ~]# callgrind_control -e -bPID 1789: ./serversending command status internal to pid 1789 Totals: Ir Th 1 159,440 Frame: Ir Backtrace for Thread 1 [ 0] 2,538 __select_nocancel (424 x) [ 1] 3,378 select (423 x) [ 2] 64,731 main (1 x) [ 3] 64,937 (below main) (1 x) [ 4] 65,876 _dl_runtime_resolve_sse_vex (1 x) [ 5] 65,888 _dl_runtime_resolve_avx_slow (1 x) [ 6] 65,904 0x0000000000400710 (1 x) [ 7] . 0x0000000000001170","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"valgrind","slug":"valgrind","permalink":"http://vcpu.me/tags/valgrind/"},{"name":"性能点找寻方法","slug":"性能点找寻方法","permalink":"http://vcpu.me/tags/性能点找寻方法/"},{"name":"c程序服务端用户态程序调用栈打印","slug":"c程序服务端用户态程序调用栈打印","permalink":"http://vcpu.me/tags/c程序服务端用户态程序调用栈打印/"},{"name":"一个强大的用户态程序调优方法","slug":"一个强大的用户态程序调优方法","permalink":"http://vcpu.me/tags/一个强大的用户态程序调优方法/"}]},{"title":"strace跟踪程序执行时间","slug":"exectime_for_c_proess","date":"2018-01-13T10:00:00.000Z","updated":"2018-02-11T16:34:06.000Z","comments":true,"path":"exectime_for_c_proess/","link":"","permalink":"http://vcpu.me/exectime_for_c_proess/","excerpt":"strace可以帮助你打印函数执行时间,帮助你查看程序消耗,针对于复杂的工作进程例如nginx或者janus,我暂且没有尝试,但是对复杂程序性能调整作用不大。 遇到复杂程序你可以考虑valgrind –tool=callgrind工具 http://valgrind.org/docs/manual/cl-manual.html#cl-manual.options.separation","text":"strace可以帮助你打印函数执行时间,帮助你查看程序消耗,针对于复杂的工作进程例如nginx或者janus,我暂且没有尝试,但是对复杂程序性能调整作用不大。 遇到复杂程序你可以考虑valgrind –tool=callgrind工具 http://valgrind.org/docs/manual/cl-manual.html#cl-manual.options.separation 服务端程序执行123456789[root@localhost ~]# ./serverlistening...hellohellohellohellohellohellohello 客户端程序执行1[root@localhost ~]# ./client 打印用户程序调用时间查找进程id123[root@localhost ~]# ps aux|grep serverroot 1636 0.0 0.0 4168 344 pts/1 S+ 07:34 0:00 ./serverroot 1642 0.0 0.0 112660 972 pts/2 S+ 07:34 0:00 grep --color=auto server strace跟踪应用程序打印出应用程序调用时间12[root@localhost ~]# strace -o 
server.strace -Ttt -p 1636strace: Process 1636 attached 查看函数结果123456789101112131415161718192021222324[root@localhost ~]# cat server.strace07:35:29.195639 select(8, [3 4], NULL, NULL, {0, 894993}) = 1 (in [4], left {0, 3979}) <0.891038>07:35:30.086907 recvfrom(4, \"hello\", 6, 0, NULL, NULL) = 5 <0.000009>07:35:30.086970 write(1, \"hello\\n\", 6) = 6 <0.000033>07:35:30.087027 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 <0.000007>07:35:30.087066 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 <0.000007>07:35:30.087091 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000006>07:35:30.087112 nanosleep({1, 0}, 0x7fff50f89b90) = 0 <1.000453>07:35:31.087646 sendto(4, \"hello\\0\", 6, 0, NULL, 0) = 6 <0.000099>07:35:31.087817 select(8, [3 4], NULL, NULL, {1, 0}) = 1 (in [4], left {0, 0}) <1.004789>07:35:32.092709 recvfrom(4, \"hello\", 6, 0, NULL, NULL) = 5 <0.000015>07:35:32.092786 write(1, \"hello\\n\", 6) = 6 <0.000050>07:35:32.092869 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 <0.000026>07:35:32.092924 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 <0.000006>07:35:32.092948 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000010>07:35:32.093035 nanosleep({1, 0}, 0x7fff50f89b90) = 0 <1.000253>07:35:33.093365 sendto(4, \"hello\\0\", 6, 0, NULL, 0) = 6 <0.000062>07:35:33.093471 select(8, [3 4], NULL, NULL, {1, 0}) = 1 (in [4], left {0, 554}) <0.999464>07:35:34.092994 recvfrom(4, \"hello\", 6, 0, NULL, NULL) = 5 <0.000007>07:35:34.093038 write(1, \"hello\\n\", 6) = 6 <0.000025>07:35:34.093078 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 <0.000005>07:35:34.093099 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 <0.000004>07:35:34.093116 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000005>07:35:34.093132 nanosleep({1, 0}, <detached ...> 结果分析 如上得知07:35:30.087112 nanosleep({1, 0}, 0x7fff50f89b90) = 0 \\<1.000453> 耗费了1.000453 s时间 
服务端程序1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889[root@localhost ~]# cat server.c#include <stdio.h>#include <sys/socket.h>#include <netinet/in.h>#include <string.h>#define BACKLOG 5#define BUFF_SIZE 6#define PORT 9999int fd[BACKLOG];int con_num;void ha_ha(void){ sleep(1);}int main(int argc, char** argv){ int sock_fd, new_fd; struct sockaddr_in server_addr; struct sockaddr_in client_addr; socklen_t sin_size; fd_set fdsr; struct timeval tv; char buf[BUFF_SIZE+1]; int ret = 0; int i = 0; sock_fd = socket(AF_INET, SOCK_STREAM, 0); server_addr.sin_family = AF_INET; server_addr.sin_port = htons(PORT); server_addr.sin_addr.s_addr = INADDR_ANY; memset(server_addr.sin_zero, 0, sizeof(server_addr.sin_zero)); bind(sock_fd, (struct sockaddr*)&server_addr, sizeof(server_addr)); listen(sock_fd, BACKLOG); printf(\"listening...\\n\"); while(1) { FD_ZERO(&fdsr); FD_SET(sock_fd, &fdsr); tv.tv_sec = 1; tv.tv_usec = 0; for(i = 0; i < BACKLOG; i++) { if(fd[i] != 0) { FD_SET(fd[i], &fdsr); } } select(sock_fd+BACKLOG, &fdsr, NULL, NULL, &tv); if(FD_ISSET(sock_fd, &fdsr)) { new_fd = accept(sock_fd, (struct sockaddr*)&client_addr, &sin_size); if(con_num < BACKLOG) { fd[con_num++] = new_fd; } } for(i = 0; i < con_num; i++) { if(FD_ISSET(fd[i], &fdsr)) { ret = recv(fd[i], buf, BUFF_SIZE*sizeof(char), 0); memset(&buf[ret], '\\0', 1); printf(\"%s\\n\", buf); ha_ha(); send(fd[i], buf, BUFF_SIZE, 0); } } } return 0; } 客户端程序1234567891011121314151617181920212223242526272829303132333435363738394041[root@localhost ~]# cat client.c#include <stdio.h>#include <unistd.h>#include <sys/types.h>#include <sys/socket.h>#include <netinet/in.h>#include <string.h>#include <sys/stat.h>#include <fcntl.h>#define DEST_IP \"127.0.0.1\"#define DEST_PORT 9999int main(int argc, char** argv){ int sockfd = 0; int len = 0; struct sockaddr_in dest_addr; char* msg = \"hello\"; sockfd = socket(AF_INET, 
SOCK_STREAM, 0); dest_addr.sin_family = AF_INET; dest_addr.sin_port = htons(DEST_PORT); dest_addr.sin_addr.s_addr = inet_addr(DEST_IP); memset(dest_addr.sin_zero, 0, sizeof(dest_addr.sin_zero)); connect(sockfd, (struct sockaddr*)&dest_addr, sizeof(struct sockaddr_in)); len = strlen(msg); while(1) { sleep(2); send(sockfd, msg, len, 0); } return 0;}","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"strace","slug":"strace","permalink":"http://vcpu.me/tags/strace/"},{"name":"性能点找寻","slug":"性能点找寻","permalink":"http://vcpu.me/tags/性能点找寻/"},{"name":"程序执行时间细化","slug":"程序执行时间细化","permalink":"http://vcpu.me/tags/程序执行时间细化/"},{"name":"strace使用方法","slug":"strace使用方法","permalink":"http://vcpu.me/tags/strace使用方法/"},{"name":"你可以使用的用户程序性能优化找寻方法之一","slug":"你可以使用的用户程序性能优化找寻仿佛之一","permalink":"http://vcpu.me/tags/你可以使用的用户程序性能优化找寻仿佛之一/"}]},{"title":"gdb工具打印函数调用栈","slug":"print_stack","date":"2018-01-10T10:00:00.000Z","updated":"2018-02-11T16:35:25.000Z","comments":true,"path":"print_stack/","link":"","permalink":"http://vcpu.me/print_stack/","excerpt":"可以使用该脚本打印出当前进程的调用栈,当CPU使用率100%时候,你可以使用此脚本得知应用程序在忙什么,对应用程序进程调试 当然如果你应用程序性能不足,你正无助时,可以多次使用此脚本,你可以大概得知程序调用栈所耗损的地方,找到性能调试的关键点 我曾经遇到过CPU使用率不高,服务端程序性能严重不足情况,我使用了此脚本找到了io处理的一个全局锁,在修改优化此全局锁后,大大提升了服务端程序的性能","text":"可以使用该脚本打印出当前进程的调用栈,当CPU使用率100%时候,你可以使用此脚本得知应用程序在忙什么,对应用程序进程调试 当然如果你应用程序性能不足,你正无助时,可以多次使用此脚本,你可以大概得知程序调用栈所耗损的地方,找到性能调试的关键点 我曾经遇到过CPU使用率不高,服务端程序性能严重不足情况,我使用了此脚本找到了io处理的一个全局锁,在修改优化此全局锁后,大大提升了服务端程序的性能 调用栈打印脚本12345678910111213141516171819202122232425262728293031323334353637383940414243444546[root@localhost ~]# cat pstack.sh#!/bin/shif test $# -ne 1; then echo "Usage: `basename $0 .sh` <process-id>" 1>&2 exit 1fiif test !
-r /proc/$1; then echo \"Process $1 not found.\" 1>&2 exit 1fi# GDB doesn't allow \"thread apply all bt\" when the process isn't# threaded; need to peek at the process to determine if that or the# simpler \"bt\" should be used.backtrace=\"bt\"if test -d /proc/$1/task ; then # Newer kernel; has a task/ directory. if test `/bin/ls /proc/$1/task | /usr/bin/wc -l` -gt 1 2>/dev/null ; then backtrace=\"thread apply all bt\" fielif test -f /proc/$1/maps ; then # Older kernel; go by it loading libpthread. if /bin/grep -e libpthread /proc/$1/maps > /dev/null 2>&1 ; then backtrace=\"thread apply all bt\" fifiGDB=${GDB:-/usr/bin/gdb}if $GDB -nx --quiet --batch --readnever > /dev/null 2>&1; then readnever=--readneverelse readnever=fi# Run GDB, strip out unwanted noise.$GDB --quiet $readnever -nx /proc/$1/exe $1 <<EOF 2>&1 |$backtraceEOF/bin/sed -n \\ -e 's/^(gdb) //' \\ -e '/^#/p' \\ -e '/^Thread/p' 脚本使用方法sh pstack.sh \\<pid> 使用例子: 12345[root@localhost ~]# sh pstack.sh 1471#0 0x00007f9de5afe190 in __nanosleep_nocancel () from /lib64/libc.so.6#1 0x00007f9de5afe044 in sleep () from /lib64/libc.so.6#2 0x0000000000400810 in ha_ha ()#3 0x0000000000400ae7 in main ()","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"gdb","slug":"gdb","permalink":"http://vcpu.me/tags/gdb/"},{"name":"函数调用栈","slug":"函数调用栈","permalink":"http://vcpu.me/tags/函数调用栈/"},{"name":"性能点找寻方法之一","slug":"性能点找寻方法之一","permalink":"http://vcpu.me/tags/性能点找寻方法之一/"},{"name":"用户态进程性能调整方法","slug":"用户态进程性能调整方法","permalink":"http://vcpu.me/tags/用户态进程性能调整方法/"}]},{"title":"使用mac pro 使用ansible自动监控服务器进程状态","slug":"osx-ensible","date":"2018-01-07T10:00:00.000Z","updated":"2018-02-11T14:04:07.000Z","comments":true,"path":"osx-ensible/","link":"","permalink":"http://vcpu.me/osx-ensible/","excerpt":"本文目的,因为笔者有需要观察云开发环境相关服务进程需求,因为该开发环境并没有部署相关运行监控程序,因此笔者想通过一个脚本,在使用环境前做一个基本检查 
如果服务不满足使用条件,对其进行一定调整,使用该脚本只需要花费不到1s时间,既可以达到使用目标。当然也可以通过ssh命令到具体服务器一个个观察服务进程,但需要耗费大量重复劳动时间。 场景描述mac pro osx 系统想远程登录控制controller服务器(ubuntu16.04),并且执行相关命令","text":"本文目的,因为笔者有需要观察云开发环境相关服务进程需求,因为该开发环境并没有部署相关运行监控程序,因此笔者想通过一个脚本,在使用环境前做一个基本检查 如果服务不满足使用条件,对其进行一定调整,使用该脚本只需要花费不到1s时间,既可以达到使用目标。当然也可以通过ssh命令到具体服务器一个个观察服务进程,但需要耗费大量重复劳动时间。 场景描述mac pro osx 系统想远程登录控制controller服务器(ubuntu16.04),并且执行相关命令 准备工作本地controller域名解析配置cat /private/etc/hosts 1234192.168.4.101 controller192.168.4.102 compute192.168.4.104 compute02192.168.4.103 network controller服务器支持免密登录步骤1:在controller服务器中如果不存在authorized_keys将其创建出,创建命令如下 touch /root/.ssh/authorized_keys 步骤2:找出mac pro osx系统公钥 12➜ ~ cat ~/.ssh/id_rsa.pubssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDH7uxw50HQG0SwvLCcHAgcCarLw5DP4gDqDDki/+E85STu5Di++u4F8SeHLphkMiuvqsDWZzzOwx7+H32JXDu+aev/A2a8vQ9TRnH257+n4SOjWQD07QsyDQ+U0A4I3oofXY2kul3KBeQ9f8z/2lW7yAN1AEeJ/SW+TFeNqvLlkHfGNOUsw6NmfH5uujxbhxIREB0T7kH9q+gjLVcyMgRYdCKk8fvdzWZ99w/+xuUhCkhs1kLdqgRWuqQ6iI9ZPmcZU7pJD3DDQPqIUUxzgGFRkb3SJ7ewczdKm0XV3BupmwRlEXvuS2o26zoVui7X1ndqahLdjQSH6ZGggn/w6KJX qinlong@QinlongdeMacBook-Pro.local 步骤3:将mac pro公钥放入controller服务器的root/.ssh/authorized_keys 12root@controller:~# cat /root/.ssh/authorized_keysssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDH7uxw50HQG0SwvLCcHAgcCarLw5DP4gDqDDki/+E85STu5Di++u4F8SeHLphkMiuvqsDWZzzOwx7+H32JXDu+aev/A2a8vQ9TRnH257+n4SOjWQD07QsyDQ+U0A4I3oofXY2kul3KBeQ9f8z/2lW7yAN1AEeJ/SW+TFeNqvLlkHfGNOUsw6NmfH5uujxbhxIREB0T7kH9q+gjLVcyMgRYdCKk8fvdzWZ99w/+xuUhCkhs1kLdqgRWuqQ6iI9ZPmcZU7pJD3DDQPqIUUxzgGFRkb3SJ7ewczdKm0XV3BupmwRlEXvuS2o26zoVui7X1ndqahLdjQSH6ZGggn/w6KJX qinlong@QinlongdeMacBook-Pro.local osx安装ansible安装: 1brew install ansible 安装验证: 1234567➜ ~ ansible --versionansible 2.3.0.0 config file = configured module search path = Default w/o overrides python version = 2.7.13 (default, Dec 18 2016, 07:03:39) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]➜ ~➜ ~ 配置想要访问的hosts ansible: 123➜ ~ cat /usr/local/etc/ansible/hosts[controller]controller 
ansible以root方式登录controller执行命令ls123456789101112131415161718192021222324252627282930➜ ~ ansible controller -u root -m command -a \"ls\" [WARNING]: Found both group and host with same name: controller [WARNING]: Found both group and host with same name: networkcontroller | SUCCESS | rc=0 >>1.shaadmin-openrcarxan_0916arxan_0916.tar.gzarxan-manager_1.0.0-7_all.deba.tar.gzbr-sw-set.shb.tar.gzcentec_driver.pycirros-0.3.4-x86_64-disk.imgdemodemo-openrcetcd_2.2.2_amd64.debid_rsa.pubneutron-l2-arxan-agent_2.1.1-14_all.debneutron-local-controller_2.1.1-14_all.debp.tar.gzpython-arxan_1.0.0-7_all.debpython-dragonflow_2.1.1-14_all.debpython-etcd_0.4.5-1_all.debpython-ovsdbapp_0.4.0-0ubuntu2_all.debq.tar.gzxcmdb.py 在云主机执行shell脚本程序云主机中的脚本12345678910111213141516171819root@controller:~# cat 1.sh#!/bin/bashecho $HOSTNAMEecho \"-----------------------------------------\"arr_string=(\"rabbitmq-server\" \"etcd\" )for var in ${arr_string[@]}do echo $var wc_rab=`systemctl status $var |grep active|grep running|wc -l` if [ $wc_rab -ne 1 ] then echo \" $var is down ,Now restart it\" systemctl restart $var else echo \"ok\" fidone 该脚本主要观察进程状态,如果进程非开启状态并将其重启,它观察的进程是rabbitmq-server, etcd mac pro osx执行云主机的脚本1ansible check -u root -m command -a \"bash
./1.sh\"","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"dropwatch","slug":"dropwatch","permalink":"http://vcpu.me/tags/dropwatch/"},{"name":"osx","slug":"osx","permalink":"http://vcpu.me/tags/osx/"},{"name":"ensible","slug":"ensible","permalink":"http://vcpu.me/tags/ensible/"},{"name":"云主机","slug":"云主机","permalink":"http://vcpu.me/tags/云主机/"},{"name":"自动化控制许多台远程主机","slug":"自动化控制许多台远程主机","permalink":"http://vcpu.me/tags/自动化控制许多台远程主机/"}]},{"title":"云主机发包性能提升方法","slug":"guest_tx_pps_performance","date":"2017-12-29T10:00:00.000Z","updated":"2018-02-11T16:12:11.000Z","comments":true,"path":"guest_tx_pps_performance/","link":"","permalink":"http://vcpu.me/guest_tx_pps_performance/","excerpt":"本文主要内容1.vhost-net io不差,只是我们没有用对目前云主机采用网络io方式为vhost-net,相比较于virtio和全虚拟化网卡,其数据包延迟和pps要好太多,且还存在较大优化空间。 2.去除linux bridge,让云主机直接接入br-int可提升2倍发包pps云主机发包经过linux bridge (安全组功能使用)到br-int虚拟交换机,发包极限性能16w左右,如果采用ovs流表实现安全组,云主机直接和br-int相连,云主机包性能则有可能到27-30wpps。 下文主要是对细节描述","text":"本文主要内容1.vhost-net io不差,只是我们没有用对目前云主机采用网络io方式为vhost-net,相比较于virtio和全虚拟化网卡,其数据包延迟和pps要好太多,且还存在较大优化空间。 2.去除linux bridge,让云主机直接接入br-int可提升2倍发包pps云主机发包经过linux bridge (安全组功能使用)到br-int虚拟交换机,发包极限性能16w左右,如果采用ovs流表实现安全组,云主机直接和br-int相连,云主机包性能则有可能到27-30wpps。 下文主要是对细节描述 细节描述拓扑 br-int为ovs交换机 172.16.1.3经过linux bridge连接到br-int,主要用来当客户端测试发包能力 172.16.1.12直接接入br-int,主要用来当客户端测试发包能力 172.16.1.5经过linux bridge连接到br-int,主要用来当服务端 所有云主机都是centos7.3操作系统 172.16.1.3/12位于计算节点1 172.16.1.5位于计算节点2 计算节点1和计算节点2接入到同一个物理交换机 计算节点1主机的CPU型号是Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz 结论经过测试16*vcpu 单队列情况下tx 16wpps、rx 56wpps。 单队列情况下tx发包性能还是很低的,于是采用iperf3和hping3进行tx pps测试,并采用ftrace和perf工具找出性能瓶颈。 采用iperf3测试情况下: 发包经过linux bridge,云主机的发包性能目前为15w-17w(不能再向上了,因为云主机对应的内核vhost线程CPU近%100) 去除发包节点的linux bridge后,云主机发包性能可被优化到24w-27w(应该还能向上,因为此时vhost所占用CPU还剩余至少10%) 使用hping3命令测试 发包经过linux bridge的安全组,发包性能15wpps(此时主机侧vhost内核进程100%) 
去除发包节点的linux bridge后,去除bridge后发包性能可达40wpps 经过测试可知,单队列情况下,去除linux bridge,在ovs上实现安全组功能,也许可以提升云主机的发包pps到30w-40w pps。 我们知道host主机CPU资源是很珍贵的,多队列可以提升云主机发包能力但是会占用和耗损主机CPU,所以能不开尽量不要开启多队列,因此使用ovs实现安全组+去除linuxbridge是合适的方案。 调试过程关键参数分析测试方法iperf3测试方法: 服务端:iperf3 -s -p 16001 客户端:iperf3 -u -l 16 -b 100m -t 1200 -c 172.16.1.5 -p 16001 -A 2 hping测试方法: 客户端:hping3 -c 100000000 –flood 172.16.1.5 经过linux bridge 云主机发包pps测试情况下发包云主机所在服务器 perf top记录 如上图ipt_do_table占用率较高,这部分是遍历iptable规则链的函数,做linux bridge安全组功能使用,如上图可以看到,其发包瓶颈在安全组功能。 perftop帮助我们及时发现性能瓶颈点,为我们性能优化提供非常大的便利。 此时记录vhost线程CPU占用情况 去除linux bridge发包perf top记录 如上我们可以看到,去除linux bridge后性能点已经从linux bridge转移到其它位置,在此不做再详细的分析了。 此时vhost线程CPU占用情况","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"虚拟云主机","slug":"虚拟云主机","permalink":"http://vcpu.me/tags/虚拟云主机/"},{"name":"vhost-net","slug":"vhost-net","permalink":"http://vcpu.me/tags/vhost-net/"},{"name":"tx pps","slug":"tx-pps","permalink":"http://vcpu.me/tags/tx-pps/"},{"name":"hping3","slug":"hping3","permalink":"http://vcpu.me/tags/hping3/"},{"name":"iperf3","slug":"iperf3","permalink":"http://vcpu.me/tags/iperf3/"},{"name":"perftop分析性能瓶颈","slug":"perftop分析性能瓶颈","permalink":"http://vcpu.me/tags/perftop分析性能瓶颈/"},{"name":"云主机tx性能提升两倍方法","slug":"云主机tx性能提升两倍方法","permalink":"http://vcpu.me/tags/云主机tx性能提升两倍方法/"}]},{"title":"ftrace一个系统含有的工具","slug":"ftrace","date":"2017-12-25T10:00:00.000Z","updated":"2018-02-11T15:48:10.000Z","comments":true,"path":"ftrace/","link":"","permalink":"http://vcpu.me/ftrace/","excerpt":"你可以用它分析性能,还是很强大的!! ftrace是什么?一个用来跟踪的缓冲区(内存)称为一个instance,缓冲区的大小由文件buffer_size_kb和buffer_total_size_kb文件指定。有了缓冲区,你就可以启动行为跟踪,跟踪的结果会分CPU写到缓冲区中。缓冲区的数据可以通过trace和trace_pipe两个接口读出。前者通常用于事后读,后者是个pipe,可以让你动态读。为了不影响执行过程,我更推荐前一个接口。","text":"你可以用它分析性能,还是很强大的!! 
ftrace是什么?一个用来跟踪的缓冲区(内存)称为一个instance,缓冲区的大小由文件buffer_size_kb和buffer_total_size_kb文件指定。有了缓冲区,你就可以启动行为跟踪,跟踪的结果会分CPU写到缓冲区中。缓冲区的数据可以通过trace和trace_pipe两个接口读出。前者通常用于事后读,后者是个pipe,可以让你动态读。为了不影响执行过程,我更推荐前一个接口。trace等文件的输出是综合所有CPU的,如果你关心单个CPU可以进入per_cpu目录,里面有这些文件的分CPU版本。 所以读者应该已经明白了/sys/kernel/debug/tracing这个目录本身就代表一个instance。如果你需要更多的instance,你可以进入到这个目录下面的instances目录中,创建一个任意名字的目录,那个目录中就也会有另一套buffer_size_kb啦,trace啦这些文件,那里就是另一个instance了。通过多instance,你可以隔离多个独立的跟踪任务。当然,这也很浪费内存。 向set_ftrace_pid写pid可以限制只根据某个pid的事件等。 启动事件跟踪的方法很简单: 1. 先查available_events中有哪些可以用的事件(查events目录也可以)。 2. 把那个事件的名称写进set_event,可以写多个,可以写sched:*这样的通配符 3. 通过tracing_on文件启动跟踪。启动之前可以通过比如tracing_cpumask这样的文件限制跟踪的CPU,通过set_event_pid设置跟踪的pid,或者通过其他属性进行更深入的设定。 剩下的事情就是执行跟踪程序和分析跟踪结果了 对于性能分析,我用得最多的是这个线程switch事件(还有softirq的一组事件)。因为从考量通量的角度,主业务CPU要不idle,它要不在处理业务,要不在调度。一个“不折腾”的系统,主业务进程应该每次都用完自己的时间片,如果它总用不完,要不是它实时性要求很高(主业务这种情况很少),要不是线程调度设计有问题。我们常常看到的一种模型是,由于业务在线程上安排不合理,导致一个线程刚执行一步,马上要等下一个线程完成,那个线程又执行一步,又要回来等前一个线程完成,这样CPU的时间都在切换上,整个通量就很低了。 你可以用它干什么?Function tracer 和 Function graph tracer: 跟踪函数调用。 Schedule switch tracer: 跟踪进程调度情况。 Wakeup tracer:跟踪进程的调度延迟,即高优先级进程从进入 ready 状态到获得 CPU 的延迟时间。该 tracer 只针对实时进程。 Irqsoff tracer:当中断被禁止时,系统无法响应外部事件,比如键盘和鼠标,时钟也无法产生 tick 中断。这意味着系统响应延迟,irqsoff 这个 tracer 能够跟踪并记录内核中哪些函数禁止了中断,对于其中中断禁止时间最长的,irqsoff 将在 log 文件的第一行标示出来,从而使开发人员可以迅速定位造成响应延迟的罪魁祸首。 Preemptoff tracer:和前一个 tracer 类似,preemptoff tracer 跟踪并记录禁止内核抢占的函数,并清晰地显示出禁止抢占时间最长的内核函数。 Preemptirqsoff tracer: 同上,跟踪和记录禁止中断或者禁止抢占的内核函数,以及禁止时间最长的函数。 Branch tracer: 跟踪内核程序中的 likely/unlikely 分支预测命中率情况。 Branch tracer 能够记录这些分支语句有多少次预测成功。从而为优化程序提供线索。 Hardware branch tracer:利用处理器的分支跟踪能力,实现硬件级别的指令跳转记录。在 x86 上,主要利用了 BTS 这个特性。 Initcall tracer:记录系统在 boot 阶段所调用的 init call 。 Mmiotrace tracer:记录 memory map IO 的相关信息。 Power tracer:记录系统电源管理相关的信息。 Sysprof tracer:缺省情况下,sysprof tracer 每隔 1 msec 对内核进行一次采样,记录函数调用和堆栈信息。 Kernel memory tracer: 内存 tracer 主要用来跟踪 slab allocator 的分配情况。包括 kfree,kmem_cache_alloc 等 API 的调用情况,用户程序可以根据 tracer 收集到的信息分析内部碎片情况,找出内存分配最频繁的代码片断,等等。 Workqueue 
statistical tracer:这是一个 statistic tracer,统计系统中所有的 workqueue 的工作情况,比如有多少个 work 被插入 workqueue,多少个已经被执行等。开发人员可以以此来决定具体的 workqueue 实现,比如是使用 single threaded workqueue 还是 per cpu workqueue. Event tracer: 跟踪系统事件,比如 timer,系统调用,中断等。 查看sshd进程的CPU调度情况进入ftrace目录 cd /sys/kernel/debug/tracing 关闭ftrace开关,清除历史结果 12echo 0 > tracing_onecho > trace 开启sched_switch CPU调度监听事件 12echo 'next_comm ~ \"sshd\"' > events/sched/sched_switch/filterecho sched:sched_switch > set_event 开始分析 1echo 1 > tracing_on 实时查看结果 1cat trace_pipe 查看最终结果 1cat trace 查看当前系统的中断事件情况关闭ftrace开关,清除历史结果 12echo 0 > tracing_onecho > trace 开启ftrace irq监听事件 123echo nop > current_tracer echo irq > set_event echo 1 > tracing_on 1cat trace_pipe 查看最终结果 1cat trace 查看当前内核函数的调用情况123456echo > set_event echo 0 > tracing\\_on echo > trace echo function_graph > current_tracer echo 1 > tracing_on cat trace_pipe","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"ftrace","slug":"ftrace","permalink":"http://vcpu.me/tags/ftrace/"},{"name":"cpu调度情况查看","slug":"cpu调度情况查看","permalink":"http://vcpu.me/tags/cpu调度情况查看/"},{"name":"内核函数调用","slug":"内核函数调用","permalink":"http://vcpu.me/tags/内核函数调用/"}]},{"title":"ubuntu16.04内核代码编译升级步骤","slug":"ubuntu_soure_make","date":"2017-12-22T10:00:00.000Z","updated":"2018-02-26T10:21:55.000Z","comments":true,"path":"ubuntu_soure_make/","link":"","permalink":"http://vcpu.me/ubuntu_soure_make/","excerpt":"笔者最近在ubuntu基础平台上做开发,因此对于基础平台的探究再从新来一遍吧,下面是基础平台编译升级内核的方法。","text":"笔者最近在ubuntu基础平台上做开发,因此对于基础平台的探究再从新来一遍吧,下面是基础平台编译升级内核的方法。 下载源代码1234567891011121314151617181920212223242526272829303132333435root@ubuntu:~# apt-cache search linux-sourcelinux-source - Linux kernel source with Ubuntu patcheslinux-source-4.4.0 - Linux kernel source for version 4.4.0 with Ubuntu patcheslinux-source-4.10.0 - Linux kernel source for version 4.10.0 with Ubuntu patcheslinux-source-4.11.0 - Linux kernel source for version 4.11.0 with Ubuntu 
patcheslinux-source-4.13.0 - Linux kernel source for version 4.13.0 with Ubuntu patcheslinux-source-4.8.0 - Linux kernel source for version 4.8.0 with Ubuntu patchesroot@ubuntu:~#root@ubuntu:~#root@ubuntu:~#root@ubuntu:~# uname -r4.4.0-62-genericroot@ubuntu:~#root@ubuntu:~# apt-get install linux-source-4.4.0Reading package lists... DoneBuilding dependency treeReading state information... DoneThe following packages were automatically installed and are no longer required: systemtap-common systemtap-runtimeUse 'apt autoremove' to remove them.Suggested packages: libncurses-dev | ncurses-dev kernel-package libqt3-devThe following NEW packages will be installed: linux-source-4.4.00 upgraded, 1 newly installed, 0 to remove and 153 not upgraded.Need to get 112 MB of archives.After this operation, 130 MB of additional disk space will be used.Get:1 http://mirrors.aliyun.com/ubuntu xenial-updates/main amd64 linux-source-4.4.0 all 4.4.0-109.132 [112 MB]Fetched 112 MB in 5min 2s (371 kB/s)Selecting previously unselected package linux-source-4.4.0.(Reading database ... 
113915 files and directories currently installed.)Preparing to unpack .../linux-source-4.4.0_4.4.0-109.132_all.deb ...Unpacking linux-source-4.4.0 (4.4.0-109.132) ...Setting up linux-source-4.4.0 (4.4.0-109.132) ...root@ubuntu:~# 源代码下载目录123root@ubuntu:/usr/src# ls /usr/src/linux-source-4.4.0debian debian.master linux-source-4.4.0.tar.bz2root@ubuntu:/usr/src# 编译apt-get install libncurses5-dev libncursesw5-dev 1234make menuconfigmake -j4make modules_installmake install 更新引导项123456root@ubuntu:/usr/src# cd /boot/grub/root@ubuntu:/boot/grub# update-grub2Generating grub configuration file ...Found linux image: /boot/vmlinuz-4.4.0-62-genericFound initrd image: /boot/initrd.img-4.4.0-62-genericdone 重启1reboot","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"ubuntu16.04.2","slug":"ubuntu16-04-2","permalink":"http://vcpu.me/tags/ubuntu16-04-2/"},{"name":"内核代码获取","slug":"内核代码获取","permalink":"http://vcpu.me/tags/内核代码获取/"},{"name":"内核代码编译","slug":"内核代码编译","permalink":"http://vcpu.me/tags/内核代码编译/"}]},{"title":"云主机TCP PPS测试方法","slug":"tcp_pps_test","date":"2017-12-11T10:00:00.000Z","updated":"2018-06-12T07:29:49.000Z","comments":true,"path":"tcp_pps_test/","link":"","permalink":"http://vcpu.me/tcp_pps_test/","excerpt":"","text":"废话连篇开端: 阿里云官方提供的测试例,用的是udp协议进行pps测试,其实这种方法是有道理的,udp pps更纯粹也更能体现出pps能力,适合于通用测试场景。 但是使用后云主机业务基本都是TCP协议,如果能采用TCP协议进行PPS测试,可谓是更接近用户使用性能,也更能发现云主机和云主机所在host主机性能瓶颈进行调优工作。 作为研发工作者,我认为通用场景也要做,但是仅仅通用场景并不能说明问题,最终目标是为了满足使用者使用要求,所以测试pps一定不能武断的只使用udp进行测试。 额外想说的是:“在调试前只要方法有效,适用于当前场景业务的调优方法都应该被考虑进去,调优不仅仅技术活,应该增强其实践性” 下面是尝过痛楚才考虑也应该使用tcp 进行pps测试,写出其中一种测试方法共勉之。 步骤1:测试机器和被测试机器安装12345678910git clone https://github.com/esnet/iperf.gitcd iperfgit checkout 3.1-STABLE./configure --prefix=/root/iperfhomemakemake install$ sudo vi /etc/ld.so.conf/root/iperfhome/lib# ldconfig# ldconfig -p |grep iperf 注意:因为涉及到编译和git下载,需要预先安装git和gcc;centos系统可执行 yum install gcc git -y 
步骤2:服务端开启iperf程序123./iperfhome/bin/iperf3 -p 8181 -f M -i 3 -B 0.0.0.0 -V --logfile /tmp/iperf.log -s -Dtail -f -n 1 /tmp/iperf.log 说明:这种在后台启动服务端,尽量做到对主机侧影响较小 步骤3:客户端开始流量测试1./iperfhome/bin/iperf3 -c xxx.xxx.xxx.xxx -p 8181 -b 0 -t 100 -P 64 -i 3 -M 90 注意⚠️:可以指定并行度,测试tcp或udp, 缓冲区的大小,TCP窗口的大小,MTU大小,堵塞控制算法 步骤4:观察云主机的pps情况sar -n DEV 1 10000","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"TCP PPS","slug":"TCP-PPS","permalink":"http://vcpu.me/tags/TCP-PPS/"}]},{"title":"systemtap能做什么?第一篇","slug":"systemtap-skills","date":"2017-12-10T10:00:00.000Z","updated":"2018-02-11T15:18:53.000Z","comments":true,"path":"systemtap-skills/","link":"","permalink":"http://vcpu.me/systemtap-skills/","excerpt":"systemtap是一个强大的工具,笔者本次主要是发现其能力,为后续操作使用做下积淀积累。请见下文。 probe “probe” \\<=> “探测”, 是SystemTap进行具体地收集数据的关键字。“probe point” 是probe动作的时机,也称探测点。也就是probe程序监视的某事件点,一旦侦测的事件触发了,则probe将从此处插入内核或者用户进程中。 探测点语法123456kernel.function(PATTERN)kernel.function(PATTERN).callkernel.function(PATTERN).returnkernel.function(PATTERN).return.maxactive(VALUE)kernel.function(PATTERN).inlinekernel.function(PATTERN).label(PATTERN) return 返回点探测 return.maxactive(VALUE)修饰return,控制同时探测多少个实例,默认足够一般不用,如果出现了跳过探测现象且很多,可以使用此参数,提升探测效果 .call 函数被调用时触发此调用点 .inline 内联函数需要展示时候用此参数 .label 内核常常用到goto函数,用此标签可以探测出具体的goto返回点","text":"systemtap是一个强大的工具,笔者本次主要是发现其能力,为后续操作使用做下积淀积累。请见下文。 probe “probe” \\<=> “探测”, 是SystemTap进行具体地收集数据的关键字。“probe point” 是probe动作的时机,也称探测点。也就是probe程序监视的某事件点,一旦侦测的事件触发了,则probe将从此处插入内核或者用户进程中。 探测点语法123456kernel.function(PATTERN)kernel.function(PATTERN).callkernel.function(PATTERN).returnkernel.function(PATTERN).return.maxactive(VALUE)kernel.function(PATTERN).inlinekernel.function(PATTERN).label(PATTERN) return 返回点探测 return.maxactive(VALUE)修饰return,控制同时探测多少个实例,默认足够一般不用,如果出现了跳过探测现象且很多,可以使用此参数,提升探测效果 .call 函数被调用时触发此调用点 .inline 内联函数需要展示时候用此参数 .label 内核常常用到goto函数,用此标签可以探测出具体的goto返回点
1234module(MPATTERN).function(PATTERN)moudle(MPATTERN).function(PATTERN).callmoudle(MPATTERN).function(PATTERN).return.maxactive(VALUE)moudle(MPATTERN).function(PATTERN).inline 12kernel.statement(PATTERN)kernel.statement(ADDRESS).absolute statement定位到具体的line或者函数,将这些定位点作为跟踪点 1moudle(MPATTERN).statement(PATTERN) 12345process(PROCESSPATH).function(PATTERN)process(PROCESSPATH).function(PATTERN).callprocess(PROCESSPATH).function(PATTERN).returnprocess(PROCESSPATH).function(PATTERN).inlineprocess(PROCESSPATH).statement(PATTERN) PATTERNfunc[@file] func@file:linenumber eg: 123kernel.function(\"*int*\")kernel.function(\"*\")kernel.function(\"__netif_receive_skb_core\") 我当前内核有哪些函数?12root@ubuntu:~# stap -l 'kernel.function(\"*\")'|grep __netif_receive_skb_corekernel.function(\"__netif_receive_skb_core@/build/linux-W6HB68/linux-4.4.0/net/core/dev.c:3828\") 我当前内核有哪些变量?12root@ubuntu:~# stap -L 'kernel.function(\"__netif_receive_skb_core\")'kernel.function(\"__netif_receive_skb_core@/build/linux-W6HB68/linux-4.4.0/net/core/dev.c:3828\") $skb:struct sk_buff* $pfmemalloc:bool 我想知道__netif_receive_skb_core被调用了几次?1root@ubuntu:~# cat tanche.stp 1234567891011global count=0probe kernel.function(\"__netif_receive_skb_core\") { count++ if (count % 5 == 0) printf( \"sys_sync called %d times\\n\", count);}probe timer.ms(10000){ printf(\" %d times\\n\\n\",count);} 每收5个数据包,打印一次,如果没有收到5个数据包,且时间过了约10s,也打印一次 执行结果: 123456root@ubuntu:~# stap tanche.stp 0 timessys_sync called 5 timessys_sync called 10 timessys_sync called 15 times^Croot@ubuntu:~# 如何打印内核函数的返回值?12root@ubuntu:~# stap -e 'probe kernel.function(\"__netif_receive_skb_core\").return { printf(\"__netif_receive_skb_core return: :%d\\n\",$return) exit() }'__netif_receive_skb_core return: :0 如何使用stap知晓当前函数在哪丢包的?12345678910111213141516171819202122232425262728293031323334353637383940414243static int__netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc){ struct packet_type *ptype, *pt_prev;... 
goto out; }#ifdef CONFIG_NET_CLS_ACT if (skb->tc_verd & TC_NCLS) { skb->tc_verd = CLR_TC_NCLS(skb->tc_verd); goto ncls; }#endif if (pfmemalloc) goto skip_taps; list_for_each_entry_rcu(ptype, &ptype_all, list) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; }... if (pt_prev) { if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) goto drop; else ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); } else {drop: atomic_long_inc(&skb->dev->rx_dropped); kfree_skb(skb); /* Jamal, now you will not able to escape explaining * me how you were going to use this. :-) */ ret = NET_RX_DROP; }out: return ret;} 12stap -e 'probe kernel.function(\"__netif_receive_skb_core\").label(\"drop\") { printf(\"__netif_receive_skb_core drop\\n\") }'stap -e 'probe kernel.function(\"__netif_receive_skb_core\").label(\"out\") { printf(\"__netif_receive_skb_core out\\n\") }' linux函数使用goto处理函数返回点再正常不过了,此时如果你使用return探测丢包点,难达预期效果。只有将丢包点精准到goto语句的label标签,才可以发现丢包位置。 (之前对stap 的probe理解不深刻,也不会这样用,也不知曾为找具体丢包点挠过多少回头皮) 谁调用了__netif_receive_skb_core 收了我的数据包?首先列出__netif_receive_skb_core所在的代码位置 12root@ubuntu:~# stap -L 'kernel.function(\"__netif_receive_skb_core\")'kernel.function(\"__netif_receive_skb_core@/build/linux-W6HB68/linux-4.4.0/net/core/dev.c:3828\") $skb:struct sk_buff* $pfmemalloc:bool 1stap -e 'probe kernel.statement(\"*@net/core/dev.c:3829\") {printf(\" : %s\\n\", execname()) }' qemu虚拟机进程创建出了错,该怎么办?123456789101112131415161718root@compute-001:~# cat qumetanche.stpprobe begin { printf(\"start moniting qemu clone syscall...\\n\")}probe kernel.function(\"sys_clone\") { if (execname() == \"qemu-system-x86\") { printf(\"sys_clone : %s\\n\", execname()) }}probe kernel.function(\"sys_clone\").return { if (execname() == \"qemu-system-x86\") { printf(\"sys_clone_return : %s, %d\\n\", execname(), $return) if ($return < 0) printf(\"[error]sys_clone_return : %s, %d\\n\", execname(), $return) }} 进程在创建时候,会调用sys_clone,通过过滤调用该函数的执行者,可以定位到qemu-system-x86服务建立的进程,然后跟踪建立过程中返回值确认创建失败原因
12345678910111213141516171819root@compute-001:~# stap qumetanche.stpstart moniting qemu clone syscall...sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 29946sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 29947sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 29948sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 29949sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 29950sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 29951sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 29952sys_clone : qemu-system-x86sys_clone_return : qemu-system-x86, 30088sys_clone : qemu-system-x86 我想研究一个内核(信号处理过程)过程该怎么做?步骤1:找到sys_kill函数所在的源文件12root@ubuntu:~# stap -l 'kernel.function(\"sys_kill\")'kernel.function(\"SyS_kill@/build/linux-W6HB68/linux-4.4.0/kernel/signal.c:2847\") 得知信号处理函数所在的内核源代码路径为:/build/linux-W6HB68/linux-4.4.0/kernel/signal.c 因为不同平台会有不同的内核路径,首先要做的事情就是,先找到具体内核路径 步骤2:根据步骤1找到的内核路径,探测该路径下所有的函数的调用和返回并打出结果-x -x PID sets target() to PID, 脚本里会用到此参数 123456789101112131415161718192021222324252627282930root@ubuntu:~# stap -x 15365 signal.stpWARNING: function signals_init is in blacklisted section: keyword at signal.stp:5:1 source: probe kernel.function(\"*@/build/linux-W6HB68/linux-4.4.0/kernel/signal.c\").call { ^WARNING: function setup_print_fatal_signals is in blacklisted section: keyword at :5:1 source: probe kernel.function(\"*@/build/linux-W6HB68/linux-4.4.0/kernel/signal.c\").call { ^begin 0 bash(15365): -> get_signal,pid() 15365 target() 15365 5 bash(15365): -> get_signal,pid() 15365 target() 15365 8 bash(15365): -> dequeue_signal,pid() 15365 target() 15365 9 bash(15365): -> dequeue_signal,pid() 15365 target() 15365 11 bash(15365): -> __dequeue_signal,pid() 15365 target() 15365 13 bash(15365): -> __dequeue_signal,pid() 15365 target() 15365 14 bash(15365): -> __dequeue_signal,pid() 15365 target() 15365 16 bash(15365): -> __dequeue_signal,pid() 15365 target() 15365 18 bash(15365): 
-> __sigqueue_free,pid() 15365 target() 15365 19 bash(15365): -> __sigqueue_free,pid() 15365 target() 15365 22 bash(15365): -> recalc_sigpending,pid() 15365 target() 15365 23 bash(15365): -> recalc_sigpending,pid() 15365 target() 15365 25 bash(15365): -> recalc_sigpending_tsk,pid() 15365 target() 15365 27 bash(15365): -> recalc_sigpending_tsk,pid() 15365 target() 15365 31 bash(15365): -> signal_setup_done,pid() 15365 target() 15365 33 bash(15365): -> signal_setup_done,pid() 15365 target() 15365 34 bash(15365): -> __set_current_blocked,pid() 15365 target() 15365 36 bash(15365): -> __set_current_blocked,pid() 15365 target() 15365 38 bash(15365): -> __set_task_blocked,pid() 15365 target() 15365 39 bash(15365): -> __set_task_blocked,pid() 15365 target() 15365 41 bash(15365): -> recalc_sigpending,pid() 15365 target() 15365 有人用stap分析内存泄漏和重复释放,记录下来吧步骤1:分析要要用的c语言源码,并编译123456789101112131415161718192021222324root@ubuntu:~# cat mem_test.c#include <stdio.h>#include <stdlib.h>#include <unistd.h>int main(int argc, char *argv[]){ char *p1; char *p2; char *p3; char *p4; sleep(20);//让程序sleep 20s是因为我们程序先起来之后,等待SystemTap启动设置探测点 p1 = malloc(500); p2 = malloc(200); p3 = malloc(300); p4 = malloc(300);//泄漏 free(p1); free(p2); free(p3); free(p2);//重复释放 printf(\"p1: %p, p2: %p, p3: %p, p4: %p\\n\", p1, p2, p3, p4); return 0;} 编译: 1gcc -g mem_test.c -o main 步骤2: 探测内存泄漏和重复释放的脚本123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051root@ubuntu:~# cat mem.stpprobe begin { printf(\"=============begin============\\n\")}//记录内存分配和释放的计数关联数组global g_mem_ref_tbl//记录内存分配和释放的调用堆栈关联数组global g_mem_bt_tblprobe process(\"/lib/x86_64-linux-gnu/libc.so.6\").function(\"__libc_malloc\").return, process(\"/lib/x86_64-linux-gnu/libc.so.6\").function(\"__libc_calloc\").return { if (target() == pid()) { if (g_mem_ref_tbl[$return] == 0) { g_mem_ref_tbl[$return]++ g_mem_bt_tbl[$return] = sprint_ubacktrace() } }}probe 
process(\"/lib/x86_64-linux-gnu/libc.so.6\").function(\"__libc_free\").call { if (target() == pid()) { g_mem_ref_tbl[$mem]-- if (g_mem_ref_tbl[$mem] == 0) { if ($mem != 0) { //记录上次释放的调用堆栈 g_mem_bt_tbl[$mem] = sprint_ubacktrace() } } else if (g_mem_ref_tbl[$mem] < 0 && $mem != 0) { //如果调用free已经失衡,那就出现了重复释放内存的问题,这里输出当前调用堆栈,以及这个地址上次释放的调用堆栈 printf(\"----------------------------------------------\\n\") printf(\"[%p] has been freed : %d\\n\", $mem, g_mem_ref_tbl[$mem]) printf(\"who free this memory at error moment ? you can see the stack \") print_ubacktrace() printf(\"haha,The memory has been freed by : \\n\") printf(\"%s\\n\", g_mem_bt_tbl[$mem]) printf(\"----------------------------------------------\\n\") } }}probe end { //最后输出产生泄漏的内存是在哪里分配的 printf(\"=============end============\\n\") foreach(mem in g_mem_ref_tbl) { if (g_mem_ref_tbl[mem] > 0) { printf(\"[%p] is not free ,but you malloc it in %s ,so This is memory Loss!!!!\\n\", mem, g_mem_bt_tbl[mem]) } }} 脚本分析:脚本主要记录下内存申请点和释放点,这样就可以很容易找到内存重复释放和泄漏点了 步骤3: 探测过程123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566root@ubuntu:~# vim mem.stproot@ubuntu:~# ./main &[1] 18904root@ubuntu:~# stap -x 18904 mem.stp=============begin============p1: 0xa28010, p2: 0xa28210, p3: 0xa282e0, p4: 0xa28420*** Error in `./main': double free or corruption (!prev): 0x0000000000a28210 ***======= Backtrace: =========/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f1d146897e5]/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f1d1469237a]/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f1d1469653c]./main[0x40069c]/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f1d14632830]./main[0x400529]======= Memory map: ========00400000-00401000 r-xp 00000000 fc:00 526710 /root/main00600000-00601000 r--p 00000000 fc:00 526710 /root/main00601000-00602000 rw-p 00001000 fc:00 526710 /root/main00a28000-00a49000 rw-p 00000000 00:00 0 [heap]7f1d10000000-7f1d10021000 rw-p 00000000 00:00 
07f1d10021000-7f1d14000000 ---p 00000000 00:00 07f1d143fc000-7f1d14412000 r-xp 00000000 fc:00 786953 /lib/x86_64-linux-gnu/libgcc_s.so.17f1d14412000-7f1d14611000 ---p 00016000 fc:00 786953 /lib/x86_64-linux-gnu/libgcc_s.so.17f1d14611000-7f1d14612000 rw-p 00015000 fc:00 786953 /lib/x86_64-linux-gnu/libgcc_s.so.17f1d14612000-7f1d147d2000 r-xp 00000000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1d147d2000-7f1d149d2000 ---p 001c0000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1d149d2000-7f1d149d6000 r--p 001c0000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1d149d6000-7f1d149d8000 rw-p 001c4000 fc:00 791185 /lib/x86_64-linux-gnu/libc-2.23.so7f1d149d8000-7f1d149dc000 rw-p 00000000 00:00 07f1d149dc000-7f1d14a02000 r-xp 00000000 fc:00 791163 /lib/x86_64-linux-gnu/ld-2.23.so7f1d14bf2000-7f1d14bf5000 rw-p 00000000 00:00 07f1d14bfe000-7f1d14c01000 rw-p 00000000 00:00 07f1d14c01000-7f1d14c02000 r--p 00025000 fc:00 791163 /lib/x86_64-linux-gnu/ld-2.23.so7f1d14c02000-7f1d14c03000 rw-p 00026000 fc:00 791163 /lib/x86_64-linux-gnu/ld-2.23.so7f1d14c03000-7f1d14c04000 rw-p 00000000 00:00 07ffc8d97f000-7ffc8d9a0000 rw-p 00000000 00:00 0 [stack]7ffc8d9d3000-7ffc8d9d5000 r--p 00000000 00:00 0 [vvar]7ffc8d9d5000-7ffc8d9d7000 r-xp 00000000 00:00 0 [vdso]7fffffffe000-7ffffffff000 --xp 00000000 00:00 0 [uprobes]ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]WARNING: Missing unwind data for a module, rerun with 'stap -d /root/main'WARNING: Missing unwind data for a module, rerun with 'stap -d /lib/x86_64-linux-gnu/ld-2.23.so'----------------------------------------------[0xa28210] has been freed : -1who free this memory at error moment ? 
you can see the stack 0x7f1d146964f0 : free+0x0/0x1d0 [/lib/x86_64-linux-gnu/libc-2.23.so] 0x40069c [/root/main+0x69c/0x1000]haha,The memory has been freed by :free+0x0 [libc-2.23.so]0x400684 [main+0x684]----------------------------------------------^C=============end============[0xa28420] is not free ,but you malloc it in 0x400643 [main+0x643] ,so This is memory Loss!!!![0xa28560] is not free ,but you malloc it in _IO_file_doallocate+0x55 [libc-2.23.so]_IO_doallocbuf+0x34 [libc-2.23.so]_IO_file_overflow@@GLIBC_2.2.5+0x1c8 [libc-2.23.so]_IO_file_xsputn@@GLIBC_2.2.5+0xad [libc-2.23.so]_IO_vfprintf+0xd1 [libc-2.23.so]printf+0x99 [libc-2.23.so]0x40066c [main+0x66c] ,so This is memory Loss!!!![0x7f1d100008c0] is not free ,but you malloc it in 0x7f1d149f8f5a [ld-2.23.so+0x1cf5a] ,so This is memory Loss!!!![0x7f1d100008f0] is not free ,but you malloc it in 0x7f1d149e7bf6 [ld-2.23.so+0xbbf6] ,so This is memory Loss!!!![0x7f1d10000da0] is not free ,but you malloc it in 0x7f1d149e7ef4 [ld-2.23.so+0xbef4] ,so This is memory Loss!!!![0x7f1d10000dd0] is not free ,but you malloc it in 0x7f1d149ea737 [ld-2.23.so+0xe737] ,so This is memory Loss!!!![0x7f1d10000e10] is not free ,but you malloc it in 0x7f1d149ee0be [ld-2.23.so+0x120be] ,so This is memory Loss!!!![1]+ Aborted (core dumped) ./main 步骤4:结果分析:a. 申请内存地址如下1p1: 0xa28010, p2: 0xa28210, p3: 0xa282e0, p4: 0xa28420 b. 0xa28210 p2 重复释放,重复释放位置 0x40069c [/root/main+0x69c/0x1000] ,因为它已经在0x400684 [main+0x684] 释放过了123456[0xa28210] has been freed : -1who free this memory at error moment ? you can see the stack 0x7f1d146964f0 : free+0x0/0x1d0 [/lib/x86_64-linux-gnu/libc-2.23.so] 0x40069c [/root/main+0x69c/0x1000]haha,The memory has been freed by :free+0x0 [libc-2.23.so]0x400684 [main+0x684] c. 0xa28420 p4内存泄漏,你在0x400643 [main+0x643] 申请了它,但是没有释放1[0xa28420] is not free ,but you malloc it in 0x400643 [main+0x643] ,so This is memory Loss!!!! 
我可以获取哪些函数和系统状态并打印出来?12345678910111213root@ubuntu:~# cat test_all_func.stp probe begin { printf(\"SystemTap scrits start\\n\");} probe kernel.function(\"tcp_v4_rcv\"){ printf(\"skb->len = %d\\n \",$skb->len); printf(\"cpu %d \\n\",cpu()) printf(\"execname %s pid %d tid %d \\n\",execname(),pid(),tid()); printf(\"pp %s probefunc %s\\n\",pp(),probefunc()); printf(\"gettimeofday_s %d get_cycles %d \\n\",gettimeofday_s(),get_cycles()); printf(\"ppfunc %s \\n target %d\\n \",ppfunc(),target()); print_backtrace(); exit() } 执行结果如下:具体含义自己体会和尝试 1234567891011121314151617181920212223root@ubuntu:~# stap -d e1000 -x 1000 test_all_func.stpSystemTap scrits startskb->len = 32 cpu 3execname sshd pid 14479 tid 14479pp kernel.function(\"tcp_v4_rcv@/build/linux-W6HB68/linux-4.4.0/net/ipv4/tcp_ipv4.c:1555\") probefunc tcp_v4_rcvgettimeofday_s 1516860301 get_cycles 127888015890393ppfunc tcp_v4_rcv target 1000 0xffffffff81791250 : tcp_v4_rcv+0x0/0xa20 [kernel] 0xffffffff8176b414 : ip_local_deliver_finish+0x94/0x1e0 [kernel] 0xffffffff8176b71f : ip_local_deliver+0x6f/0xe0 [kernel] 0xffffffff8176b0f2 : ip_rcv_finish+0x92/0x320 [kernel] 0xffffffff8176ba21 : ip_rcv+0x291/0x3a0 [kernel] 0xffffffff8172c634 : __netif_receive_skb_core+0x704/0xa60 [kernel] 0xffffffff8172c9a8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff8172ca22 : netif_receive_skb_internal+0x32/0xa0 [kernel] 0xffffffff8172d6a3 : napi_gro_receive+0xc3/0x120 [kernel] 0xffffffffc00400d2 : e1000_clean_rx_irq+0x152/0x4c0 [e1000] 0xffffffffc0040722 : e1000_clean+0x262/0x8c0 [e1000] 0xffffffff8172ceee : net_rx_action+0x21e/0x360 [kernel] 0xffffffff81085db1 : __do_softirq+0x101/0x290 [kernel] 0xffffffff8183a30c : do_softirq_own_stack+0x1c/0x30 [kernel] 获取当前调用栈,probe的探测点代码位置,触发进程名称、进程id和线程id、执行cpu id,当然时间相关函数也可以打出来; 这些函数基本可以帮助你了解定位该probe点所处于位置和系统进程运行和调用情况;可谓强大之极。 我想了解用户态程序成员变量在某一行的值,我该怎么做?使用statement定位到具体某一行,然后打印出你关注的变量即可。 eg: 用户态程序如下 root@ubuntu:~# cat test.c 12345678910111213141516171819202122#include <stdio.h>typedef struct str { 
int len; char *data;} str_t;typedef struct policy { str_t name; int id;} policy_t;int main(int argc, char *argv[]){ policy_t policy; policy_t *p = &policy; p->id = 111; p->name.data = \"test\"; p->name.len = sizeof(\"test\")-1; printf(\" p->id: %d, p->name.data:[%p] %s, p->name.len: %d\\n\", p->id, p->name.data, p->name.data, p->name.len); return 0;} gcc -Wall -g -o test ./test.c stap 脚本如下: root@ubuntu:~# cat test.stp 1234probe process(\"./test\").statement(\"main@./test.c:20\"){ printf(\"p->name->data pointer[%p] policy name: p->name->data %s : p->name->len %d p->id %d \\n\", $p->name->data,$p->name->data$,$p->name->len,$p->id);} 如何探测: 步骤1:开启脚本监听 12root@ubuntu:~# stap test.stpp->name->data pointer[0x4006b8] policy name: p->name->data \"test\" : p->name->len 4 p->id 111 步骤2:运行用户态进程 12root@ubuntu:~# stap test.stpp->name->data pointer[0x4006b8] policy name: p->name->data \"test\" : p->name->len 4 p->id 111 我用户态的函数是void * 类型,脚本一直报类型不对,我想通过void *指针拿出具体成员,该怎么做?eg: 如下: 我定义了一个void * 类型的变量q,想在stap脚本中通过q打印出成员变量的值,该怎么做? root@ubuntu:~# cat test.c 1234567891011121314151617181920212223#include <stdio.h>typedef struct str { int len; char *data;} str_t;typedef struct policy { str_t name; int id;} policy_t;int main(int argc, char *argv[]){ policy_t policy; policy_t *p = &policy; void *q=(void*)p; p->id = 111; p->name.data = \"test\"; p->name.len = sizeof(\"test\")-1; printf(\"[p:%p] [q:%p] p->id: %d, p->name.data:[%p] %s, p->name.len: %d\\n\", p,q,p->id, p->name.data, p->name.data, p->name.len); return 0;} 脚本: 123456root@ubuntu:~# cat test.stpprobe process(\"./test\").statement(\"main@./test.c:21\"){ d =&@cast($q,\"policy_t\") printf(\"policy name: p->name->data %s \\n\", d->name->data$);} 脚本解释:d变量为转换后的指针,是脚本的局部变量 ;@cast($q,”policy_t”) 方法可以转换指针类型 探测脚本运行: 12root@ubuntu:~# stap test.stppolicy name: p->name->data \"test\" 编译运行: 123gcc -Wall -g -o test ./test.croot@ubuntu:~# ./test[p:0x7ffc5202b240] [q:0x7ffc5202b240] p->id: 111, p->name.data:[0x4006d8] test, p->name.len: 4 怎么做到的? 
通过@cast($q,”policy_t”)将void * 类型的指针转换为policy_t *类型的指针,即可 如果存在二级指针,如果通过stap脚本监控其值?1root@ubuntu:~# cat test1.c c用户程序 12345678910111213141516#include <stdio.h>struct test { int count;};int main(int argc, char *argv[]){ struct test t = {.count = 5566}; struct test *pt = &t; struct test **ppt = &pt; printf(\"t.count: %d, pt->count: %d, ppt->count: %d\\n\", t.count, pt->count, (*ppt)->count); return 0;} stap监控脚本 12345root@ubuntu:~# cat test1.stpprobe process(\"./test1\").statement(\"main@./test1.c:13\"){ printf(\"$t->count: %d, $pt->count: %d, $ppt->count: %d\", $t->count, $pt->count, $ppt[0]->count);} 脚本运行: 12root@ubuntu:~# stap test1.stp$t->count: 5566, $pt->count: 5566, $ppt->count: 5566 程序运行: 12root@ubuntu:~# ./test1t.count: 5566, pt->count: 5566, ppt->count: 5566 总结:ppt是二级指针,(*ppt)->count 在stap脚本中表现形式是 $ppt[0]->count 如何嵌入c语言?1234567891011121314151617root@ubuntu:~# cat copy_process.stpfunction getprocname:string(task:long)%{ struct task_struct *task = (struct task_struct *)STAP_ARG_task; snprintf(STAP_RETVALUE, MAXSTRINGLEN, \"pid: %d, comm: %s\", task->pid, task->comm);%}function getprocid:long(task:long)%{ struct task_struct *task = (struct task_struct *)STAP_ARG_task; STAP_RETURN(task->pid);%}probe kernel.function(\"copy_process\").return{ printf(\"copy_process return: %p, pid: %d, getprocname: %s, getprocid: %d\\n\", $return, $return->pid, getprocname($return), getprocid($return));} 运行 窗口1: 12root@ubuntu:~# stap -g copy_process.stpcopy_process return: 0xffff88030ff3aa00, pid: 11976, getprocname: pid: 11976, comm: bash, getprocid: 11976 窗口2 1ls stap脚本要在花括号前加上% 号 获取参数STAP_ARG_前缀 返回值用STAP_RETVALUE ,其它情况使用snprintf or strncat将返回值拼进来 上述task是指针类型为long 如何修改进程中的变量?root@ubuntu:~# cat test3.c 12345678910111213141516#include<stdio.h>typedef struct policy{ int id;}policy_t;int main(){ policy_t policy; policy_t *p = &policy; policy_t **pp ; p->id =111; printf(\"before stap set value,p->id:%d\\n\",p->id); pp = &p; printf(\"after stap set value,p->id : %d ,(*pp)->id : 
%d\\n\",p->id,(*pp)->id); return 0;} 修改进程中结构变了policy_t中的id 12345probe process(\"./test3\").statement(\"main@./test3.c:13\"){ $p->id=222; printf(\"$p$: %s\\n\",$p$)} 执行stap脚本 12root@ubuntu:~# stap -g test3.stap$p$: {.id=222} 运行进程程序 123root@ubuntu:~# ./test3before stap set value,p->id:111after stap set value,p->id : 222 ,(*pp)->id : 222","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"systemtap","slug":"systemtap","permalink":"http://vcpu.me/tags/systemtap/"},{"name":"ubuntu16.04.2","slug":"ubuntu16-04-2","permalink":"http://vcpu.me/tags/ubuntu16-04-2/"},{"name":"probe","slug":"probe","permalink":"http://vcpu.me/tags/probe/"},{"name":"内存泄漏探测","slug":"内存泄漏探测","permalink":"http://vcpu.me/tags/内存泄漏探测/"},{"name":"嵌入c","slug":"嵌入c","permalink":"http://vcpu.me/tags/嵌入c/"},{"name":"打印二级指针","slug":"打印二级指针","permalink":"http://vcpu.me/tags/打印二级指针/"},{"name":"qemu进程创建失败","slug":"qemu进程创建失败","permalink":"http://vcpu.me/tags/qemu进程创建失败/"}]},{"title":"ubuntu16.04.2 systemtap安装以及工作原理和协议栈调用栈不全解决方法","slug":"systemtap3.1","date":"2017-12-02T10:00:00.000Z","updated":"2018-02-11T15:18:58.000Z","comments":true,"path":"systemtap3.1/","link":"","permalink":"http://vcpu.me/systemtap3.1/","excerpt":"openstack环境常常也使用ubuntu16.04.2作为host主机镜像构建私有云。systemtap可以帮助笔者探测host主机内核态的真实运行状态,外加内核代码阅读及时定位看是系统底层的问题。解决一些系统难题、扩展思路、排除一些因为不了解内核而造成的不必要的猜测。下面是笔者使用systemtap环境安装过程和基本原理描述。 关于跨模块和kernel 符号显示不全是笔者遇到的第一个问题,找到了解决方法,也记录下来,请看下文。 如何安装systemtap?步骤1:下载ubuntu16.04.2调试数据包 1linux-image-4.4.0-62-generic-dbgsym_4.4.0-62_amd64.ddeb 调试目录数据包下载路径 http://ddebs.ubuntu.com/pool/main/l/linux/?C=M;O=D 步骤2:安装ubuntu16.04.2调试包 1dpkg -i linux-image-4.4.0-62-generic-dbgsym_4.4.0-62_amd64.ddeb 步骤3: 编译安装systemtap程序 12345678910apt-get update && \\ apt-get install -y build-essential gettext elfutils libdw-dev python wget tar && \\wget https://sourceware.org/systemtap/ftp/releases/systemtap-3.1.tar.gztar xzvf systemtap-3.1.tar.gzcd systemtap-3.1/ && 
\\ ./configure && \\ make all && \\ make install ;","text":"openstack环境常常也使用ubuntu16.04.2作为host主机镜像构建私有云。systemtap可以帮助笔者探测host主机内核态的真实运行状态,外加内核代码阅读及时定位看是系统底层的问题。解决一些系统难题、扩展思路、排除一些因为不了解内核而造成的不必要的猜测。下面是笔者使用systemtap环境安装过程和基本原理描述。 关于跨模块和kernel 符号显示不全是笔者遇到的第一个问题,找到了解决方法,也记录下来,请看下文。 如何安装systemtap?步骤1:下载ubuntu16.04.2调试数据包 1linux-image-4.4.0-62-generic-dbgsym_4.4.0-62_amd64.ddeb 调试目录数据包下载路径 http://ddebs.ubuntu.com/pool/main/l/linux/?C=M;O=D 步骤2:安装ubuntu16.04.2调试包 1dpkg -i linux-image-4.4.0-62-generic-dbgsym_4.4.0-62_amd64.ddeb 步骤3: 编译安装systemtap程序 12345678910apt-get update && \\ apt-get install -y build-essential gettext elfutils libdw-dev python wget tar && \\wget https://sourceware.org/systemtap/ftp/releases/systemtap-3.1.tar.gztar xzvf systemtap-3.1.tar.gzcd systemtap-3.1/ && \\ ./configure && \\ make all && \\ make install ; systemtap 测试12345678root@ubuntu:~# stap -ve 'probe begin { log(\"hello systemtap!\") exit() }'Pass 1: parsed user script and 465 library scripts using 110724virt/47744res/6432shr/41544data kb, in 120usr/40sys/374real ms.Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 111648virt/48956res/6620shr/42468data kb, in 0usr/0sys/7real ms.Pass 3: translated to C into \"/tmp/staps7ieJA/stap_294bb69fb17d36571b970bddbffa5bd9_1172_src.c\" using 111648virt/48956res/6620shr/42468data kb, in 0usr/0sys/0real ms.Pass 4: compiled C into \"stap_294bb69fb17d36571b970bddbffa5bd9_1172.ko\" in 990usr/220sys/2162real ms.Pass 5: starting run.hello systemtap!Pass 5: run completed in 0usr/10sys/392real ms. 
打印4s内所有open系统调用的信息12345678910111213141516171819202122root@ubuntu:~# cat test2.stp#!/usr/bin/stapprobe begin{ log(\"begin to probe\")}probe syscall.open{ printf (\"%s(%d) open (%s)\\n\", execname(), pid(), argstr)}probe timer.ms(4000) # after 4 seconds{ exit ()}probe end{ log(\"end to probe\")} 123456root@ubuntu:~# cp /root/systemtap/systemtap-3.1/stap /usr/bin/root@ubuntu:~# ./test2.stpbegin to probeirqbalance(1686) open (\"/proc/interrupts\", O_RDONLY)irqbalance(1686) open (\"/proc/stat\", O_RDONLY)end to probe systemtap 工作原理systemtap 的核心思想是定义一个事件(event),以及给出处理该事件的句柄(Handler)。当一个特定的事件发生时,内核运行该处理句柄,就像快速调用一个子函数一样,处理完之后恢复到内核原始状态。这里有两个概念: 事件(Event):systemtap 定义了很多种事件,例如进入或退出某个内核函数、定时器时间到、整个systemtap会话启动或退出等等。 句柄(Handler):就是一些脚本语句,描述了当事件发生时要完成的工作,通常是从事件的上下文提取数据,将它们存入内部变量中,或者打印出来。 Systemtap 工作原理是通过将脚本语句翻译成C语句,编译成内核模块。模块加载之后,将所有探测的事件以钩子的方式挂到内核上,当任何处理器上的某个事件发生时,相应钩子上句柄就会被执行。最后,当systemtap会话结束之后,钩子从内核上取下,移除模块。整个过程用一个命令 stap 就可以完成。 实际使用linux basic kernl和加载进去的ko文件共同组成linux 内核态,支撑起linux操作系统的运行;linux有不同的驱动选择,这些驱动往往走不同内核分支,它们很相像,在梳理协议栈时候,往往很费解,stap可以帮助解决此问题,很简单的一个运用是打印调用关系栈,来最终总结出linux运行函数路径。分为打印kernel函数调用栈和加载ko模块的调用栈,来最终实现所有linux运行路径拓扑图。(这里只提讲解调用栈相关使用) https://zhuanlan.zhihu.com/p/28680568 kenel调用栈步骤1:列出kernel符号 12root@ubuntu:~/openvswitch-2.7.0# stap -l 'kernel.function(\"*\")'|grep __netif_receive_skb_corekernel.function(\"__netif_receive_skb_core@/build/linux-W6HB68/linux-4.4.0/net/core/dev.c:3828\") 内核ko模块调用栈步骤1:列出模块函数符号 stap -l ‘module(“openvswitch”).function(“*“)’|grep ovs_vport_receive 120xffffffffc0586ac0 br_handle_frame0xffffffffc05864a0 br_handle_frame_finish 解决stap函数栈解析不全问题问题举例1:vethpair口调用函数调用栈打印打印协议栈脚本 1234567root@ubuntu:~# cat btveth.stp probe module(\"veth\").function(@1){ print(\"----------------START-------------------------\\n\") printf(\"In process [%s]\\n\", execname()) print_backtrace() print(\"----------------END-------------------------\\n\") } 执行上述脚本后,出现资源无法解析到情况,可以看到stack函数打印时只有指针没有函数名称,且出现inexact字样 
12345678910111213141516171819202122232425root@ubuntu:~# stap btveth.stp veth_xmitWARNING: Missing unwind data for a module, rerun with 'stap -d kernel'----------------START-------------------------In process [ping] 0xffffffffc046c550 : veth_xmit+0x0/0x70 [veth] 0xffffffff8172e959 0xffffffff8172f0e6 (inexact) 0xffffffff8172f160 (inexact) 0xffffffff817383f8 (inexact) 0xffffffff8176fab6 (inexact) 0xffffffff81770a56 (inexact) 0xffffffff81764133 (inexact) 0xffffffff8177145e (inexact) 0xffffffff81770920 (inexact) 0xffffffff81770c25 (inexact) 0xffffffff81771e29 (inexact) 0xffffffff81771e83 (inexact) 0xffffffff817983a9 (inexact) 0xffffffff8139ebc1 (inexact) 0xffffffff817a8175 (inexact) 0xffffffff8170fae8 (inexact) 0xffffffff81710591 (inexact) 0xffffffff8106f31f (inexact) 0xffffffff810caeb1 (inexact)----------------END------------------------- 解决办法1:stap探测时候加上-d kernel选项,这样在解析时候就会调用kernel的符号了,将指针变成我们能认识的函数符号1234567891011121314151617181920212223root@ubuntu:~# stap -d kernel btveth.stp veth_xmitWARNING: Missing unwind data for a module, rerun with 'stap -d openvswitch'WARNING: Missing unwind data for a module, rerun with 'stap -d bridge'----------------START-------------------------In process [ping] 0xffffffffc046c550 : veth_xmit+0x0/0x70 [veth] 0xffffffff8172e959 : dev_hard_start_xmit+0x249/0x3d0 [kernel] 0xffffffff8172f0e6 : __dev_queue_xmit+0x526/0x590 [kernel] 0xffffffff8172f160 : dev_queue_xmit+0x10/0x20 [kernel] 0xffffffff8176fc02 : ip_finish_output2+0x292/0x380 [kernel] 0xffffffff81770a56 : ip_finish_output+0x136/0x1f0 [kernel] 0xffffffff8177145e : ip_output+0x6e/0xe0 [kernel] 0xffffffff81770c25 : ip_local_out+0x35/0x40 [kernel] 0xffffffff81771e29 : ip_send_skb+0x19/0x40 [kernel] 0xffffffff81771e83 : ip_push_pending_frames+0x33/0x40 [kernel] 0xffffffff817983a9 : raw_sendmsg+0x7e9/0xb20 [kernel] 0xffffffff817a8175 : inet_sendmsg+0x65/0xa0 [kernel] 0xffffffff8170fae8 : sock_sendmsg+0x38/0x50 [kernel] 0xffffffff81710591 : ___sys_sendmsg+0x281/0x290 [kernel] 
0xffffffff81710ee1 : __sys_sendmsg+0x51/0x90 [kernel] 0xffffffff81710f32 : sys_sendmsg+0x12/0x20 [kernel] 0xffffffff818385f2 : entry_SYSCALL_64_fastpath+0x16/0x71 [kernel]----------------END------------------------- 解决办法2:加上–all-modules选项 (非常适合懒人)123456789101112root@ubuntu:~# stap --all-modules btbr.stp br_handle_frame----------------START-------------------------In process [ping] 0xffffffffc0583ac0 : br_handle_frame+0x0/0x2b0 [bridge] 0xffffffff8172c294 : __netif_receive_skb_core+0x364/0xa60 [kernel] 0xffffffff8172c9a8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff8172d7a8 : process_backlog+0xa8/0x150 [kernel] 0xffffffff8172ceee : net_rx_action+0x21e/0x360 [kernel] 0xffffffff81085db1 : __do_softirq+0x101/0x290 [kernel] 0xffffffff8183a30c : do_softirq_own_stack+0x1c/0x30 [kernel]----------------END------------------------- 问题举例2: linux bridge发包函数调用栈打印123456probe module(\"bridge\").function(@1){ print(\"----------------START-------------------------\\n\") printf(\"In process [%s]\\n\", execname()) print_backtrace() print(\"----------------END-------------------------\\n\")} 执行上述脚本后,出现资源无法解析到情况,可以看到stack函数打印时只有指针没有函数名称,很可能也漏了一部分关键函数 1234567root@ubuntu:~# stap btbr.stp br_handle_frameWARNING: Missing unwind data for a module, rerun with 'stap -d kernel'----------------START-------------------------In process [ping] 0xffffffffc0583ac0 : br_handle_frame+0x0/0x2b0 [bridge] 0xffffffff8172c294----------------END------------------------- 解决办法1:加上-d kernel 如下即可以看见调用栈全貌1234567891011root@ubuntu:~# stap -d kernel btbr.stp br_handle_frame----------------START-------------------------In process [ping] 0xffffffffc0583ac0 : br_handle_frame+0x0/0x2b0 [bridge] 0xffffffff8172c294 : __netif_receive_skb_core+0x364/0xa60 [kernel] 0xffffffff8172c9a8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff8172d7a8 : process_backlog+0xa8/0x150 [kernel] 0xffffffff8172ceee : net_rx_action+0x21e/0x360 [kernel] 0xffffffff81085db1 : __do_softirq+0x101/0x290 [kernel] 0xffffffff8183a30c : 
do_softirq_own_stack+0x1c/0x30 [kernel]----------------END------------------------- 解决办法2: 加上–all-modules选项(非常适合懒人)123456789101112root@ubuntu:~# stap --all-modules btbr.stp br_handle_frame----------------START-------------------------In process [ping] 0xffffffffc0583ac0 : br_handle_frame+0x0/0x2b0 [bridge] 0xffffffff8172c294 : __netif_receive_skb_core+0x364/0xa60 [kernel] 0xffffffff8172c9a8 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff8172d7a8 : process_backlog+0xa8/0x150 [kernel] 0xffffffff8172ceee : net_rx_action+0x21e/0x360 [kernel] 0xffffffff81085db1 : __do_softirq+0x101/0x290 [kernel] 0xffffffff8183a30c : do_softirq_own_stack+0x1c/0x30 [kernel]----------------END-------------------------","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"ubuntu16.04.2","slug":"ubuntu16-04-2","permalink":"http://vcpu.me/tags/ubuntu16-04-2/"},{"name":"systemtap编译安装","slug":"systemtap编译安装","permalink":"http://vcpu.me/tags/systemtap编译安装/"},{"name":"systemtap运行原理","slug":"systemtap运行原理","permalink":"http://vcpu.me/tags/systemtap运行原理/"},{"name":"systemtap符号不全解决方法","slug":"systemtap符号不全解决方法","permalink":"http://vcpu.me/tags/systemtap符号不全解决方法/"},{"name":"stap运行举例","slug":"stap运行举例","permalink":"http://vcpu.me/tags/stap运行举例/"},{"name":"veth pair和linux bridge函数调用栈打印","slug":"veth-pair和linux-bridge函数调用栈打印","permalink":"http://vcpu.me/tags/veth-pair和linux-bridge函数调用栈打印/"},{"name":"虚拟网络","slug":"虚拟网络","permalink":"http://vcpu.me/tags/虚拟网络/"}]},{"title":"ovs 发包限速","slug":"ovs-limit","date":"2017-11-08T10:00:00.000Z","updated":"2017-11-08T10:12:35.000Z","comments":true,"path":"ovs-limit/","link":"","permalink":"http://vcpu.me/ovs-limit/","excerpt":"netperf吞吐测试方法步骤1:在ns4 namespace中运行netserver1Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC 步骤2:在ns1 namespace中运行 netperf -H 1.1.1.4 -t UDP_STREAM1234567root@compute:~# netperf -H 1.1.1.4 -t UDP_STREAMMIGRATED UDP STREAM TEST from 
0.0.0.0 (0.0.0.0) port 0 AF_INET to 1.1.1.4 (1.1.1.4) port 0 AF_INET : demoSocket Message Elapsed MessagesSize Size Time Okay Errors Throughputbytes bytes secs # # 10^6bits/sec212992 65507 10.01 795007 0 41640.99212992 10.01 794006 41588.56 测速结果:41.58856Gbps","text":"netperf吞吐测试方法步骤1:在ns4 namespace中运行netserver1Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC 步骤2:在ns1 namespace中运行 netperf -H 1.1.1.4 -t UDP_STREAM1234567root@compute:~# netperf -H 1.1.1.4 -t UDP_STREAMMIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 1.1.1.4 (1.1.1.4) port 0 AF_INET : demoSocket Message Elapsed MessagesSize Size Time Okay Errors Throughputbytes bytes secs # # 10^6bits/sec212992 65507 10.01 795007 0 41640.99212992 10.01 794006 41588.56 测速结果:41.58856Gbps ovs限速操作方法添加限速策略 步骤1:添加qos1ovs-vsctl --timeout=10 -- set Port firstbr qos=@newqos -- --id=@newqos create QoS type=linux-htb other-config:max-rate=60000000000 注意:该命令限速60G,即使重复执行上述命令,也是最后一条命令生效 额外说明:qos有默认最大吞吐限制,如果超过最大吞吐限制,限制速度直接采用最大吞吐限制,如果没有超过最大吞吐限制,配置的max-rate才会起作用 举例:123456root@compute:~# tc -s -d class show dev firstbrclass htb 1:1 parent 1:fffe prio 0 quantum 1500 rate 12Kbit ceil 25640Mbit linklayer ethernet burst 1563b/1 mpu 0b overhead 0b cburst 0b/1 mpu 0b overhead 0b level 0 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 lended: 0 borrowed: 0 giants: 0 tokens: 16291666 ctokens: 7 限速60G,实际无法达到该速率,qos取默认最大速率为25640Mbit 其它说明:如果没有设置other-config:max-rate ,就按照接口速率设置qos吞吐 举例:1234567891011121314151617root@compute:~# ethtool firstbrSettings for firstbr: Supported ports: [ ] Supported link modes: Not reported Supported pause frame use: No Supports auto-negotiation: No Advertised link modes: Not reported Advertised pause frame use: No Advertised auto-negotiation: No Speed: 10000Mb/s Duplex: Full Port: Twisted Pair PHYAD: 0 Transceiver: internal Auto-negotiation: off MDI-X: Unknown Link detected: yes 如果没有陪着max-rate就按照 10000Mb/s进行qos限速 步骤2:添加queue,限速10G1ovs-vsctl 
--timeout=10 create Queue other-config:max-rate=10000000000 步骤3:绑定queue到具体的qosovs-vsctl –timeout=10 add qos 7cf6a845-ce22-4aa4-886f-1ad76e2914bc queues 0=826c230d-3f28-4ab6-b0d1-794f7e2a0602 注意:7cf6a845-ce22-4aa4-886f-1ad76e2914bc 为qos id,826c230d-3f28-4ab6-b0d1-794f7e2a0602为queue id, 0 为ID具体代号且有特殊含义,默认数据包均走0号ID(不明确指定queue队列,如果存在0号限速策略所有经过firstbr发出的流量均会被限速10G) 步骤4:通过流表引导流量到queue删除限速策略清除接口上qos1ovs-vsctl clear port firstbr qos 清除所有的qos和queue策略1ovs-vsctl -- --all destroy QoS -- --all destroy Queue 清除接口上的6e8b837c-2386-410e-a430-1ee319f25b01 qos策略ovs-vsctl – destroy Qos 6e8b837c-2386-410e-a430-1ee319f25b01 该命令如果对已经应用到具体接口qos则会报错如下12root@compute:~# ovs-vsctl -- destroy Qos 6e8b837c-2386-410e-a430-1ee319f25b01ovs-vsctl: transaction error: {"details":"cannot delete QoS row 6e8b837c-2386-410e-a430-1ee319f25b01 because of 1 remaining reference(s)","error":"referential integrity violation”} 要解决报错问题,需要清除接口的Qos策略 实验说明拓扑 拓扑实现脚本12345678910111213141516171819202122232425262728293031ip netns add ns1ip netns add ns2ip netns add ns3ip netns add ns4ovs-vsctl add-br br0ovs-vsctl add-br br1ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip addr add 1.1.1.1/24 dev tap1ip netns exec ns1 ip link set tap1 upip netns exec ns1 ip link set lo upovs-vsctl add-port br0 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip addr add 1.1.1.2/24 dev tap2ip netns exec ns2 ip link set tap2 upip netns exec ns2 ip link set lo upovs-vsctl add-port br0 tap3 -- set Interface tap3 type=internalip link set tap3 netns ns3ip netns exec ns3 ip addr add 1.1.1.3/24 dev tap3ip netns exec ns3 ip link set tap3 upip netns exec ns3 ip link set lo upip link add firstbr type veth peer name firstifovs-vsctl add-port br0 firstbrovs-vsctl add-port br1 firstifip link set firstbr upip link set firstif upovs-vsctl add-port br1 tap4 -- set Interface tap4 type=internalip link set tap4 netns ns4ip netns exec ns4 ip addr add 1.1.1.4/24 dev 
tap4ip netns exec ns4 ip link set tap4 upip netns exec ns4 ip link set lo up Qos策略添加 添加qos和queue 123ovs-vsctl --timeout=10 -- set Port firstbr qos=@newqos -- --id=@newqos create QoS type=linux-htb other-config:max-rate=60000000000ovs-vsctl --timeout=10 create Queue other-config:max-rate=10000000000ovs-vsctl --timeout=10 create Queue other-config:max-rate=5000000000 将queue和qos绑定 12ovs-vsctl --timeout=10 add qos 418d3c99-073c-4509-b5cd-fa928423f47f queues 1=25831471-552b-4d5a-83b0-179f8f8e8991ovs-vsctl --timeout=10 add qos 418d3c99-073c-4509-b5cd-fa928423f47f queues 2=70001227-95ae-469f-aa25-c93d6b876650 将流量导向具体的queue 12ovs-ofctl --timeout=5 add-flow br0 hard_timeout=0,idle_timeout=0,priority=50,ip,ip_src=1.1.1.1,actions=set_queue:1,NORMALovs-ofctl --timeout=5 add-flow br0 hard_timeout=0,idle_timeout=0,priority=50,ip,ip_src=1.1.1.2,actions=set_queue:2,NORMAL 网络速度测试 ns1 -> ns4 UDP传输速率为9.217Gbits < 10G(queue1) 12345678root@compute:~# netperf -H 1.1.1.4 -t UDP_STREAMMIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 1.1.1.4 (1.1.1.4) port 0 AF_INET : demoSocket Message Elapsed MessagesSize Size Time Okay Errors Throughputbytes bytes secs # # 10^6bits/sec212992 65507 10.00 570032 0 29871.54212992 10.00 175900 9217.74 ns2 -> ns4 UDP传输速率为4.949Gbits < 5G(queue2) 12345678root@compute:~# netperf -H 1.1.1.4 -t UDP_STREAMMIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 1.1.1.4 (1.1.1.4) port 0 AF_INET : demoSocket Message Elapsed MessagesSize Size Time Okay Errors Throughputbytes bytes secs # # 10^6bits/sec212992 65507 10.01 1379035 0 72220.42212992 10.01 94505 4949.25 观察策略结果查看br0的流表策略12345root@compute:~# ovs-ofctl dump-flows br0NXST_FLOW reply (xid=0x4): cookie=0x0, duration=519.507s, table=0, n_packets=1271232, n_bytes=83320651952, idle_age=238, priority=50,ip,nw_src=1.1.1.1 actions=set_queue:1,NORMAL cookie=0x0, duration=519.197s, table=0, n_packets=2130362, n_bytes=139571241079, idle_age=247, priority=50,ip,nw_src=1.1.1.2 
actions=set_queue:2,NORMAL cookie=0x0, duration=5773.205s, table=0, n_packets=12914956, n_bytes=736062122789, idle_age=238, priority=0 actions=NORMAL 查看限速TC queue123456789101112131415161718192021222324oot@compute:~# tc -s -d class show dev firstbrclass htb 1:fffe root rate 25640Mbit ceil 25640Mbit linklayer ethernet burst 0b/1 mpu 0b overhead 0b cburst 0b/1 mpu 0b overhead 0b level 7 Sent 17715662461 bytes 340 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 lended: 270756 borrowed: 0 giants: 0 tokens: 6 ctokens: 6class htb 1:1 parent 1:fffe prio 0 quantum 1500 rate 12Kbit ceil 25640Mbit linklayer ethernet burst 1563b/1 mpu 0b overhead 0b cburst 0b/1 mpu 0b overhead 0b level 0 Sent 168 bytes 4 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 lended: 4 borrowed: 0 giants: 0 tokens: 15417840 ctokens: 6class htb 1:2 parent 1:fffe prio 0 quantum 1500 rate 12Kbit ceil 10Gbit linklayer ethernet burst 1563b/1 mpu 0b overhead 0b cburst 1250b/1 mpu 0b overhead 0b level 0 Sent 11523732906 bytes 55 pkt (dropped 394118, overlimits 0 requeues 0) backlog 0b 0p requeues 0 lended: 6 borrowed: 175964 giants: 0 tokens: -525916978 ctokens: 18class htb 1:3 parent 1:fffe prio 0 quantum 1500 rate 12Kbit ceil 5Gbit linklayer ethernet burst 1563b/1 mpu 0b overhead 0b cburst 1250b/1 mpu 0b overhead 0b level 0 Sent 6191929387 bytes 281 pkt (dropped 1284513, overlimits 0 requeues 0) backlog 0b 0p requeues 0 lended: 6 borrowed: 94792 giants: 0 tokens: -524514101 ctokens: 37 打印队列发送速度脚本12345678910111213import osimport sysimport timecmds=\"tc -s -d class show dev bond1|tail -n 6|grep Sent|awk '{print $2}'\"print 4*'start'while True: result1=os.popen(cmds).readlines() time.sleep(1) result2=os.popen(cmds).readlines() print \"queue tx speed:\", (int(result2[0])-int(result1[0]))/1000 打印bond1口,tc 
queue,定位到具体class,单位Kbyte/s","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"tc","slug":"tc","permalink":"http://vcpu.me/tags/tc/"},{"name":"netperf","slug":"netperf","permalink":"http://vcpu.me/tags/netperf/"},{"name":"发包限速","slug":"发包限速","permalink":"http://vcpu.me/tags/发包限速/"}]},{"title":"linux bridge","slug":"linux-bridge1","date":"2017-11-02T10:00:00.000Z","updated":"2017-11-02T10:53:23.000Z","comments":true,"path":"linux-bridge1/","link":"","permalink":"http://vcpu.me/linux-bridge1/","excerpt":"linux bridge处理原理进入桥的数据报文分为几个类型,桥对应的处理方法也不同:1. 报文是本机发送给自己的,桥不处理,交给上层协议栈;2. 接收报文的物理接口不是网桥接口,桥不处理,交给上层协议栈;3. 进入网桥后,如果网桥的状态为Disable,则将包丢弃不处理;4. 报文源地址无效(广播,多播,以及00:00:00:00:00:00),丢包;5. 如果是STP的BPDU包,进入STP处理,处理后不再转发,也不再交给上层协议栈;6. 如果是发给本机的报文,桥直接返回,交给上层协议栈,不转发;7. 需要转发的报文分三种情况:1) 广播或多播,则除接收端口外的所有端口都需要转发一份;2) 单播并且在CAM表中能找到端口映射的,只需要网映射端口转发一份即可;3) 单播但找不到端口映射的,则除了接收端口外其余端口都需要转发;(1) bridge 网络帧,如果帧的目的 MAC 地址是在桥的另一侧的某个网络设备上。(2) 泛洪该网络帧,如果帧的目的 MAC 对网桥是不认识的。(3) 转到更高一层的三层IP协议栈代码处理,如果帧的目的地址是桥本身的或者它的某个端口的。忽略它,如果帧的目的地址是位于它来自的桥的方向的同一侧。 对 (1)和 (2)来说,接下来该数据帧被 filter 的 FORWARD 链处理。对(3)来说,它会被 filter 的 INPUT 链处理。发生这种情况时,该 bridge 其实是被用作一个路由器(一个对比例子是现实世界中的带路由的交换机设备)。包含 IP 包的以太帧的目的 MAC 地址是桥的MAC地址,但是目的 IP 地址不是桥的IP地址。","text":"linux bridge处理原理进入桥的数据报文分为几个类型,桥对应的处理方法也不同:1. 报文是本机发送给自己的,桥不处理,交给上层协议栈;2. 接收报文的物理接口不是网桥接口,桥不处理,交给上层协议栈;3. 进入网桥后,如果网桥的状态为Disable,则将包丢弃不处理;4. 报文源地址无效(广播,多播,以及00:00:00:00:00:00),丢包;5. 如果是STP的BPDU包,进入STP处理,处理后不再转发,也不再交给上层协议栈;6. 如果是发给本机的报文,桥直接返回,交给上层协议栈,不转发;7. 
需要转发的报文分三种情况:1) 广播或多播,则除接收端口外的所有端口都需要转发一份;2) 单播并且在CAM表中能找到端口映射的,只需要网映射端口转发一份即可;3) 单播但找不到端口映射的,则除了接收端口外其余端口都需要转发;(1) bridge 网络帧,如果帧的目的 MAC 地址是在桥的另一侧的某个网络设备上。(2) 泛洪该网络帧,如果帧的目的 MAC 对网桥是不认识的。(3) 转到更高一层的三层IP协议栈代码处理,如果帧的目的地址是桥本身的或者它的某个端口的。忽略它,如果帧的目的地址是位于它来自的桥的方向的同一侧。 对 (1)和 (2)来说,接下来该数据帧被 filter 的 FORWARD 链处理。对(3)来说,它会被 filter 的 INPUT 链处理。发生这种情况时,该 bridge 其实是被用作一个路由器(一个对比例子是现实世界中的带路由的交换机设备)。包含 IP 包的以太帧的目的 MAC 地址是桥的MAC地址,但是目的 IP 地址不是桥的IP地址。 iptable linux bridge网络包匹配条件 使用举例实验环境 实验脚本12345678910111213141516171819202122brctl addbr br0 ifconfig br0 1.1.2.254/24 upip addr add 1.1.1.254/24 dev br0ip link add net0_eth0 type veth peer name tap0ip netns add net0ip link set dev net0_eth0 netns net0ip netns exec net0 ip link set dev net0_eth0 name eth0ip netns exec net0 ip addr add 1.1.1.1/24 dev eth0ip netns exec net0 ip link set dev eth0 upip link set dev tap0 master br0ip link set dev tap0 upip link add net1_eth0 type veth peer name tap1ip netns add net1ip link set dev net1_eth0 netns net1ip netns exec net1 ip link set dev net1_eth0 name eth0ip netns exec net1 ip addr add 1.1.1.2/24 dev eth0ip netns exec net1 ip link set dev eth0 upip link set dev tap1 master br0ip link set dev tap1 up 匹配linux网络数据包1234iptables -A INPUT -p icmp -m physdev --physdev-in tap0 -j LOGiptables -A FORWARD -p icmp -m physdev --physdev-in tap0 --physdev-out tap1 -j LOGiptables -A FORWARD -p icmp -m physdev --physdev-is-outiptables -A FORWARD -p icmp -m physdev --physdev-is-out --physdev-is-in --physdev-is-bridged 查看数据包123456789101112Chain INPUT (policy ACCEPT 122 packets, 8044 bytes) pkts bytes target prot opt in out source destination 5 420 LOG icmp -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-in tap0 LOG flags 0 level 4Chain FORWARD (policy ACCEPT 12 packets, 1008 bytes) pkts bytes target prot opt in out source destination 6 504 LOG icmp -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-in tap0 --physdev-out tap1 LOG flags 0 level 4 12 1008 icmp -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match 
--physdev-is-out 12 1008 icmp -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-is-in --physdev-is-out --physdev-is-bridgedChain OUTPUT (policy ACCEPT 85 packets, 20092 bytes) pkts bytes target prot opt in out source destination","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"iptables","slug":"iptables","permalink":"http://vcpu.me/tags/iptables/"},{"name":"linux bridge","slug":"linux-bridge","permalink":"http://vcpu.me/tags/linux-bridge/"}]},{"title":"小说官神经典语录","slug":"guanshen","date":"2017-11-02T10:00:00.000Z","updated":"2017-11-02T11:05:04.000Z","comments":true,"path":"guanshen/","link":"","permalink":"http://vcpu.me/guanshen/","excerpt":"“苟利国家生死以,岂因祸福趋避之。”","text":"“苟利国家生死以,岂因祸福趋避之。” 为官者达到举重若轻、点石成金、出神入化的境界,是为官神。 “渡尽劫波兄弟在,相逢一笑泯恩仇。” 但留一线,也好相见。 “俯仰无愧天地,褒贬自有春秋!” 世界上有两样东西最震憾人心,一个是寄托了一生追求的理想,一个是现实中必须承担的责任。因为有理想,我们对未来充满了希望。因为有责任,我们负重前行,义无反顾。 一个国家真正富强的标志不是GDP排名第几,不是有没有世界排名前几的富翁。也不是穷兵黩武拥有多少核武器,而是在于这个国家有没有利用文化来向世界各地输出自己的文化理念和价值观! “十年磨一剑,霜刃未曾试,今日把示君,谁有不平事?” 世界上化解痛苦的最好的办法,莫过于看到自己的对手受到比自己更大的伤害,失去比自己更多的利益。 “中岁颇好道,晚家南山陲。兴来每独往,胜事空自知。行到水穷处,坐看云起时。偶然值林叟,谈笑无还期。” 听弦歌而知雅意。 政治上的团结是全然不分党政的。 官场上的争斗,除了政治智慧之外,比拼就是谁的后台够硬。 有时候,境界的高低就决定了眼界的不同,出身的不同,往往就直接决定了命运的不同。 傲慢不仅仅可以表现为嚣张和不可一世,也可以表现为淡定从容的风度,居高临下的微笑,以及一切尽在掌握的自信。 有时候也许你级别不够高,权力不够大,但如果你有足够影响到关键人物的能力,你就会成为一个至关重要的支点,是所有目光的焦点,这个时候,你就有了决定性的影响力。 越在百姓面前趾高气扬不可一世的人,其实越是官不大钱不多素质不高本事不大的人,正是因为没有足够的跻身上流社会的资本,才会在普通人面前耀武扬威,炫耀自己的并不多的本钱,其实也是一种无能加自卑的变态表现。 最聪明的人从来不将利益挂在嘴边放在眼前,他们只做不说,表面只帮别人得到利益,实际上在帮助别人的同时,所有受过他的恩惠的人,都会将他的帮助记在心间。让别人记住你的好,是做人的最大成功,远比让别人记住你多有能量多重要要强上许多。人都是感情动物,在感情上接受一个人会记得十分长久。 许多时候,我们和人交道,尊重的不是他本人,是他的背景。现在看来,用在狗身上也同样合适。有什么样的主人,就有什么样的狗仗人势的狗! 
没做过好事的人,是永远体会不到做好事之时的温暖和美好,是永远无法感受被人感谢时的幸福和快乐。 有时候越是事关自己的切身利益,就越要耐住性子,不能露怯。否则,同事轻视你,领导看不上你,你的前途就完了。 法治建设之所以任重而道远,就是因为国内人治的现象太过严重,一些领导人权力欲望高涨,亲者近远者疏,不站在公平客观的立场上看待问题,而是从个人感情出发,迁就纵容身边人,对许多丑陋现象视而不见,只因为违法乱纪者有后台有背景,就不予追究。 人事一说,可以理解为先做人后做事,也可以理解为先用人后成事。 领导站着听你的工作汇报,有双重含义,一是他可以随时结束,如果你没有说到领导心里,他一抬手看时间,你就得马上结束汇报。二是给你施加压力,站着,可以当成是对你的重视,也可以当成随时准备走人的准备。抬手腕就可以走人,但同时又可以一屁股坐下,全看你的表现是不是让领导满意了。 领导的话总是含义丰富,猜不对的话,不但办不成事情,还会在领导心目中留下不好的印象,以后再想重新建立好感,就很难了。 同流合污和孤胆英雄之间,有一个灰色的中间地带,就叫和光同尘。 你和谁有仇,你就生一个女儿,从小不教好她,让她爱虚荣乱花钱,然后将你的女儿嫁给仇人的儿子,然后仇人的全家就完了。 有些人,你退一步,他进两步。你以大局为重,你以为你软弱可欺。稍有风吹草动,他就以为你会失势,就想趁机在你的头上踩上几脚。 不该问的问题绝对不要问,不该知道的问题千万不要想法去知道,是官场守则的第一要旨。 没领结婚证之前,所有的恋爱都算耍流氓。 上联:赤条条来,深入裙中,海棠树上梨花颤。 下联:光溜溜去,牡丹花下,嫩草尚绿老牛归。 横批:畜生入死。 在官场之上,你挡了别人的路,坐了别人的座位,比夺妻之恨还严重。 厅级干部,尤其是实职正厅,因为经济问题下台的,多半是政治斗争的牺牲品。 认为天下太平,形势一片大好的人,都不过被新闻媒体愚弄了,或是头脑过于简单。严防死守记者是为了什么,还不是为了继续推行愚民政策。 力量掌握在坏人手中,就是土匪。如果土匪的头头改邪归正,土匪就成了官兵。 让一个人最痛苦的不是将他一棍子打死,而是让他看不到希望,但又必须强颜欢笑,假装一切都好 领导随便和平和,是领导风范,是平易近人的表现。下级要是随便和平和,是不懂规矩,是目无尊长的不成熟。规矩一定要遵守,不管领导对你多看重,任何时候都不要忘了领导就是领导,不是你可以随便说笑的朋友。把领导当朋友的人,早晚会被领导弃置一边。 有些人,你拿他当人,他拿你当鸟。你拿他当鸟,他就立刻软了腰,拿你当人了。 开拓一说,本来就没有一个明确的说法,什么叫开拓?大刀阔斧叫开拓,但也可以叫蛮干。轰轰烈烈叫开拓,但也可以叫沽名钓誉。 政治上,玩的是权术,但比的也是谁更有理有据。在后台同样强硬的情况下,谁更有主动权,谁更大义凛然,谁就是最后的胜利者。 新手开车,总是喜欢左冲右突,似乎多有身手一样。真正有了经历几后,车都开得平稳了,宁停三分,不抢一秒。 领导拉拢下级,手法高超或直接,其实目的都一样,领导是想让下级听话,惟命是从,平常认真落实领导意图,关键时刻当枪,危机时刻当炮灰。而下级向领导靠拢,用意也很明显,是想借上级之手,为升迁创造条件,想要步步高升。说白了,上下级之间,也是互相利用的关系,就看谁利用谁的手段更高明罢了。 你不坏,别人未必会认为你不坏。你很好,别人也许认为你假装清高。 人生是一次旅行,所有人的目的地都相同,所不同的是沿途的风景。 政治就是最大的冒险,向小里说,赌的就是人生和前途,向大里说,赌的就是性命。 伟人都是被神化的,在我们需要仰视才见的伟人的光辉的形象背后,也是一样有七情六欲有喜怒哀乐,会震怒会失态也会骂人的凡人。 奋斗之前,是天道酬勤。奋斗之后,是希言自然。 官场之上最不缺少的就是人才,就是接任者,说句不好听的话,如果将全国各地市的一二把手全部拿下,不出三天,就会有后备力量全部接替,不会出现有位置而缺人顶替的尴尬。 聪明人往往都会讲信誉,并且目光长远,一旦认准目标,轻易不会改变初衷。 咄咄逼人的性格,因为心气过于外露,精力外泄,所以易伤肝。人一老,最先由脚冷开始,肝火不旺,寒气入体就长驱直入,身体的抵抗力就每况愈下。 国人都溺爱自己的孩子,认为自己的孩子就是全世界一样,但说句不好听的实话,你的全世界不是别人的全世界。 社会是一个大家庭,个人拥有再多的财富,再高的社会地位,如果不能建立在和百姓同分享共欢乐的基础之上,就和建造在沙滩上的高楼大厦一样,早晚倒塌。 一个真正的政治家,在提高自身素质的同时,在引导百姓正确的方向的同时,更要有心系苍生敢为天下先的勇气和决心,在其位谋其政,愿为国家的未来、民族的振兴和百姓的幸福,奉献毕生的心血。 政治之上,你永远不知道你的朋友或对手的后面,到底站着的人是谁。 
人生无非过场,总有落幕的一刻,与其刻意逃避已经发生的事实,不如坦然面对,或许还能多一些开朗。 人无品不立,没有品德的人,能力越大,反而越是社会的危害。 想要达到心中的理想,有时候,你必须学会伪装。伪装不是虚伪,是必要的弯路。 不怕狼一样穷凶极恶的对手,就怕猪一样无所事事的队友。 最可喜的是身临绝境而不绝望,能惊而不慌,急而不措。最可悲的是首战告捷先庆功,孤芳而自赏,得意而忘形。 问题不在于是谁发起,而是在谁的手中出现了问题。就如每一个末代皇帝,都有一个将家底耗光的爹,留给他的是一个千疮百孔无论如何也缝补不过来的烂摊子,而且气数已尽,无力回天。 红颜祸水,其实祸水的不是红颜,而是男人的贪婪和占有之心。 酒鬼就认为世界上人人好酒,色鬼就觉得所有男人见到女人就会走不动,阴谋家见到一出碾压小孩的车祸事件,也会猜疑是不是有人为制造的可能…… 人应该有自知之明,既然不在其位了,就应该不谋其政,不要上蹿下跳地像一个跳梁小丑一样,拿着鸡毛当令箭。 将欲取之,必先予之,用人之道,也要讲究一个平衡和制衡,养虎未必会成患,也许将可以吃人的老虎养大之后,再放虎归山,必有猎人猎杀之。 政治上最大的惩罚不是免职,也不是一免到低,而是先晾后晒,然后再处理的先扬后抑。反正就将你放在原有的位置,可以上新闻媒体,可以出风头,可以说可以报道的话,但你自己却心里清楚,手中无权,而且前方无路,在前方等待你的是一个大大的囚笼。你想走,走不了。想不干,也撂不下担子。就是说,明明知道快死了,还得强颜欢笑,还得以饱满的热情和满腔的激情来迎接死亡——换了谁,都受不了这样的煎熬。 有时候,卸磨杀驴不是主人无情,而是驴自恃劳苦劳高,产生了不切实际的幻想。 大凡双眼犀利、为人处事咄咄逼人之人,因为精气外泄过多,大多不会长寿。 对事不对人的说法是很高尚的说法,实际上,人都是情感动物,谁也做不到真正的对事不对人。 将复杂问题简单化,是人才。将简单问题复杂化,是官僚。 精明不是坏事,但凡事要看长远,在大局观的统筹之下,不计较一时得失才是大将之风,凡事斤斤计较,吃一点亏就要还回来,精明过头就显得小气了。 不管黑狗白狗,能看门的狗就是好狗。 无知者不是无畏,而是自我沉醉。 对于大部分油盐酱醋的市民来说,生活就是日出而作日落而息,就是赚点钱吃点肉喝点酒,然后老婆孩子热坑头,然后打打麻将吹吹牛,然后拉灯睡觉,感觉日子如流水,却天下太平,就算地震,也震不到自己的头上。 有时候,无知也是一种愚蠢的幸福。 太多人都生活在真相的大门之外,甚至打酱油也打的是劣质酱油。 一个人只有在失而复得之后才会感受到平常忽视的珍贵。 每个人都有作茧自缚的一面,只是自己不知道罢了。 有时候赖帐不是因为脸皮厚,而是因为心照不宣。 敲山是为震虎,打狗,是为了让狗主人看清形势。 许多时候,高高在上的一类人见到百姓的疾苦不会同情,是因为没有感同身受。 任何改革,都是向既得利益者叫板,不拿既得利益集团开刀的改革,就不是改革。 圈养贪官然后等其养肥之后再杀之取财,也不失为增加税收的一条捷径…… 身上不干净的人,最怕深化改革,一改革,许多事情就得摆到阳光之下。问题是,身上不见光的事情太多了,不能晒,一晒就化了。 历史的公正在于,任何逆历史潮流而动的人物都会被冲得七零八落,历史从来不讲情面,只按照既定的轨道向前冷漠地推动。 历史上历来每次改革,都会出现许多反对的声音,甚至还有暴力事件发生。改革就是触动既得利益集团的利益,没有利益重新分配的改革不叫改革,叫空头支票。 官场之上,站队是每个人都必须面临的重大选择,不站队,就不会融入团队,不融入团队,就没有盟友。没有盟友的官场中人,是可悲的独行者,肯定走不远。 腐败不除,不仅是道德的沦丧,也是文化的悲哀,更是信仰的缺失和制度的漏洞,在现在的腐败形势日盖严峻的情形之下,如果真要推行一种所谓的允许适度腐败的模式,夏想断言,不出二十年,必有亡党亡国之忧。 古往今来,几千年的泱泱大国,有记载的只有一人家世永昌,传承了七十多代一孔子。 中国人生活累,一小半缘于生存,一大半缘于攀比。 美国的忧患意识极强,做事情一般都谋划长远,甚至长达20年。美国不但不允许世界上出现一个可以挑战美国权威的超级大国,同样也不允许中国出现一位有胆魄有开拓精神的领导人 美国作为世界上最霸权的国家,又是世界警察,除非飞出地球,否则哪里都会有美国的影子。 离开生存谈人权,就是扯淡的事情。同样,离开民众素质的提高谈民主,也一样是扯谈。 人,如果不摆正自己的位置,早晚会被自己的傲慢所累。 每个国家都有自己独特的技术或资源,就和一个人一样,只有拥有了独立的人格和技能,才能赢得别人的尊重。如果将自己的独门绝技拱手让人,不顾自身的安危和由此有可能造成的严重的后果,那不是活雷锋,是活宝。 
想要保持神秘有两种办法,一是沉默。但越沉默反而越容易让外界众说纷纭,还容易让对方判断失误。另一种就是夸大,在夸夸其谈之下,同样也让对方摸不透虚实。 中国外交策略过于保守,凡事喜欢一个腔调说话,其实反而容易让外界摸清底细。而美国的手法通常是两面手法,外交部是一套说辞,国防部是一套说辞,一方含糊其词,一方措辞强硬,就很容易让外界分不清美国的真实意图。 对老人来讲,有点事情做是好事,人不能闲,一闲,心思就淡,心思一淡,生命力就弱了。 人生所有的努力和奋斗,怕就怕突然双眼睁开,原来是黄粱一梦。 人生是粒种,落地就会生根。同样。一个人的所作所为也是一粒种子,种下善良,收获的也是善良,种下仇恨,收获的也是仇恨。 有关部门正在研究拟将退休年龄提高5岁,提出65岁退休以缓解养老金缺口的压力,其实是昏招。中国人的平均寿命72岁,65岁退休,最多领7年退休金就会告别人世,而想要领取退休后的退休金,必须退休之前先交满15年养老保险。用15年付出换回7年的回报,为什么还填不满养老金的巨大缺口?而机关事业单位的职工无需交纳养老金,退休后的收入是企业职工的两倍到数倍!有关部门还解释说机关事业单位的退休金不占社会养老金的份额,但不要忘了一个大提前是,机关事业单位全靠企业养活,他们的工资和养老金全是纳税人的钱。 在摸着石头过河的改革过程中,国家其实走了许多弯路,但在弥补弯路带来的损失时,往往只损害百姓的利益,同时各个利益集团不但不付出,反而还要变相盘剥百姓并且借机壮大自己的实力。不可否认的是,利益集团在任何时代都会出现,只要有国家存在,就会有利益集团的存在,只是如果能少一些对百姓的压榨,还富于民,才是国家基石稳定的保证。利益集团永远是少数,是船,广大百姓就是承载大船小船的水。水的承载能力有限,利益集团造船的野心无限,久而久之,要么是船翻水枯,要么就是水能载舟亦能覆舟。 普普通通的中国人、踏踏实实的中国人、不畏艰难的中国人、侠肝义胆的中国人! 在岁月的流逝中,许多人离去,许多人到来,人生就是一场聚散不定的盛宴,有开始,必然有结束,不管是喜剧还是悲剧,结束的一刻,总会如期到来。","categories":[{"name":"杂文","slug":"杂文","permalink":"http://vcpu.me/categories/杂文/"}],"tags":[{"name":"经典语录","slug":"经典语录","permalink":"http://vcpu.me/tags/经典语录/"}]},{"title":"ovs snat","slug":"ovs-snat","date":"2017-10-31T10:00:00.000Z","updated":"2017-11-02T10:52:53.000Z","comments":true,"path":"ovs-snat/","link":"","permalink":"http://vcpu.me/ovs-snat/","excerpt":"环境背景","text":"环境背景 脚本配置1234567891011121314151617181920ip netns add ns1ip netns add ns2ovs-vsctl add-br br0ovs-vsctl add-br br-swovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip link set dev tap1 upovs-vsctl add-port br0 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip link set dev tap2 upip netns exec ns1 ip addr add 192.168.1.102/24 dev tap1ip netns exec ns2 ip addr add 192.168.1.1/24 dev tap2ip netns exec ns1 ip link set lo upip netns exec ns2 ip link set lo upip netns exec ns1 ping -c 4 192.168.1.102ip netns exec ns1 ping -c 4 192.168.1.1ovs-vsctl add-port br0 patch-ovs-0 -- set Interface patch-ovs-0 type=patch 
options:peer=patch-sw-1ovs-vsctl add-port br-sw patch-sw-1 -- set Interface patch-sw-1 type=patch options:peer=patch-ovs-0 把外网接口加入到br-sw 1234567ip addr flush dev enp0s3ovs-vsctl add-port br-sw enp0s3ifconfig br-sw 10.30.10.141/24 uproute add default gw 10.30.10.1 配置访问同网段流表10.30.10.175123456789#如果进入trk可以直接查找连接,如果没有执行相应actionovs-ofctl add-flow br0 'table=0,priority=10,ip,ct_state=-trk,action=ct(nat,table=1)' #新建连接,执行ct commit;创建连接,执行nat规则ovs-ofctl add-flow br0 'table=1,in_port=1,ip,ct_state=+trk+new,action=ct(nat(src=10.30.10.141-10.30.10.141:5000-50000),commit),mod_dl_src:08:00:27:99:81:b7,mod_dl_dst:00:0e:c6:cf:da:c7,3'#到外网流量,已经建立好连接,直接发给3口ovs-ofctl add-flow br0 'table=1,in_port=1,ip,ct_state=+trk+est,action=mod_dl_src:08:00:27:99:81:b7,mod_dl_dst:00:0e:c6:cf:da:c7,3'#回来流量,已经建立连接,直接发给1口ovs-ofctl add-flow br0 'table=1,in_port=3,ip,ct_state=+trk+est,action=mod_dl_src:1e:33:f9:73:48:d0,mod_dl_dst:92:25:03:9d:0f:34,1' 配置访问外网114.114.114.11412345678910#如果进入trk可以直接查找连接,如果没有执行相应actionovs-ofctl add-flow br0 'table=0,priority=10,ip,ct_state=-trk,action=ct(nat,table=1)'#新建连接,执行ct commit;创建连接,执行nat规则ovs-ofctl add-flow br0 'table=1,in_port=1,ip,ct_state=+trk+new,action=ct(nat(src=10.30.10.141-10.30.10.141:5000-50000),commit),mod_dl_src:08:00:27:99:81:b7,mod_dl_dst:d4:ee:7:50:78:c2,3'#到外网流量,已经建立好连接,直接发给3口ovs-ofctl add-flow br0 'table=1,in_port=1,ip,ct_state=+trk+est,action=mod_dl_src:08:00:27:99:81:b7,mod_dl_dst:d4:ee:7:50:78:c2,3'#回来流量,已经建立连接,直接发给1口ovs-ofctl add-flow br0 'table=1,in_port=3,ip,ct_state=+trk+est,action=mod_dl_src:1e:33:f9:73:48:d0,mod_dl_dst:92:25:03:9d:0f:34,1' 流表删除1234567ovs-ofctl del-flows br0 'table=0,ip,ct_state=-trk'ovs-ofctl del-flows br0 'table=1,in_port=1,ip,ct_state=+trk+new'ovs-ofctl del-flows br0 'table=1,in_port=1,ip,ct_state=+trk+est'ovs-ofctl del-flows br0 'table=1,in_port=3,ip,ct_state=+trk+est' 按照dragonflow社区想法实现snatsnat社区实现流量原理流量从br0流出后,修改源mac为已设定 80:88:88:88:88:99,目的mac修改为出口网桥的mac 08:00:27:99:81:b7
流量从br0流到br-sw后,br-sw收到上述流量后,查看目的mac为桥的mac,属于三层转发流量,br-sw将上述流量源mac变为br-sw mac ,目的mac变为具体地址或者网关的mac。 snat社区实现要点通过将snat发出流量直接发给出口网桥,通过网桥三层转发给具体的目的地址,这样实现从br0中流表不用关心snat访问的具体目标mac地址(访问的具体目标br-sw会主动学习具体mac地址),只需要关注外网网桥的mac即可。 SNAT 的ARP被动回复流表12345ovs-ofctl add-flow br0 \"priority=100 arp arp_op=1 arp_tpa=10.30.10.111 action=move:OXM_OF_ARP_TPA[]->NXM_NX_REG2[],resubmit(,8),goto_table:10\"ovs-ofctl add-flow br0 \"table=8,reg2=0x0a1e0a6f/0xffffffff,action=load:0x808888888888->OXM_OF_PKT_REG0[]\"ovs-ofctl add-flow br0 \"table=10 priority=10,arp,arp_op=1,action=load:2->OXM_OF_ARP_OP[],move:OXM_OF_ARP_SHA[]->OXM_OF_ARP_THA[],move:OXM_OF_PKT_REG0[0..47]->OXM_OF_ARP_SHA[],move:OXM_OF_ARP_SPA[]->OXM_OF_ARP_TPA[],move:NXM_NX_REG2[]->OXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],move:OXM_OF_PKT_REG0[0..47]->NXM_OF_ETH_SRC[],move:NXM_OF_IN_PORT[]->NXM_NX_REG3[0..15],load:0->NXM_OF_IN_PORT[],output:NXM_NX_REG3[0..15]\" SNAT流表1234ovs-ofctl add-flow br0 \"in_port=1,ip,action=ct(commit,zone=1,nat(src=10.30.10.111-10.30.10.111)),mod_dl_src:80:88:88:88:88:99,mod_dl_dst:08:00:27:99:81:b7,3\"ovs-ofctl add-flow br0 \"in_port=3,ct_state=-trk,ip,action=ct(table=0,zone=1,nat)\"ovs-ofctl add-flow br0 \"in_port=3,ct_state=+trk,ct_zone=1,ip,action=mod_dl_src:80:88:88:88:88:99,mod_dl_dst:92:25:03:9d:0f:34,1\" 生成流表123456789101112131415cookie=0x0, duration=881.078s, table=0, n_packets=850, n_bytes=83300, idle_age=0, ip,in_port=1 actions=ct(commit,zone=1,nat(src=10.30.10.111)),mod_dl_src:80:88:88:88:88:99,mod_dl_dst:08:00:27:99:81:b7,output:3 cookie=0x0, duration=732.960s, table=0, n_packets=1621, n_bytes=421788, idle_age=0, ct_state=-trk,ip,in_port=3 actions=ct(table=0,zone=1,nat) cookie=0x0, duration=710.396s, table=0, n_packets=1584, n_bytes=408966, idle_age=0, ct_state=+trk,ct_zone=1,ip,in_port=3 actions=mod_dl_src:80:88:88:88:88:99,mod_dl_dst:92:25:03:9d:0f:34,output:1 cookie=0x0, duration=2336.028s, table=0, n_packets=18, n_bytes=1026, idle_age=4,
priority=100,arp,arp_tpa=10.30.10.111,arp_op=1 actions=move:NXM_OF_ARP_TPA[]->NXM_NX_REG2[],resubmit(,8),resubmit(,10) cookie=0x0, duration=38932.939s, table=0, n_packets=17775, n_bytes=5974547, idle_age=332, priority=0 actions=NORMAL cookie=0x0, duration=2171.189s, table=8, n_packets=18, n_bytes=1026, idle_age=4, reg2=0xa1e0a6f actions=load:0x808888888899->OXM_OF_PKT_REG0[] cookie=0x0, duration=2110.202s, table=10, n_packets=18, n_bytes=1026, idle_age=4, priority=10,arp,arp_op=1 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:OXM_OF_PKT_REG0[0..47]->NXM_NX_ARP_SHA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],move:NXM_NX_REG2[]->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],move:OXM_OF_PKT_REG0[0..47]->NXM_OF_ETH_SRC[],move:NXM_OF_IN_PORT[]->NXM_NX_REG3[0..15],load:0->NXM_OF_IN_PORT[],output:NXM_NX_REG3[0..15]","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"snat","slug":"snat","permalink":"http://vcpu.me/tags/snat/"}]},{"title":"RYU连接ovs方法举例","slug":"ryu-connect-ovs","date":"2017-10-25T10:00:00.000Z","updated":"2017-11-02T07:04:57.000Z","comments":true,"path":"ryu-connect-ovs/","link":"","permalink":"http://vcpu.me/ryu-connect-ovs/","excerpt":"基本概念介绍执行mn命令后,主机监听6633和6634端口 1234Active Internet connections (only servers)Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program nametcp 0 0 0.0.0.0:6633 0.0.0.0:* LISTEN 23871/controllertcp 0 0 0.0.0.0:6634 0.0.0.0:* LISTEN 1131/ovs-vswitchd","text":"基本概念介绍执行mn命令后,主机监听6633和6634端口 1234Active Internet connections (only servers)Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program nametcp 0 0 0.0.0.0:6633 0.0.0.0:* LISTEN 23871/controllertcp 0 0 0.0.0.0:6634 0.0.0.0:* LISTEN 1131/ovs-vswitchd 
命令执行记录如下:1234567891011121314151617root@controller-VirtualBox:~# mn*** Creating network*** Adding controller*** Adding hosts:h1 h2*** Adding switches:s1*** Adding links:(h1, s1) (h2, s1)*** Configuring hostsh1 h2*** Starting controllerc0*** Starting 1 switchess1 ...*** Starting CLI:mininet> 端口6633 是ovs连接的控制器端口,且该端口是mn自己启动的端口6634是被动连接端口 12345678910111213Bridge "s1" Controller "tcp:127.0.0.1:6633" is_connected: true Controller "ptcp:6634" fail_mode: secure Port "s1" Interface "s1" type: internal Port "s1-eth1" Interface "s1-eth1" Port "s1-eth2" Interface "s1-eth2" ovs_version: "2.5.2" 端口6634是ovs-vswitchd监听端口是被动连接管理的端口所谓的被动连接端口是指,控制器可以主动连接此端口,控制此ovs,下面就是设置被动端口的 方式 ovs-vsctl set-controller s1 ptcp:8888 特别说明:相对应被动端口,主动端口(ovs主动连接并和控制器通信的端口)设置方式如下: ovs-vsctl set-controller s1 tcp:127.0.0.1:8887 主动连接本地控制器127.0.0.1:8887端口 主动连接也可以设置多个,设置方法如下通过通信数据包得知,该ovs会和本地控制器6636和6637通信,如果有packet in事件时候会将该事件分别发一份给这两个控制器,至于两个控制器间的关系协作暂无研究。 ovs-vsctl set-controller s1 tcp:127.0.0.1:6636 tcp:127.0.0.1:6637 123456789101112Bridge "s1" Controller "tcp:127.0.0.1:6637" Controller "tcp:127.0.0.1:6636" fail_mode: secure Port "s1-eth1" Interface "s1-eth1" Port "s1-eth2" Interface "s1-eth2" Port "s1" Interface "s1" type: internal ovs_version: "2.5.2" RYU控制器连接ovs步骤 设置ovs s1 主动连接6666端口 ovs-vsctl set-controller s1 tcp:127.0.0.1:6666 设置RYU启动服务端口6666 root@controller-VirtualBox:/usr/local/lib/python2.7/dist-packages/ryu/app# ryu-manager --ofp-tcp-listen-port 6666 --verbose simple_switch_13.py 1tcp 0 0 0.0.0.0:6666 0.0.0.0:* LISTEN 24665/python
通过上述设置ovs连接本机127.0.0.1:6666端口走openflow协议连接控制器simple_switch_13.py","categories":[{"name":"RYU","slug":"RYU","permalink":"http://vcpu.me/categories/RYU/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"实验性质","slug":"实验性质","permalink":"http://vcpu.me/tags/实验性质/"},{"name":"RYU","slug":"RYU","permalink":"http://vcpu.me/tags/RYU/"}]},{"title":"mininet+RYU环境安装","slug":"mininet1","date":"2017-10-20T10:00:00.000Z","updated":"2017-11-02T07:03:52.000Z","comments":true,"path":"mininet1/","link":"","permalink":"http://vcpu.me/mininet1/","excerpt":"安装mininet apt-get install mininet","text":"安装mininet apt-get install mininet 运行mininet 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475root@controller-VirtualBox:~# mn*** No default OpenFlow controller found for default switch!*** Falling back to OVS Bridge*** Creating network*** Adding controller*** Adding hosts:h1 h2*** Adding switches:s1*** Adding links:(h1, s1) (h2, s1)*** Configuring hostsh1 h2*** Starting controller*** Starting 1 switchess1 ...*** Starting CLI:mininet> nodesavailable nodes are:h1 h2 s1mininet> neth1 h1-eth0:s1-eth1h2 h2-eth0:s1-eth2s1 lo: s1-eth1:h1-eth0 s1-eth2:h2-eth0mininet> dump<Host h1: h1-eth0:10.0.0.1 pid=1961><Host h2: h2-eth0:10.0.0.2 pid=1963><OVSBridge s1: lo:127.0.0.1,s1-eth1:None,s1-eth2:None pid=1968>mininet> h1 ping h2PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=1.81 ms64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.072 ms^C--- 10.0.0.2 ping statistics ---2 packets transmitted, 2 received, 0% packet loss, time 1002msrtt min/avg/max/mdev = 0.072/0.942/1.812/0.870 msmininet> 下载安装RYU 1234git clone git://github.com/osrg/ryu.gitcd ryusudo pip install -r tools/pip-requiressudo python setup.py install 
开启默认mininet环境,启动controller控制ip和端口mn --controller remote,ip=127.0.0.1,port=6633 执行完上述命令后,在一个窗口执行ovs-vsctl show 命令,查看配置情况 123456789101112131415root@controller-VirtualBox:~/ryu# ovs-vsctl show73065536-d00b-440c-a02f-b8d33ef05ee1 Bridge \"s1\" Controller \"ptcp:6634\" Controller \"tcp:127.0.0.1:6633\" fail_mode: secure Port \"s1\" Interface \"s1\" type: internal Port \"s1-eth1\" Interface \"s1-eth1\" Port \"s1-eth2\" Interface \"s1-eth2\" ovs_version: \"2.5.2\"root@controller-VirtualBox:~/ryu# 如上显示,controller控制ip:port信息如上,默认mininet环境启动了两个客户端分别通过s1-eth1和s1-eth2接入到ovs s1上,二层相连 未执行任何ping动作时候查看默认流表root@controller-VirtualBox:~/ryu# ovs-ofctl dump-flows s1 -O openflow13 OFPST_FLOW reply (OF1.3) (xid=0x2): 开启simple_switch_13.py RYU控制器demo123456789101112131415161718192021222324252627282930313233343536373839root@controller-VirtualBox:/usr/local/lib/python2.7/dist-packages/ryu/app# ryu-manager --verbose simple_switch_13.py loading app simple_switch_13.pyloading app ryu.controller.ofp_handlerinstantiating app simple_switch_13.py of SimpleSwitch13instantiating app ryu.controller.ofp_handler of OFPHandlerBRICK SimpleSwitch13 CONSUMES EventOFPPacketIn CONSUMES EventOFPSwitchFeaturesBRICK ofp_event PROVIDES EventOFPPacketIn TO {'SimpleSwitch13': set(['main'])} PROVIDES EventOFPSwitchFeatures TO {'SimpleSwitch13': set(['config'])} CONSUMES EventOFPPortStatus CONSUMES EventOFPSwitchFeatures CONSUMES EventOFPHello CONSUMES EventOFPErrorMsg CONSUMES EventOFPEchoRequest CONSUMES EventOFPPortDescStatsReply CONSUMES EventOFPEchoReplyconnected socket:<eventlet.greenio.base.GreenSocket object at 0x7fcef67e5cd0> address:('127.0.0.1', 53604)hello ev <ryu.controller.ofp_event.EventOFPHello object at 0x7fcef67e56d0>move onto config modeEVENT ofp_event->SimpleSwitch13 EventOFPSwitchFeaturesswitch features ev version=0x4,msg_type=0x6,msg_len=0x20,xid=0xc20757d2,OFPSwitchFeatures(auxiliary_id=0,capabilities=79,datapath_id=1,n_buffers=256,n_tables=254)move onto main modeEVENT
ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 82:06:4f:42:09:6c 33:33:00:00:00:02 1EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 82:06:4f:42:09:6c ff:ff:ff:ff:ff:ff 1EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 76:db:0f:62:67:5e 82:06:4f:42:09:6c 2EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 82:06:4f:42:09:6c 76:db:0f:62:67:5e 1EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 76:db:0f:62:67:5e 33:33:00:00:00:02 2EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 82:06:4f:42:09:6c 33:33:00:00:00:02 1EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 76:db:0f:62:67:5e 33:33:00:00:00:02 2 开启完simple_switch_13.py 后查看流表项,经过s1所有的流量都会通过127.0.0.1:6633发给控制器,交给simple_switch_13.py 处理root@controller-VirtualBox:~/ryu# ovs-ofctl dump-flows s1 -O openflow13 OFPST_FLOW reply (OF1.3) (xid=0x2): cookie=0x0, duration=2.782s, table=0, n_packets=0, n_bytes=0, priority=0 actions=CONTROLLER:65535 h1 ping h2,流表新增两条12345678mininet> h1 ping h2PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=6.59 ms64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.228 ms^C--- 10.0.0.2 ping statistics ---2 packets transmitted, 2 received, 0% packet loss, time 1002msrtt min/avg/max/mdev = 0.228/3.412/6.597/3.185 ms 12345root@controller-VirtualBox:~/ryu# ovs-ofctl dump-flows s1 -O openflow13OFPST_FLOW reply (OF1.3) (xid=0x2): cookie=0x0, duration=12.970s, table=0, n_packets=4, n_bytes=280, priority=1,in_port=2,dl_dst=0a:30:ba:ff:9b:80 actions=output:1 cookie=0x0, duration=12.967s, table=0, n_packets=3, n_bytes=238, priority=1,in_port=1,dl_dst=ee:78:58:36:2b:06 actions=output:2 cookie=0x0, duration=224.026s, table=0, n_packets=5, n_bytes=322, priority=0 actions=CONTROLLER:65535 123456789101112packet in 1 ee:78:58:36:2b:06 33:33:00:00:00:02 2EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 0a:30:ba:ff:9b:80 ff:ff:ff:ff:ff:ff 1EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket 
in 1 ee:78:58:36:2b:06 0a:30:ba:ff:9b:80 2EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 0a:30:ba:ff:9b:80 ee:78:58:36:2b:06 1EVENT ofp_event->SimpleSwitch13 EventOFPPacketInpacket in 1 0a:30:ba:ff:9b:80 33:33:00:00:00:02 1 通过rest api查找相关信息1进入:/usr/local/lib/python2.7/dist-packages/ryu/app ryu-manager --verbose simple_switch_13.py ofctl_rest.py rest_topology.py rest-api查看ovs流表","categories":[{"name":"RYU","slug":"RYU","permalink":"http://vcpu.me/categories/RYU/"}],"tags":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/tags/openstack/"},{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"SDN控制器","slug":"SDN控制器","permalink":"http://vcpu.me/tags/SDN控制器/"}]},{"title":"mininet界面配置拓扑方法","slug":"mininet-jiemian","date":"2017-10-20T10:00:00.000Z","updated":"2017-11-02T07:04:15.000Z","comments":true,"path":"mininet-jiemian/","link":"","permalink":"http://vcpu.me/mininet-jiemian/","excerpt":"在ubuntu系统中执行如下命令 /usr/lib/python2.7/dist-packages/mininet/examples/miniedit.py","text":"在ubuntu系统中执行如下命令 /usr/lib/python2.7/dist-packages/mininet/examples/miniedit.py 遇到问题: 在设置控制器时候,点击run时候,miniedit退出并报错如下c0 cannot find required executable controller 123apt-get install openvswitch-testcontroller cp /usr/bin/ovs-testcontroller /usr/bin/ovs-controllerln /usr/bin/ovs-controller /usr/bin/controller 环境搭建过程 步骤1:root模式执行python /usr/lib/python2.7/dist-packages/mininet/examples/miniedit.py 步骤2:设置h1和h2、h3的IP地址分别为10.10.10.1、10.10.10.2、10.10.10.3 步骤3:点击run运行步骤4:保存该脚本为1.py,下次直接运行 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960root@controller-VirtualBox:~# cat 1.py#!/usr/bin/pythonfrom mininet.net import Mininetfrom mininet.node import Controller, RemoteController, OVSControllerfrom mininet.node import CPULimitedHost, Host, Nodefrom mininet.node import OVSKernelSwitch, UserSwitchfrom mininet.node import IVSSwitchfrom mininet.cli import CLIfrom mininet.log
import setLogLevel, infofrom mininet.link import TCLink, Intffrom subprocess import calldef myNetwork(): net = Mininet( topo=None, build=False, ipBase='10.0.0.0/8') info( '*** Adding controller\\n' ) c0=net.addController(name='c0', controller=Controller, protocol='tcp', port=6633) info( '*** Add switches\\n') s1 = net.addSwitch('s1', cls=OVSKernelSwitch, listenPort=6633) s2 = net.addSwitch('s2', cls=OVSKernelSwitch, listenPort=6633) info( '*** Add hosts\\n') h3 = net.addHost('h3', cls=Host, ip='10.10.10.3', defaultRoute=None) h1 = net.addHost('h1', cls=Host, ip='10.10.10.1', defaultRoute=None) h2 = net.addHost('h2', cls=Host, ip='10.10.10.2', defaultRoute=None) info( '*** Add links\\n') net.addLink(h1, s1) net.addLink(s1, h2) net.addLink(s2, h3) net.addLink(s1, s2) info( '*** Starting network\\n') net.build() info( '*** Starting controllers\\n') for controller in net.controllers: controller.start() info( '*** Starting switches\\n') net.get('s1').start([c0]) net.get('s2').start([c0]) info( '*** Post configure switches and hosts\\n') s1.cmd('ifconfig s1 127.0.0.1') s2.cmd('ifconfig s2 127.0.0.1') CLI(net) net.stop()if __name__ == '__main__': setLogLevel( 'info' ) myNetwork() 步骤5: 直接运行python 1.py恢复环境","categories":[{"name":"RYU","slug":"RYU","permalink":"http://vcpu.me/categories/RYU/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"mininet界面配置","slug":"mininet界面配置","permalink":"http://vcpu.me/tags/mininet界面配置/"},{"name":"实验性质","slug":"实验性质","permalink":"http://vcpu.me/tags/实验性质/"}]},{"title":"openstack 网络环境模拟","slug":"openstackmoni","date":"2017-10-15T10:00:00.000Z","updated":"2017-11-02T07:03:17.000Z","comments":true,"path":"openstackmoni/","link":"","permalink":"http://vcpu.me/openstackmoni/","excerpt":"背景环境","text":"背景环境环境介绍: 上述环境是模拟openstack网络环境,分为外网区域、计算节点区域、网络节点区域;外网区域主要模拟外网的网关和路由,计算节点区域模拟openstack计算节点物理机器,网络节点区域模拟openstack网络节点物理机器 
计算节点区域和外网交换机,通过vlan206、vlan1000、vlan2000连接到外网交换机,其中vlan206是外网浮动ip网段的ip,vlan1000为私网10.10.10.0/24,vlan2000为私网192.168.16.0/24 网络节点区域和外网交换机,br-int通过vlan1000,vlan2000连接外网交换机连接,br-public通过vlan206连接外网交换机 外网区域是计算节点和网络节点网关所在地,同时该区域连接外网接口enp0s8,计算节点和网络节点通过外围区域上外网 ubuntu操作系统区域,是ubuntu系统安装的系统的机器,上述环境是在一个机器中模拟所有节点情况。 环境搭建脚本123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596ovs-vsctl add-br br-int-1ovs-vsctl add-br br-int-2ovs-vsctl add-br swovs-vsctl add-br br-publicip netns add vm1ip netns add vm2ip netns add dhcp1ip netns add dhcp2ip netns add routerip netns add gwovs-vsctl add-port br-int-1 vm1-eth0 tag=1000 -- set Interface vm1-eth0 type=internalip link set vm1-eth0 netns vm1ip netns exec vm1 ifconfig vm1-eth0 10.10.10.10/24 upip netns exec vm1 route add default gw 10.10.10.1ip netns exec vm1 ifconfig lo upovs-vsctl add-port br-int-1 vm2-eth0 tag=2000 -- set Interface vm2-eth0 type=internalip link set vm2-eth0 netns vm2ip netns exec vm2 ifconfig vm2-eth0 192.168.16.10/24 upip netns exec vm2 route add default gw 192.168.16.1ip netns exec vm2 ifconfig lo upovs-vsctl add-port br-int-2 dhcp1-tap tag=1000 -- set Interface dhcp1-tap type=internalip link set dhcp1-tap netns dhcp1ip netns exec dhcp1 ifconfig dhcp1-tap 10.10.10.2/24 upip netns exec dhcp1 ifconfig lo upovs-vsctl add-port br-int-2 dhcp2-tap tag=2000 -- set Interface dhcp2-tap type=internalip link set dhcp2-tap netns dhcp2ip netns exec dhcp2 ifconfig dhcp2-tap 192.168.16.2/24 upip netns exec dhcp2 ifconfig lo upovs-vsctl add-port br-int-2 router-tap1 tag=1000 -- set Interface router-tap1 type=internalip link set router-tap1 netns routerip netns exec router ifconfig router-tap1 10.10.10.1/24 upip netns exec router ifconfig lo upovs-vsctl add-port br-int-2 router-tap2 tag=2000 -- set Interface router-tap2 type=internalip link set router-tap2 netns routerip netns exec router ifconfig
router-tap2 192.168.16.1/24 upip netns exec router ifconfig lo upovs-vsctl add-port br-int-1 trunk_int1 trunks=1000,2000,206 -- set Interface trunk_int1 type=patch options:peer=trunk_sw1ovs-vsctl add-port sw trunk_sw1 trunks=1000,2000,206 -- set Interface trunk_sw1 type=patch options:peer=trunk_int1ovs-vsctl add-port br-int-2 trunk_int2 trunks=1000,2000 -- set Interface trunk_int2 type=patch options:peer=trunk_sw2ovs-vsctl add-port sw trunk_sw2 trunks=1000,2000 -- set Interface trunk_sw2 type=patch options:peer=trunk_int2ovs-vsctl add-port br-public router-qg tag=206 -- set Interface router-qg type=internalip link set router-qg netns routerip netns exec router ifconfig router-qg 192.168.206.11/24 upip netns exec router ifconfig lo upovs-vsctl add-port sw provider-gw tag=206 -- set Interface provider-gw type=internalip link set provider-gw netns gwip netns exec gw ifconfig provider-gw 192.168.206.254/24 upip netns exec gw ifconfig lo upovs-vsctl add-port br-public trunk_pub trunks=206 -- set Interface trunk_pub type=patch options:peer=trunk_sw3ovs-vsctl add-port sw trunk_sw3 trunks=206 -- set Interface trunk_sw3 type=patch options:peer=trunk_pubifconfig br-public 192.168.206.10/24 upip netns exec router route add default gw 192.168.206.254ovs-vsctl add-br br-exovs-vsctl add-port br-ex ex-gw -- set Interface ex-gw type=internalip link set ex-gw netns gwip netns exec gw ifconfig ex-gw 10.30.10.100/24 upip netns exec gw ifconfig lo upip netns exec gw route add default gw 10.30.10.1ip netns exec gw iptables -t nat -A POSTROUTING -s 192.168.206.0/24 -o ex-gw -j MASQUERADEovs-vsctl add-port br-ex enp0s8ip addr flush dev enp0s8ifconfig br-ex 10.30.10.145/24route add default gw 10.30.10.1ip netns exec router sysctl net.ipv4.ip_forward=1ip netns exec gw sysctl net.ipv4.ip_forward=1ip netns exec router iptables -t nat -A POSTROUTING -s 10.10.10.0/24 -o router-qg -j MASQUERADEip netns exec router iptables -t nat -A POSTROUTING -s 192.168.16.0/24 -o router-qg -j MASQUERADE 
环境搭建完成后查看ovs123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990root@controller-VirtualBox:~# ovs-vsctl show73065536-d00b-440c-a02f-b8d33ef05ee1 Bridge "br-int-1" Port "br-int-1" Interface "br-int-1" type: internal Port "vm2-eth0" tag: 2000 Interface "vm2-eth0" type: internal Port "trunk_int1" trunks: [206, 1000, 2000] Interface "trunk_int1" type: patch options: {peer="trunk_sw1"} Port "vm1-eth0" tag: 1000 Interface "vm1-eth0" type: internal Bridge br-public Port trunk_pub trunks: [206] Interface trunk_pub type: patch options: {peer="trunk_sw3"} Port router-qg tag: 206 Interface router-qg type: internal Port br-public Interface br-public type: internal Bridge sw Port "trunk_sw3" trunks: [206] Interface "trunk_sw3" type: patch options: {peer=trunk_pub} Port provider-gw tag: 206 Interface provider-gw type: internal Port "trunk_sw2" trunks: [1000, 2000] Interface "trunk_sw2" type: patch options: {peer="trunk_int2"} Port "trunk_sw1" trunks: [206, 1000, 2000] Interface "trunk_sw1" type: patch options: {peer="trunk_int1"} Port sw Interface sw type: internal Bridge "br-int-2" Port "br-int-2" Interface "br-int-2" type: internal Port "dhcp1-tap" tag: 1000 Interface "dhcp1-tap" type: internal Port "router-tap1" tag: 1000 Interface "router-tap1" type: internal Port "router-tap2" tag: 2000 Interface "router-tap2" type: internal Port "trunk_int2" trunks: [1000, 2000] Interface "trunk_int2" type: patch options: {peer="trunk_sw2"} Port "dhcp2-tap" tag: 2000 Interface "dhcp2-tap" type: internal Bridge br-ex Port "enp0s8" Interface "enp0s8" Port ex-gw Interface ex-gw type: internal Port br-ex Interface br-ex type: internal ovs_version: 
"2.5.2"","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/tags/openstack/"},{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"实验性质","slug":"实验性质","permalink":"http://vcpu.me/tags/实验性质/"}]},{"title":"ovs patch port 连接方法","slug":"ovspatch","date":"2017-10-11T10:00:00.000Z","updated":"2017-10-11T03:21:29.000Z","comments":true,"path":"ovspatch/","link":"","permalink":"http://vcpu.me/ovspatch/","excerpt":"创建patch port连接ovs两个桥patch port用于连接openvswitch桥 实验环境12345678910111213141516 patch-ovs-1 patch-ovs-2+-----------------+ +----------------------+| | | || br1 +----------+ br2 || | | |+-----------------+ +----------------------+ | | | | +-------------+ +--------------+ | | | ns2 | | ns1 | | | | | | 1.1.1.2/24 | | 1.1.1.1/24 | | tap2 | | tap1 | | | +-------------+ +--------------+","text":"创建patch port连接ovs两个桥patch port用于连接openvswitch桥 实验环境12345678910111213141516 patch-ovs-1 patch-ovs-2+-----------------+ +----------------------+| | | || br1 +----------+ br2 || | | |+-----------------+ +----------------------+ | | | | +-------------+ +--------------+ | | | ns2 | | ns1 | | | | | | 1.1.1.2/24 | | 1.1.1.1/24 | | tap2 | | tap1 | | | +-------------+ +--------------+ 创建脚本1234567891011121314151617181920212223ovs-vsctl add-br br1ovs-vsctl add-br br2ovs-vsctl add-port br1 tap1 -- set Interface tap1 type=internalovs-vsctl add-port br2 tap2 -- set Interface tap2 type=internalip netns add ns1ip netns add ns2ip link set tap1 netns ns1ip link set tap2 netns ns2ip netns exec ns1 ip addr add 1.1.1.1/24 dev tap1 ip netns exec ns1 ip link set tap1 up ip netns exec ns1 ip link set lo upip netns exec ns2 ip addr add 1.1.1.2/24 dev tap2 ip netns exec ns2 ip link set tap2 up ip netns exec ns2 ip link set lo up ovs-vsctl add-port br1 patch-ovs-1 -- set Interface patch-ovs-1 type=patch 
options:peer=patch-ovs-2ovs-vsctl add-port br2 patch-ovs-2 -- set Interface patch-ovs-2 type=patch options:peer=patch-ovs-1 测试1ip netns exec ns1 ping 1.1.1.2 实验配置查看1234567891011121314151617181920212223242526root@compute:~# ovs-vsctl show0f92aaa0-5385-48f5-a54a-644e05652f59 Manager "ptcp:6640:127.0.0.1" is_connected: true Bridge "br2" Port "tap2" Interface "tap2" type: internal Port "patch-ovs-2" Interface "patch-ovs-2" type: patch options: {peer="patch-ovs-1"} Port "br2" Interface "br2" type: internal Bridge "br1" Port "br1" Interface "br1" type: internal Port "tap1" Interface "tap1" type: internal Port "patch-ovs-1" Interface "patch-ovs-1" type: patch options: {peer="patch-ovs-2"} 实验结果ns1 可以ping 通 ns2 ip 重点解析两个ovs bridge间连接分别在两个桥上执行下面命令使用patch port互联123ovs-vsctl add-port br1 patch-ovs-1 -- set Interface patch-ovs-1 type=patch options:peer=patch-ovs-2ovs-vsctl add-port br2 patch-ovs-2 -- set Interface patch-ovs-2 type=patch options:peer=patch-ovs-1","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"patch port","slug":"patch-port","permalink":"http://vcpu.me/tags/patch-port/"}]},{"title":"linux arp机制解析","slug":"linuxarp","date":"2017-10-01T10:00:00.000Z","updated":"2017-10-30T10:01:40.000Z","comments":true,"path":"linuxarp/","link":"","permalink":"http://vcpu.me/linuxarp/","excerpt":"一.问题1.使用keepalived进行热备份的系统需要一个虚拟的IP地址,然而该虚拟IP地址到底属于哪台机器是根据热备群的主备来决定的,因此主机器在获得该虚拟IP的时候,必须要广播一个免费的arp,起初人们认为这没有必要,理由是不这么做,热备群也工作的很好,然而事实证明,这是必须的;2.ARP缓存表项都有一个老化时间,然而在linux系统中却没有给出具体如何来设置这个老化时间。那么到底怎么设置这个老化时间呢?","text":"一.问题1.使用keepalived进行热备份的系统需要一个虚拟的IP地址,然而该虚拟IP地址到底属于哪台机器是根据热备群的主备来决定的,因此主机器在获得该虚拟IP的时候,必须要广播一个免费的arp,起初人们认为这没有必要,理由是不这么做,热备群也工作的很好,然而事实证明,这是必须的;2.ARP缓存表项都有一个老化时间,然而在linux系统中却没有给出具体如何来设置这个老化时间。那么到底怎么设置这个老化时间呢? 
二.解答问题前的说明ARP协议的规范只是阐述了地址解析的细节,然而并没有规定协议栈的实现如何去维护ARP缓存。ARP缓存需要有一个到期时间,这是必要的,因为ARP缓存并不维护映射的状态,也不进行认证,因此协议本身不能保证这种映射永远都是正确的,它只能保证该映射在得到arp应答之后的一定时间内是有效的。这也给了ARP欺骗以可乘之机,不过本文不讨论这种欺骗。 像Cisco或者基于VRP的华为设备都有明确的配置来配置arp缓存的到期时间,然而Linux系统中却没有这样的配置,起码可以说没有这样的直接配置。Linux用户都知道如果需要配置什么系统行为,那么使用sysctl工具配置procfs下的sys接口是一个方法,然而当我们google了好久,终于发现关于ARP的配置处在/proc/sys/net/ipv4/neigh/ethX的时候,我们最终又迷茫于该目录下的N多文件,即使去查询Linux内核的Documents也不能清晰的明了这些文件的具体含义。对于Linux这样的成熟系统,一定有办法来配置ARP缓存的到期时间,但是具体到操作上,到底怎么配置呢?这还得从Linux实现的ARP状态机说起。 如果你看过《Understading Linux Networking Internals》并且真的做到深入理解的话,那么本文讲的基本就是废话,但是很多人是没有看过那本书的,因此本文的内容还是有一定价值的。 Linux协议栈实现为ARP缓存维护了一个状态机,在理解具体的行为之前,先看一下下面的图(该图基于《Understading Linux Networking Internals》里面的图26-13修改,在第二十六章):在上图中,我们看到只有arp缓存项的reachable状态对于外发包是可用的,对于stale状态的arp缓存项而言,它实际上是不可用的。如果此时有人要发包,那么需要进行重新解析,对于常规的理解,重新解析意味着要重新发送arp请求,然后事实上却不一定这样,因为Linux为arp增加了一个“事件点”来“不用发送arp请求”而对arp协议生成的缓存维护的优化措施,事实上,这种措施十分有效。这就是arp的“确认”机制,也就是说,如果说从一个邻居主动发来一个数据包到本机,那么就可以确认该包的“上一跳”这个邻居是有效的,然而为何只有到达本机的包才能确认“上一跳”这个邻居的有效性呢?因为Linux并不想为IP层的处理增加负担,也即不想改变IP层的原始语义。 Linux维护一个stale状态其实就是为了保留一个neighbour结构体,在其状态改变时只是个别字段得到修改或者填充。如果按照简单的实现,只保存一个reachable状态即可,其到期则删除arp缓存表项。Linux的做法只是做了很多的优化,但是如果你为这些优化而绞尽脑汁,那就悲剧了… 三.Linux如何来维护这个stale状态在Linux实现的ARP状态机中,最复杂的就是stale状态了,在此状态中的arp缓存表项面临着生死抉择,抉择者就是本地发出的包,如果本地发出的包使用了这个stale状态的arp缓存表项,那么就将状态机推进到delay状态,如果在“垃圾收集”定时器到期后还没有人使用该邻居,那么就有可能删除这个表项了,到底删除吗?这样看看有木有其它路径使用它,关键是看路由缓存,路由缓存虽然是一个第三层的概念,然而却保留了该路由的下一条的ARP缓存表项,这个意义上,Linux的路由缓存实则一个转发表而不是一个路由表。 如果有外发包使用了这个表项,那么该表项的ARP状态机将进入delay状态,在delay状态中,只要有“本地”确认的到来(本地接收包的上一跳来自该邻居),linux还是不会发送ARP请求的,但是如果一直都没有本地确认,那么Linux就将发送真正的ARP请求了,进入probe状态。因此可以看到,从stale状态开始,所有的状态只是为一种优化措施而存在的,stale状态的ARP缓存表项就是一个缓存的缓存,如果Linux只是将过期的reachable状态的arp缓存表项删除,语义是一样的,但是实现看起来以及理解起来会简单得多! 
再次强调,reachable过期进入stale状态而不是直接删除,是为了保留neighbour结构体,优化内存以及CPU利用,实际上进入stale状态的arp缓存表项是不可用的,要想使其可用,要么在delay状态定时器到期前本地给予了确认,比如tcp收到了一个包,要么delay状态到期进入probe状态后arp请求得到了回应。否则还是会被删除。 四.Linux的ARP缓存实现要点在blog中分析源码是儿时的记忆了,现在不再浪费版面了。只要知道Linux在实现arp时维护的几个定时器的要点即可。1.Reachable状态定时器每当有arp回应到达或者其它能证明该ARP表项表示的邻居真的可达时,启动该定时器。到期时根据配置的时间将对应的ARP缓存表项转换到下一个状态。2.垃圾回收定时器定时启动该定时器,具体下一次什么时候到期,是根据配置的base_reachable_time来决定的,具体见下面的代码: 123456789101112131415161718192021222324static void neigh_periodic_timer(unsigned long arg){ ... if (time_after(now, tbl->last_rand + 300 * HZ)) { //内核每5分钟重新进行一次配置 struct neigh_parms *p; tbl->last_rand = now; for (p = &tbl->parms; p; p = p->next) p->reachable_time = neigh_rand_reach_time(p->base_reachable_time); } ... /* Cycle through all hash buckets every base_reachable_time/2 ticks. * ARP entry timeouts range from 1/2 base_reachable_time to 3/2 * base_reachable_time. */ expire = tbl->parms.base_reachable_time >> 1; expire /= (tbl->hash_mask + 1); if (!expire) expire = 1; //下次何时到期完全基于base_reachable_time mod_timer(&tbl->gc_timer, now + expire); ...} 一旦这个定时器到期,将执行neigh_periodic_timer回调函数,里面有以下的逻辑,也即上面的…省略的部分:12345678if (atomic_read(&n->refcnt) == 1 && //n->used可能会因为“本地确认”机制而向前推进 (state == NUD_FAILED || time_after(now, n->used + n->parms->gc_staletime))) { *np = n->next; n->dead = 1; write_unlock(&n->lock); neigh_release(n); continue;} 如果在实验中,你的处于stale状态的表项没有被及时删除,那么试着执行一下下面的命令: 1ip route flush cache 然后再看看ip neigh ls all的结果,注意,不要指望马上会被删除,因为此时垃圾回收定时器还没有到期呢…但是我敢保证,不长的时间之后,该缓存表项将被删除。 五.第一个问题的解决在启用keepalived进行基于vrrp热备份的群组上,很多同学认为根本不需要在进入master状态时重新绑定自己的MAC地址和虚拟IP地址,然而这是根本错误的,如果说没有出现什么问题,那也是侥幸,因为各个路由器上默认配置的arp超时时间一般很短,然而我们不能依赖这种配置。请看下面的图示:如果发生了切换,假设路由器上的arp缓存超时时间为1小时,那么在将近一小时内,单向数据将无法通信(假设群组中的主机不会发送数据通过路由器,排除“本地确认”,毕竟我不知道路由器是不是在运行Linux),路由器上的数据将持续不断地发往原来的master,然而原始的master已经不再持有虚拟IP地址。 因此,为了使得数据行为不再依赖路由器的配置,必须在vrrp协议下切换到master时手动绑定虚拟IP地址和自己的MAC地址,在Linux上使用方便的arping则是: 1arping -i ethX -S 1.1.1.1 -B -c 1 
这样一来,获得1.1.1.1这个IP地址的master主机将IP地址为255.255.255.255的ARP请求广播到全网,假设路由器运行Linux,则路由器接收到该ARP请求后将根据来源IP地址更新其本地的ARP缓存表项(如果有的话),然而问题是,该表项更新的结果状态却是stale,这只是ARP的规定,具体在代码中体现是这样的,在arp_process函数的最后:123if (arp->ar_op != htons(ARPOP_REPLY) || skb->pkt_type != PACKET_HOST) state = NUD_STALE;neigh_update(n, sha, state, override ? NEIGH_UPDATE_F_OVERRIDE : 0); 由此可见,只有实际的外发包的下一跳是1.1.1.1时,才会通过“本地确认”机制或者实际发送ARP请求的方式将对应的MAC地址映射reachable状态。 更正:在看了keepalived的源码之后,发现这个担心是多余的,毕竟keepalived已经很成熟了,不应该犯“如此低级的错误”,keepalived在某主机切换到master之后,会主动发送免费arp,在keepalived中有代码如是: 1234567891011121314151617181920vrrp_send_update(vrrp_rt * vrrp, ip_address * ipaddress, int idx){ char *msg; char addr_str[41]; if (!IP_IS6(ipaddress)) { msg = \"gratuitous ARPs\"; inet_ntop(AF_INET, &ipaddress->u.sin.sin_addr, addr_str, 41); send_gratuitous_arp(ipaddress); } else { msg = \"Unsolicited Neighbour Adverts\"; inet_ntop(AF_INET6, &ipaddress->u.sin6_addr, addr_str, 41); ndisc_send_unsolicited_na(ipaddress); } if (0 == idx && debug & 32) { log_message(LOG_INFO, \"VRRP_Instance(%s) Sending %s on %s for %s\", vrrp->iname, msg, IF_NAME(ipaddress->ifp), addr_str); }} 六.第二个问题的解决扯了这么多,在Linux上到底怎么设置ARP缓存的老化时间呢?我们看到/proc/sys/net/ipv4/neigh/ethX目录下面有多个文件,到底哪个是ARP缓存的老化时间呢?实际上,直接点说,就是base_reachable_time这个文件。其它的都只是优化行为的措施。比如gc_stale_time这个文件记录的是“ARP缓存表项的缓存”的存活时间,该时间只是一个缓存的缓存的存活时间,在该时间内,如果需要用到该邻居,那么直接使用表项记录的数据作为ARP请求的内容即可,或者得到“本地确认”后直接将其置为reachable状态,而不用再通过路由查找,ARP查找,ARP邻居创建,ARP邻居解析这种慢速的方式。 默认情况下,reachable状态的超时时间是30秒,超过30秒,ARP缓存表项将改为stale状态,此时,你可以认为该表项已经老化到期了,只是Linux的实现中并没有将其删除罢了,再过了gc_stale_time时间,表项才被删除。在ARP缓存表项成为非reachable之后,垃圾回收器负责执行“再过了gc_stale_time时间,表项才被删除”这件事,这个定时器的下次到期时间是根据base_reachable_time计算出来的,具体就是在neigh_periodic_timer中: 1234567891011121314if (time_after(now, tbl->last_rand + 300 * HZ)) { struct neigh_parms *p; tbl->last_rand = now; for (p = &tbl->parms; p; p = p->next) //随计化很重要,防止“共振行为”引发的ARP解析风暴 p->reachable_time = neigh_rand_reach_time(p->base_reachable_time);}...expire = tbl->parms.base_reachable_time >> 1;expire /= 
(tbl->hash_mask + 1);if (!expire) expire = 1;mod_timer(&tbl->gc_timer, now + expire); 可见一斑!我们可以通过看代码注释来理解这一点,好心人都会写上注释的。为了实验的条理清晰,我们设计以下两个场景:1.使用iptables禁止一切本地接收,从而屏蔽arp本地确认,使用sysctl将base_reachable_time设置为5秒,将gc_stale_time设置为5秒。2.关闭iptables的禁止策略,使用TCP下载外部网络一个超大文件或者进行持续短连接,使用sysctl将base_reachable_time设置为5秒,将gc_stale_time设置为5秒。在两个场景下都使用ping命令来ping本地局域网的默认网关,然后迅速Ctrl-C掉这个ping,用ip neigh show all可以看到默认网关的arp表项,然而在场景1下,大约5秒之内,arp表项将变为stale之后不再改变,再ping的话,表项先变为delay再变为probe,然后为reachable,5秒之内再次成为stale。而在场景2下,arp表项持续为reachable以及delay,这说明了Linux中的ARP状态机。那么为何场景1中,当表项成为stale之后很久都不会被删除呢?其实这是因为还有路由缓存项在使用它,此时你删除路由缓存之后,arp表项很快被删除。 七.总结1.在Linux上如果你想设置你的ARP缓存老化时间,那么执行sysctl -w net.ipv4.neigh.ethX.base_reachable_time=Y即可,如果设置别的,只是影响了性能,在Linux中,ARP缓存老化以其变为stale状态为准,而不是以其表项被删除为准,stale状态只是对缓存又进行了缓存;2.永远记住,在将一个IP地址更换到另一台本网段设备时,尽可能快地广播免费ARP,在Linux上可以使用arping来玩小技巧。 八.linux系统默认的ARP老化时间root@compute:~# cat /proc/sys/net/ipv4/neigh/enp0s3/gc_stale_time 60 root@compute:~# cat /proc/sys/net/ipv4/neigh/enp0s3/base_reachable_time 30 在没有路由缓存和别人使用的情况下,30 s 维持在reachable状态,然后变更为STALE状态,再经过60 s 删除此表项。 STALE状态过期后,主机会主动发单播ARP确认,以决定是否延长表项时间。 九.ARP机制实验验证上述晦涩难懂的ARP机制讲解来自网络,具体博文由谁书写,已经难觅其源,在此引用下作为本博文基础部分,感谢作者奉献;笔者因ARP机制问题困扰了许久,因此下文从实验事实角度探究linux ARP状态机的变化和网络连通性的关系,解惑罢了。 实验背景 两台ubuntu16.04 机器接入交换机二层互联 ubuntu 均安装arptables,用来控制arp流量 ubuntu使用iptables控制ip流量 通过ping和arping产生arp和ip流量 实验命令汇总 ping 10.30.10.141 arping 10.30.10.141 arptables -A INPUT -s 10.30.10.141 --opcode 1 -j DROP arptables -D INPUT -s 10.30.10.141 --opcode 1 -j DROP iptables -I INPUT 1 -s 10.30.10.141 -j DROP iptables -D INPUT -s 10.30.10.141 -j DROP cat /proc/sys/net/ipv4/neigh/enp0s3/base_reachable_time cat /proc/sys/net/ipv4/neigh/enp0s3/gc_stale_time echo 10 > /proc/sys/net/ipv4/neigh/enp0s3/base_reachable_time echo 5 > /proc/sys/net/ipv4/neigh/enp0s3/gc_stale_time watch -d -n 1 ip neigh show 实验情况1主机B无主机A的ARP表项,主机A 一直arping主机B,查看主机B中主机A arp表项的变化情况 主机A arping 主机 B 12345678910root@controller:~# arping 10.30.10.141 -I enp0s8ARPING 10.30.10.141 from 10.30.10.145 
enp0s8Unicast reply from 10.30.10.141 [08:00:27:99:81:B7] 1.048msUnicast reply from 10.30.10.141 [08:00:27:99:81:B7] 1.179msUnicast reply from 10.30.10.141 [08:00:27:99:81:B7] 1.038msUnicast reply from 10.30.10.141 [08:00:27:99:81:B7] 1.113msUnicast reply from 10.30.10.141 [08:00:27:99:81:B7] 1.057msUnicast reply from 10.30.10.141 [08:00:27:99:81:B7] 0.894msUnicast reply from 10.30.10.141 [08:00:27:99:81:B7] 3.004msUnicast reply from 10.30.10.141 [08:00:27:99:81:B7] 0.833ms 主机B 长时间抓取ARP packet,发现主机B一直收到主机A ARP请求,主机B一直没有主动请求主机A 的ARP 12345678910root@compute:~# tcpdump -i enp0s3 arp -ne and host 10.30.10.145tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on enp0s3, link-type EN10MB (Ethernet), capture size 262144 bytes12:12:31.733125 08:00:27:5f:1b:a9 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.30.10.141 (ff:ff:ff:ff:ff:ff) tell 10.30.10.145, length 4612:12:31.733166 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Reply 10.30.10.141 is-at 08:00:27:99:81:b7, length 2812:12:32.734890 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Request who-has 10.30.10.141 (08:00:27:99:81:b7) tell 10.30.10.145, length 4612:12:32.734922 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Reply 10.30.10.141 is-at 08:00:27:99:81:b7, length 2812:12:33.735815 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Request who-has 10.30.10.141 (08:00:27:99:81:b7) tell 10.30.10.145, length 4612:12:33.735838 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Reply 10.30.10.141 is-at 08:00:27:99:81:b7, length 2812:12:34.736497 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Request who-has 10.30.10.141 (08:00:27:99:81:b7) tell 10.30.10.145, length 46 主机B 查询 arp表 arp -n 10.30.10.145 ether 08:00:27:5f:1b:a9 C enp0s3 主机B 查询watch -d -n 1 ip neigh show 10.30.10.145 dev enp0s3 lladdr 
08:00:27:5f:1b:a9 STALE 结论: 主机B新增主机A的arp表项 主机B中ip neigh show显示主机A 新增表项且一直处于STALE状态 主机B在主机A一直请求ARP给主机B情况下,主机B就不会主动向主机A请求ARP了 实验情况2主机A 持续ping 主机B,观察主机B ARP表项变化 主机A ping主机B 1234root@controller:~# ping 10.30.10.141PING 10.30.10.141 (10.30.10.141) 56(84) bytes of data.64 bytes from 10.30.10.141: icmp_seq=1 ttl=64 time=0.297 ms64 bytes from 10.30.10.141: icmp_seq=2 ttl=64 time=0.402 ms 查看主机B配置时间 root@compute:~# cat /proc/sys/net/ipv4/neigh/enp0s3/gc_stale_time 5 root@compute:~# cat /proc/sys/net/ipv4/neigh/enp0s3/base_reachable_time 10 主机B抓取主动请求主机A的ARP,约每15s发一次 1234567891011121314root@compute:~# tcpdump -i enp0s3 arp -ne and host 10.30.10.145tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on enp0s3, link-type EN10MB (Ethernet), capture size 262144 bytes15:03:37.286996 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Request who-has 10.30.10.145 tell 10.30.10.141, length 2815:03:37.287388 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Reply 10.30.10.145 is-at 08:00:27:5f:1b:a9, length 4615:03:53.673456 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Request who-has 10.30.10.145 tell 10.30.10.141, length 2815:03:53.673875 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Reply 10.30.10.145 is-at 08:00:27:5f:1b:a9, length 4615:04:10.055035 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Request who-has 10.30.10.145 tell 10.30.10.141, length 2815:04:10.055524 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Reply 10.30.10.145 is-at 08:00:27:5f:1b:a9, length 4615:04:26.439254 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Request who-has 10.30.10.145 tell 10.30.10.141, length 2815:04:26.439720 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Reply 10.30.10.145 is-at 08:00:27:5f:1b:a9, length 4615:04:42.823208 08:00:27:99:81:b7 > 
08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Request who-has 10.30.10.145 tell 10.30.10.141, length 2815:04:42.823664 08:00:27:5f:1b:a9 > 08:00:27:99:81:b7, ethertype ARP (0x0806), length 60: Reply 10.30.10.145 is-at 08:00:27:5f:1b:a9, length 46 观察现象:主机B 10.30.10.145 neigh表项,从建立reachable 10s后到STALE,然后过约5s后,重新变为reachable状态,且循环往复。STALE变为reachable状态时,主机B都会主动请求主机A ARP 结论推演: 主机A ping 主机B 主机B学习到主机A ARP表项达到reachable状态,约base_reachable_time时间后,变为STALE状态 约gc_stale_time后主机B主动发ARP请求主机A,主机A确认后,表项状态重新回到reachable 实验情况3主机A 持续ping 主机B,且在主机A禁止主机B的主动ARP请求(arptables -A INPUT -s 10.30.10.141 --opcode 1 -j DROP),观察主机B ARP表项变化 主机A现象1 主机A ping主机B,无法收到主机B的回复数据包,主机B 中 ARP状态从FAILED变更到INCOMPLETE 循环往复 主机A现象2 主机A ping 主机B,ping通约10个数据包后又不通,主机B中ARP状态从DELAY - PROBE - FAILED 结论推演: 主机A ping 主机B 不通,因主机A限制主机B ARP请求,导致主机B无法获取到主机A的ARP,主机B无法回复icmp给主机A 主机A 能ping 通主机B,因主机A主动请求主机B 的ARP,使得主机B学习到主机A的ARP,主机B可正常回复主机A数据包,但是主机B收到主机A 主动请求ARP学习到的表项处于DELAY,因DELAY过期后主机B主动请求主机A ARP没有得到回复,因此 主机B中ARP状态从DELAY - PROBE - FAILED ,表项最终失效后,主机A又无法ping通主机B了 实验情况4主机A 持续ping 主机B,且在主机A禁止主机B的主动ARP请求(arptables -A INPUT -s 10.30.10.141 --opcode 1 -j DROP),主机A 持续arping 主机B,观察主机B ARP表项变化以及连通情况 现象描述: 主机A可以一直ping通主机B,没有出现任何丢包 主机B查看主机A 对应ARP表项 arp -n ARP表项一直存在 主机A ping主机B偶尔出现大的延迟 主机B ip neigh show 查看ARP学习状态,主机B学习的主机A表项,一直处于STALE->DELAY->PROBE->FAILED 123456789root@controller:~# ping 10.30.10.141PING 10.30.10.141 (10.30.10.141) 56(84) bytes of data.64 bytes from 10.30.10.141: icmp_seq=510 ttl=64 time=0.304 ms64 bytes from 10.30.10.141: icmp_seq=511 ttl=64 time=61.3 ms64 bytes from 10.30.10.141: icmp_seq=512 ttl=64 time=0.584 ms64 bytes from 10.30.10.141: icmp_seq=513 ttl=64 time=0.330 ms64 bytes from 10.30.10.141: icmp_seq=514 ttl=64 time=0.326 ms64 bytes from 10.30.10.141: icmp_seq=515 ttl=64 time=0.309 ms64 bytes from 10.30.10.141: icmp_seq=516 ttl=64 time=0.317 ms 结论推演: 实验3和实验4差别在于,实验4的主机A会主动arping 主机B,因此主机B可以实时获取到主机A的MAC,因此其通讯是连通的,ping服务不会出现丢包 但是主机B在ARP学习时候STALE->DELAY->PROBE->FAILED状态变化幅度较大,会出现回应主机A ping数据包不及时情况,这是ping延迟的根本原因 额外结论: 
STALE->DELAY->PROBE->FAILED 主机B ARP表项状态从DELAY->PROBE会连续发送3个ARP单播请求,因为主机A不会回应所以状态可能直接刷新成FAILED;为什么说可能刷新? 因为主机A没有arping 主机B或者主动ARP请求主机B时候,主机B被动学习到主机A ARP表项也会导致ARP表项刷新,使主机B记录主机A ARP邻居状态直接从FAILED状态重新回到STALE状态。 附带发现: 主机B什么时候发ARP广播请求,什么时候发ARP单播请求? FAILED -> incomplete 状态时候发的是广播请求 10:51:54.106151 08:00:27:99:81:b7 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.30.10.145 tell 10.30.10.141, length 28 DELAY -> PROBE 状态时候发的是ARP单播请求 10:51:52.258761 08:00:27:99:81:b7 > 08:00:27:5f:1b:a9, ethertype ARP (0x0806), length 42: Request who-has 10.30.10.145 tell 10.30.10.141, length 28 这其实也很好理解,FAILED -> incomplete属于重新开始状态,因此是广播;DELAY -> PROBE 是属于已经知道主机A MAC但是要进行老化判断,因此发送ARP单播确认。 感性的推演将来判别网络问题时使用: 如果主机B中主机A ip neigh show 邻居表处于reachable状态,或者总能及时回到reachable状态,主机B 向主机A的ARP通道是顺畅的(主机B发出的ARP请求主机A可以及时回复) 如果主机B中主机A ip neigh show邻居表从未处于reachable状态,一直在STALE、PROBE、DELAY、FAILED切换,说明该主机B到主机A的ARP通道有问题(主机A无法回应主机B的ARP请求),主机B可以回应主机A的ARP请求触发被动更新 实验情况5主机A 中ARP表项绑定主机B的ARP(arp -s 10.30.10.141 08:00:27:99:81:b7),主机B 中 ARP表项绑定主机A的ARP( arp -s 10.30.10.145 08:00:27:5f:1b:a9),主机A ping 主机B,观察连通情况,并在主机B抓取数据包观察是否有ARP交互 12ip neigh show10.30.10.145 dev enp0s3 lladdr 08:00:27:5f:1b:a9 PERMANENT 12arp -n10.30.10.145 ether 08:00:27:5f:1b:a9 CM enp0s3 现象描述: 主机A可持续ping通主机B,在主机B上没有抓取到主机A和主机B ARP交互数据包 结论推演: 经过在主机B查看主机A ip neigh show得知,其ARP状态不再变化而是固定在PERMANENT状态,所以其不再进行ARP交互就很容易理解了,因为是静态永久地址,因此不需要主动学习。 感性的推演将来判别网络问题时使用: 可以使用此种方法固定住ARP,排除主机学习ARP的问题,把网络问题聚焦在3层转发路由关系上,待3层转发网络问题解决后再行解决此问题。","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"arp机制实验验证","slug":"arp机制实验验证","permalink":"http://vcpu.me/tags/arp机制实验验证/"},{"name":"arp机制","slug":"arp机制","permalink":"http://vcpu.me/tags/arp机制/"}]},{"title":"openstack DVR 
openvswitch流表分析","slug":"4node-flowtable","date":"2017-09-04T10:00:00.000Z","updated":"2017-10-11T03:16:42.000Z","comments":true,"path":"4node-flowtable/","link":"","permalink":"http://vcpu.me/4node-flowtable/","excerpt":"openstack DVR环境组网如上图: 2计算节点 + 1 网络节点 +1控制节点 DVR配置计算节点无floating ip 主要分析计算节点的ovs 流表策略 ubuntun 16.04 + openstack newton版本 计算节点compute的ovs流表在此安装环境下计算节点共有3个ovs bridge(br-ex,br-int,br-tun) br-ex:负责计算节点连接外网使用,需在dvr环境下配置floating ip vm数据流量才走此桥出外网 br-int:负责连接虚拟机所接入的linux bridge、router namespace br-tun: 隧道所有的虚拟交换机,通过此隧道连接网络节点(dhcp、router、snat)、计算节点(虚拟机、router);完成大二层的需求 在对流表分析前,我们先了解网络拓扑,并获取接口和接口索引对应关系;","text":"openstack DVR环境组网如上图: 2计算节点 + 1 网络节点 +1控制节点 DVR配置计算节点无floating ip 主要分析计算节点的ovs 流表策略 ubuntun 16.04 + openstack newton版本 计算节点compute的ovs流表在此安装环境下计算节点共有3个ovs bridge(br-ex,br-int,br-tun) br-ex:负责计算节点连接外网使用,需在dvr环境下配置floating ip vm数据流量才走此桥出外网 br-int:负责连接虚拟机所接入的linux bridge、router namespace br-tun: 隧道所有的虚拟交换机,通过此隧道连接网络节点(dhcp、router、snat)、计算节点(虚拟机、router);完成大二层的需求 在对流表分析前,我们先了解网络拓扑,并获取接口和接口索引对应关系; 查看compute br-int接口和接口索引映射关系 12345678910111213141516171819202122232425262728293031323334353637root@compute:~# ovs-ofctl show br-intOFPT_FEATURES_REPLY (xid=0x2): dpid:00005278ead35849n_tables:254, n_buffers:256capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IPactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 1(patch-tun): addr:8e:c4:54:26:0f:46 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 2(qvo06ec6270-33): addr:0a:7f:a0:b5:1f:b8 config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max 3(qr-6474fe2d-84): addr:00:00:00:00:0e:00 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max 4(qvo2118dc24-92): addr:6e:1b:a5:df:d7:77 config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max 5(int-br-ex): addr:aa:83:f4:ae:aa:41 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 6(qr-d2e6fcc5-c7): addr:00:00:00:00:0e:00 
config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max LOCAL(br-int): addr:52:78:ea:d3:58:49 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps maxOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0root@compute:~# 查看compute br-tun接口和接口索引映射关系 1234567891011121314151617181920212223root@compute:~# ovs-ofctl show br-tunOFPT_FEATURES_REPLY (xid=0x2): dpid:00005222d0a63e4fn_tables:254, n_buffers:256capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IPactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 1(patch-int): addr:ae:2b:47:4b:15:6a config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 2(vxlan-0a1e0aa0): addr:06:2d:8c:b2:d2:64 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 3(vxlan-0a1e0a9a): addr:4e:7d:65:8d:09:09 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(br-tun): addr:52:22:d0:a6:3e:4f config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps maxOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0root@compute:~# 接口对应关系已经标注在openvswitch DVR环境组网图上,见上图 分析 br-int流表分析结果如图: 步骤解析如下:icmp6 协议交给table24处理1234root@compute:~# ovs-ofctl dump-flows br-intNXST_FLOW reply (xid=0x4):cookie=0xbc602bf73a053ff8, duration=2944.223s, table=0, n_packets=0, n_bytes=0, idle_age=2944, priority=10,icmp6,in_port=2,icmp_type=136 actions=resubmit(,24)cookie=0xbc602bf73a053ff8, duration=2938.510s, table=0, n_packets=0, n_bytes=0, idle_age=2938, priority=10,icmp6,in_port=4,icmp_type=136 actions=resubmit(,24) 2和4接口arp交给24表格处理12cookie=0xbc602bf73a053ff8, duration=2944.215s, table=0, n_packets=15, n_bytes=630, idle_age=547, priority=10,arp,in_port=2 actions=resubmit(,24)cookie=0xbc602bf73a053ff8, duration=2938.505s, table=0, n_packets=10, n_bytes=420, idle_age=6, priority=10,arp,in_port=4 actions=resubmit(,24) 5接口流入的数据包全部丢弃1cookie=0xbc602bf73a053ff8, duration=2968.234s, table=0, n_packets=4699, n_bytes=1348102, idle_age=1, priority=2,in_port=5 
actions=drop 2和4口其它协议数据包统统交给25表格处理12cookie=0xbc602bf73a053ff8, duration=2944.232s, table=0, n_packets=927, n_bytes=106878, idle_age=1, priority=9,in_port=2 actions=resubmit(,25)cookie=0xbc602bf73a053ff8, duration=2938.520s, table=0, n_packets=1022, n_bytes=126940, idle_age=1, priority=9,in_port=4 actions=resubmit(,25) 其它数据包像正常交换机数据一样转发数据1cookie=0xbc602bf73a053ff8, duration=2968.576s, table=0, n_packets=2332, n_bytes=240272, idle_age=1, priority=0 actions=NORMAL 23表格默认全部丢弃数据包1cookie=0xbc602bf73a053ff8, duration=2968.580s, table=23, n_packets=0, n_bytes=0, idle_age=2968, priority=0 actions=drop 表24 处理ipv6 nd12cookie=0xbc602bf73a053ff8, duration=2944.226s, table=24, n_packets=0, n_bytes=0, idle_age=2944, priority=2,icmp6,in_port=2,icmp_type=136,nd_target=fe80::f816:3eff:fef3:31f5 actions=NORMALcookie=0xbc602bf73a053ff8, duration=2938.513s, table=24, n_packets=0, n_bytes=0, idle_age=2938, priority=2,icmp6,in_port=4,icmp_type=136,nd_target=fe80::f816:3eff:fe3c:d256 actions=NORMAL 表24 接口2或4 收到 172.16.1.3 1.1.1.11 的arp信息交给表25处理12cookie=0xbc602bf73a053ff8, duration=2944.219s, table=24, n_packets=15, n_bytes=630, idle_age=547, priority=2,arp,in_port=2,arp_spa=172.16.1.3 actions=resubmit(,25)cookie=0xbc602bf73a053ff8, duration=2938.507s, table=24, n_packets=10, n_bytes=420, idle_age=6, priority=2,arp,in_port=4,arp_spa=1.1.1.11 actions=resubmit(,25) 表24 如果上述未处理完成,将剩余数据包丢去1cookie=0xbc602bf73a053ff8, duration=2968.573s, table=24, n_packets=0, n_bytes=0, idle_age=2968, priority=0 actions=drop 表25 接口2或则4 源mac是fa:16:3e:f3:31:f5 fa:16:3e:3c:d2:56 的来的数据包正常处理12cookie=0xbc602bf73a053ff8, duration=2944.242s, table=25, n_packets=926, n_bytes=106092, idle_age=1, priority=2,in_port=2,dl_src=fa:16:3e:f3:31:f5 actions=NORMALcookie=0xbc602bf73a053ff8, duration=2938.524s, table=25, n_packets=1016, n_bytes=125944, idle_age=1, priority=2,in_port=4,dl_src=fa:16:3e:3c:d2:56 actions=NORMAL 分析 br-tun流表分析结果如图: 步骤解析如下:内网接口 交给table2处理123root@compute:~# ovs-ofctl dump-flows br-tunNXST_FLOW reply 
(xid=0x4): cookie=0xb9ef2c735af9d49c, duration=2988.059s, table=0, n_packets=299, n_bytes=30921, idle_age=208, priority=1,in_port=1 actions=resubmit(,2) 外网接口交给table4处理12cookie=0xb9ef2c735af9d49c, duration=2963.688s, table=0, n_packets=154, n_bytes=12924, idle_age=208, priority=1,in_port=2 actions=resubmit(,4)cookie=0xb9ef2c735af9d49c, duration=2963.630s, table=0, n_packets=200, n_bytes=19684, idle_age=210, priority=1,in_port=3 actions=resubmit(,4) 其它:1cookie=0xb9ef2c735af9d49c, duration=2988.034s, table=0, n_packets=0, n_bytes=0, idle_age=2988, priority=0 actions=drop 单播交给表20处理1cookie=0xb9ef2c735af9d49c, duration=2988.027s, table=2, n_packets=209, n_bytes=20555, idle_age=208, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,20) 多播交给表22处理1cookie=0xb9ef2c735af9d49c, duration=2988.009s, table=2, n_packets=90, n_bytes=10366, idle_age=210, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,22) 其它:1cookie=0xb9ef2c735af9d49c, duration=2988.006s, table=3, n_packets=0, n_bytes=0, idle_age=2988, priority=0 actions=drop 从外网过来的数据包,去除tunid添加对应的vlanid后将数据包交给table10处理12cookie=0xb9ef2c735af9d49c, duration=2975.899s, table=4, n_packets=206, n_bytes=19127, idle_age=208, priority=1,tun_id=0x44 actions=mod_vlan_vid:1,resubmit(,10)cookie=0xb9ef2c735af9d49c, duration=2963.083s, table=4, n_packets=148, n_bytes=13481, idle_age=208, priority=1,tun_id=0x5b actions=mod_vlan_vid:2,resubmit(,10) 其它:12cookie=0xb9ef2c735af9d49c, duration=2988.005s, table=4, n_packets=0, n_bytes=0, idle_age=2988, priority=0 actions=dropcookie=0xb9ef2c735af9d49c, duration=2987.998s, table=6, n_packets=0, n_bytes=0, idle_age=2987, priority=0 actions=drop 学习mac等相关信息,将信息放到表20,将数据包从pach-int发出交给内网处理1cookie=0xb9ef2c735af9d49c, duration=2987.994s, table=10, n_packets=354, n_bytes=32608, idle_age=208, priority=1 
actions=learn(table=20,hard_timeout=300,priority=1,cookie=0xb9ef2c735af9d49c,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:1 发往目的mac为fa:16:3e:3e:69:70 tag为1 (ip为172.16.1.9虚拟机数据包)被去除vlan 加上tunid,交给compute1 连接的隧道接口1cookie=0xb9ef2c735af9d49c, duration=2949.737s, table=20, n_packets=66, n_bytes=6977, idle_age=2871, priority=2,dl_vlan=1,dl_dst=fa:16:3e:3e:69:70 actions=strip_vlan,load:0x44->NXM_NX_TUN_ID[],output:3 发给172.16.1.2 dhcp 服务namespace数据包去除vlan后通过隧道交给网络节点1cookie=0xb9ef2c735af9d49c, duration=2937.625s, table=20, n_packets=1, n_bytes=42, idle_age=2886, priority=2,dl_vlan=1,dl_dst=fa:16:3e:cb:c0:9e actions=strip_vlan,load:0x44->NXM_NX_TUN_ID[],output:2 发给1.1.1.2 dhcp 服务namespace数据包去除vlan后通过隧道交给网络节点1cookie=0xb9ef2c735af9d49c, duration=2937.136s, table=20, n_packets=1, n_bytes=42, idle_age=2879, priority=2,dl_vlan=2,dl_dst=fa:16:3e:b0:bd:78 actions=strip_vlan,load:0x5b->NXM_NX_TUN_ID[],output:2 发给1.1.1.12的snat中网关数据包去除vlan后通过隧道交给网络节点1cookie=0xb9ef2c735af9d49c, duration=2926.229s, table=20, n_packets=0, n_bytes=0, idle_age=2958, priority=2,dl_vlan=2,dl_dst=fa:16:3e:6c:ab:eb actions=strip_vlan,load:0x5b->NXM_NX_TUN_ID[],output:2 发往172.16.1.4的snat中网关数据包去除vlan后通过隧道交给网络节点1cookie=0xb9ef2c735af9d49c, duration=2898.976s, table=20, n_packets=1, n_bytes=42, idle_age=2889, priority=2,dl_vlan=1,dl_dst=fa:16:3e:6b:31:95 actions=strip_vlan,load:0x44->NXM_NX_TUN_ID[],output:2 学习到的表项 12cookie=0xb9ef2c735af9d49c, duration=210.599s, table=20, n_packets=2, n_bytes=196, hard_timeout=300, idle_age=209, hard_age=208, priority=1,vlan_tci=0x0002/0x0fff,dl_dst=fa:16:3e:1f:95:2c actions=load:0->NXM_OF_VLAN_TCI[],load:0x5b->NXM_NX_TUN_ID[],output:2cookie=0xb9ef2c735af9d49c, duration=210.597s, table=20, n_packets=2, n_bytes=264, hard_timeout=300, idle_age=208, hard_age=208, priority=1,vlan_tci=0x0001/0x0fff,dl_dst=fa:16:3e:2c:c3:0d 
actions=load:0->NXM_OF_VLAN_TCI[],load:0x44->NXM_NX_TUN_ID[],output:2 未命中表20其它策略,统统交给流表22处理 1cookie=0xb9ef2c735af9d49c, duration=2987.986s, table=20, n_packets=5, n_bytes=418, idle_age=210, priority=0 actions=resubmit(,22) 表22去除vlan 附加上tunid将数据包从隧道发出12cookie=0xb9ef2c735af9d49c, duration=2926.237s, table=22, n_packets=18, n_bytes=2034, idle_age=210, priority=1,dl_vlan=2 actions=strip_vlan,load:0x5b->NXM_NX_TUN_ID[],output:2,output:3cookie=0xb9ef2c735af9d49c, duration=2898.979s, table=22, n_packets=23, n_bytes=2374, idle_age=568, priority=1,dl_vlan=1 actions=strip_vlan,load:0x44->NXM_NX_TUN_ID[],output:2,output:3 表22其它:1cookie=0xb9ef2c735af9d49c, duration=2987.979s, table=22, n_packets=54, n_bytes=6376, idle_age=2959, priority=0 actions=drop 分析br-ex流表1234root@compute:~# ovs-ofctl dump-flows br-exNXST_FLOW reply (xid=0x4): cookie=0x92b626c066ab8e9d, duration=13306.841s, table=0, n_packets=224, n_bytes=20652, idle_age=1596, priority=2,in_port=2 actions=drop cookie=0x92b626c066ab8e9d, duration=13306.862s, table=0, n_packets=277292, n_bytes=89976779, idle_age=0, priority=0 actions=NORMAL 如上流表:默认2三层转发策略,且丢弃从外网收到的数据包","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/tags/openstack/"},{"name":"DVR","slug":"DVR","permalink":"http://vcpu.me/tags/DVR/"},{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"控制节点、网络节点、2个计算节点","slug":"控制节点、网络节点、2个计算节点","permalink":"http://vcpu.me/tags/控制节点、网络节点、2个计算节点/"}]},{"title":"openstack多种组网环境下组网信息记录和插件组成","slug":"openstackenv","date":"2017-08-30T10:00:00.000Z","updated":"2017-10-11T03:20:48.000Z","comments":true,"path":"openstackenv/","link":"","permalink":"http://vcpu.me/openstackenv/","excerpt":"本文章简要介绍本文章旨在以图形方式记录已经搭建的实验环境中网络组成。都是实践结果并没有过多的理论解释,通过图形勾画出newton组网。实验环境包含如下: 两个服务器节点搭建openstack 环境一个服务器安装controller和network节点;一个服务器安装计算节点 
三个服务器节点搭建的openstack环境一个服务器安装controller、一个服务器安装network、一个服务器安装计算节点 四个服务器节点搭建的openstack环境一个服务器安装controller、一个服务器安装network、另外两台服务器分别安装计算节点","text":"本文章简要介绍本文章旨在以图形方式记录已经搭建的实验环境中网络组成。都是实践结果并没有过多的理论解释,通过图形勾画出newton组网。实验环境包含如下: 两个服务器节点搭建openstack 环境一个服务器安装controller和network节点;一个服务器安装计算节点 三个服务器节点搭建的openstack环境一个服务器安装controller、一个服务器安装network、一个服务器安装计算节点 四个服务器节点搭建的openstack环境一个服务器安装controller、一个服务器安装network、另外两台服务器分别安装计算节点 两个服务器节点搭建openstack 环境openstack官方安装环境,未采用openvswitch的组网 替换linuxbridge为openvswitch后的组网 三个服务器节点搭建openstack环境未采用ovs,按照官方环境分离安装的组网和组件关系合图 采用ovs,组网信息图和组件合图 采用ovs,DVR情况下组件关系图 四个服务器节点搭建的openstack环境采用ovs,两个计算节点组网图 采用ovs,DVR环境东西流量走向图 采用ovs,DVR环境无folating ip上外网流量走向图 采用ovs,DVR环境含有floating ip上外网流量走向图 采用ovs,DVR环境不同子网东西流量相互ping流量走向图","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/tags/openstack/"},{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"newton","slug":"newton","permalink":"http://vcpu.me/tags/newton/"}]},{"title":"图示openstack -组件间基本关系和虚拟机创建","slug":"vmcreate","date":"2017-08-24T10:00:00.000Z","updated":"2017-08-25T07:57:08.000Z","comments":true,"path":"vmcreate/","link":"","permalink":"http://vcpu.me/vmcreate/","excerpt":"openstack核心组件逻辑关系 horizon控制台提供WEB操作界面,用来创建虚拟机、增加网络、创建fw和负载均衡策略等 nova 响应虚拟机创建请求、调度、销毁云主机 glance 提供虚拟机创建镜像服务 cinyder 虚拟机块存储,供操作系统使用 swift 对象存储,数据备份和镜像存储 neutron 创建管理虚拟网络,供虚拟机使用 keytone 认证授权,属于共享组件 ceilometer 监控计费服务","text":"openstack核心组件逻辑关系 horizon控制台提供WEB操作界面,用来创建虚拟机、增加网络、创建fw和负载均衡策略等 nova 响应虚拟机创建请求、调度、销毁云主机 glance 提供虚拟机创建镜像服务 cinyder 虚拟机块存储,供操作系统使用 swift 对象存储,数据备份和镜像存储 neutron 创建管理虚拟网络,供虚拟机使用 keytone 认证授权,属于共享组件 ceilometer 监控计费服务 keystone User(用户): 指使用Openstack service的用户,可以是人、服务、系统,但凡使用了Openstack service的对象都可以称为User。 Project(Tenant): 可以理解为一个人、或服务所拥有的 资源集合 
。在一个Project(Tenant)中可以包含多个User,每一个User都会根据权限的划分来使用Project(Tenant)中的资源。比如通过Nova创建虚拟机时要指定到某个Project中,在Cinder创建卷也要指定到某个Project中。User访问Project的资源前,必须要与该Project关联,并且指定User在Project下的Role。 Role(角色): 用于划分权限。可以通过给User指定Role,使User获得Role对应的操作权限。Keystone返回给User的Token包含了Role列表,被访问的Services会判断访问它的User和User提供的Token中所包含的Role。系统默认使用管理Role admin和成员Role member 。 Policy(规则):OpenStack对User的验证除了OpenStack的身份验证以外,还需要鉴别User对某个Service是否有访问权限。Policy机制就是用来控制User对Tenant中资源(包括Services)的操作权限。对于Keystone service来说,Policy就是一个JSON文件,默认是/etc/keystone/policy.json。通过配置这个文件,Keystone Service实现了对User基于Role的权限管理。 Token: 是一个字符串表示,作为访问资源的令牌。Token包含了在 指定范围和有效时间内 可以被访问的资源。EG. 在Nova中一个tenant可以是一些虚拟机,在Swift和Glance中一个tenant可以是一些镜像存储,在Network中一个tenant可以是一些网络资源。Token一般被User持有。 Endpoint: 一个可以通过网络来访问和定位某个Openstack service的地址,通常是一个URL。比如,当Nova需要访问Glance服务去获取image 时,Nova通过访问Keystone拿到Glance的endpoint,然后通过访问该endpoint去获取Glance服务。我们可以通过Endpoint的region属性去定义多个region。 Endpoint 该使用对象分为三类:admin url –> 给admin用户使用,Post:35357 —->管理员internal url –> OpenStack内部服务使用来跟别的服务通信,Port:5000 —->内部员工public url –> 其它用户可以访问的地址,Post:5000 —->外部访问 角色,决定了你的权限.只和role有关.不论你从那个入口进入.你的权限,并不会改变.假如,就算管理员走internal url ,管理员的权限并不会改变有没有权限和endpoint没有关系 组件之间通信基于rest api创建完service后创建API EndPoint. 在openstack中,每一个service都有三种end points. 
Admin, public, internal。 Admin是用作管理用途的,如它能够修改user/tenant(project)。 public 是让客户调用的,比如可以部署在外网上让客户可以管理自己的云。internal是openstack内部调用的。三种endpoints 在网络上开放的权限一般也不同。Admin通常只能对内网开放,public通常可以对外网开放internal通常只能对安装有openstack对服务的机器开放。 V3新增Tenant 重命名为 Project添加了 Domain 的概念—->一堆project的集合添加了 Group 的概念(用户授权针对组来实现) keystone作用:论证用户权限分发rest api neutron neutron-server Neutron REST API 掌门人 neutron-plugin 不同网络功能实现的入口,各个厂商可以开发自己的plugin neutron-agent 插件具体动作执行 service-plugin: 实现L3-L7的网络服务(l3 router、firewall、loadbalancer、VPN、metering等) core-plugin: 负责管理L2的网络连接(Modular Layer 2) 12345+-------------+ +--------------+ +-----------------+| | | | | ||neutron server+-----> |neutron plugin+------> |neutron agent || | | | | |+-------------+ +--------------+ +-----------------+ nova 创建虚拟机流程 1.界面或命令行通过RESTful API向keystone获取认证信息。2.keystone通过用户请求认证信息,并生成auth-token返回给对应的认证请求。3.界面或命令行通过RESTful API向nova-api发送一个boot instance的请求(携带auth-token)。4.nova-api接受请求后向keystone发送认证请求,查看token是否为有效用户和token。5.keystone验证token是否有效,如有效则返回有效的认证和对应的角色(注:有些操作需要有角色权限才能操作)。6.通过认证后nova-api和数据库通讯。7.初始化新建虚拟机的数据库记录。8.nova-api通过rpc.call向nova-scheduler请求是否有创建虚拟机的资源(Host ID)。9.nova-scheduler进程侦听消息队列,获取nova-api的请求。10.nova-scheduler通过查询nova数据库中计算资源的情况,并通过调度算法计算符合虚拟机创建需要的主机。11.对于有符合虚拟机创建的主机,nova-scheduler更新数据库中虚拟机对应的物理主机信息。12.nova-scheduler通过rpc.cast向nova-compute发送对应的创建虚拟机请求的消息。13.nova-compute会从对应的消息队列中获取创建虚拟机请求的消息。14.nova-compute通过rpc.call向nova-conductor请求获取虚拟机消息。(Flavor)15.nova-conductor从消息队队列中拿到nova-compute请求消息。16.nova-conductor根据消息查询虚拟机对应的信息。17.nova-conductor从数据库中获得虚拟机对应信息。18nova-conductor把虚拟机信息通过消息的方式发送到消息队列中。19.nova-compute从对应的消息队列中获取虚拟机信息消息。20.nova-compute通过keystone的RESTfull API拿到认证的token,并通过HTTP请求glance-api获取创建虚拟机所需要镜像。21.glance-api向keystone认证token是否有效,并返回验证结果。22.token验证通过,nova-compute获得虚拟机镜像信息(URL)。23.nova-compute通过keystone的RESTfull API拿到认证k的token,并通过HTTP请求neutron-server获取创建虚拟机所需要的网络信息。24.neutron-server向keystone认证token是否有效,并返回验证结果。25.token验证通过,nova-compute获得虚拟机网络信息。26.nova-compute通过keystone的RESTfull 
API拿到认证的token,并通过HTTP请求cinder-api获取创建虚拟机所需要的持久化存储信息。17.cinder-api向keystone认证token是否有效,并返回验证结果。28.token验证通过,nova-compute获得虚拟机持久化存储信息。29.nova-compute根据instance的信息调用配置的虚拟化驱动来创建虚拟机。","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/tags/openstack/"},{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"}]},{"title":"metadata","slug":"metadata","date":"2017-08-24T10:00:00.000Z","updated":"2017-08-25T07:57:18.000Z","comments":true,"path":"metadata/","link":"","permalink":"http://vcpu.me/metadata/","excerpt":"metadata基本概念在创建虚拟机的时候,用户往往需要对虚拟机进行一些配置,比如:开启一些服务、安装某些包、添加 SSH 秘钥、配置 hostname 等等。 在 OpenStack 中,这些配置信息被分成两类:metadata 和 user data。 Metadata 主要包括虚拟机自身的一些常用属性,如 hostname、网络配置信息、SSH 登陆秘钥等,主要的形式为键值对。而 user data 主要包括一些命令、脚本等。 User data 通过文件传递,并支持多种文件格式,包括 gzip 压缩文件、shell 脚本、cloud-init 配置文件等。虽然 metadata 和 user data 并不相同,但是 OpenStack 向虚拟机提供这两种信息的机制是一致的,只是虚拟机在获取到信息后,对两者的处理方式不同罢了。所以下文统一用 matadata 来描述。 在 OpenStack 中,虚拟机获取 Metadata 信息的方式有两种:Config drive 和 metadata RESTful 服务。此文只关注解析metadata RESTful 方式","text":"metadata基本概念在创建虚拟机的时候,用户往往需要对虚拟机进行一些配置,比如:开启一些服务、安装某些包、添加 SSH 秘钥、配置 hostname 等等。 在 OpenStack 中,这些配置信息被分成两类:metadata 和 user data。 Metadata 主要包括虚拟机自身的一些常用属性,如 hostname、网络配置信息、SSH 登陆秘钥等,主要的形式为键值对。而 user data 主要包括一些命令、脚本等。 User data 通过文件传递,并支持多种文件格式,包括 gzip 压缩文件、shell 脚本、cloud-init 配置文件等。虽然 metadata 和 user data 并不相同,但是 OpenStack 向虚拟机提供这两种信息的机制是一致的,只是虚拟机在获取到信息后,对两者的处理方式不同罢了。所以下文统一用 matadata 来描述。 在 OpenStack 中,虚拟机获取 Metadata 信息的方式有两种:Config drive 和 metadata RESTful 服务。此文只关注解析metadata RESTful 方式 Metadata RESTful 工作原理 meta作用可以使用metadata技术实现用户个性的配置 总结Metadata 服务为用户自定义配置虚拟机提供了有效的解决方案。本文剖析了 OpenStack 提供 metadata 服务的RESTful机制。那么Config drive 和RESTful机制差别在哪? 
Config drive 机制主要用于配置虚拟机的网络信息,包括 IP、子网掩码、网关等。当虚拟机无法通过 DHCP 正确获取网络信息时,config drive 是获取 metadata 信息的必要方式。 如果虚拟机能够自动正确配置网络,那么可以通过 RESTful 服务的方式获取 metadata 信息。","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/tags/openstack/"},{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"metadata","slug":"metadata","permalink":"http://vcpu.me/tags/metadata/"},{"name":"nova-api-metadata","slug":"nova-api-metadata","permalink":"http://vcpu.me/tags/nova-api-metadata/"},{"name":"Neutron-ns-metadata-proxy","slug":"Neutron-ns-metadata-proxy","permalink":"http://vcpu.me/tags/Neutron-ns-metadata-proxy/"},{"name":"Neutron-metadata-agent","slug":"Neutron-metadata-agent","permalink":"http://vcpu.me/tags/Neutron-metadata-agent/"}]},{"title":"fwaas","slug":"fwaas","date":"2017-08-14T10:00:00.000Z","updated":"2017-08-25T07:57:47.000Z","comments":true,"path":"fwaas/","link":"","permalink":"http://vcpu.me/fwaas/","excerpt":"openstack配置安装环境 ubuntu版本root@controller:~# lsb_release -aNo LSB modules are available.Distributor ID: UbuntuDescription: Ubuntu 16.04.2 LTSRelease: 16.04Codename: xenial openstack newton版本root@controller:~# openstack --versionopenstack 3.2.0 安装环境osx 下的vbox两台虚拟机(controller和compute)openstack网络节点和控制节点安装到controller机器上,计算节点安装到compute节点上 组网信息","text":"openstack配置安装环境 ubuntu版本root@controller:~# lsb_release -aNo LSB modules are available.Distributor ID: UbuntuDescription: Ubuntu 16.04.2 LTSRelease: 16.04Codename: xenial openstack newton版本root@controller:~# openstack --versionopenstack 3.2.0 安装环境osx 下的vbox两台虚拟机(controller和compute)openstack网络节点和控制节点安装到controller机器上,计算节点安装到compute节点上 组网信息 配置模块 配置详情/etc/neutron/neutron.conf[DEFAULT]service_plugins = router,firewall [service_providers]service_provider = FIREWALL:Iptables:neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver:default
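上面 neutron.conf 的两处配置,可以用一小段 Python(configparser)做离线校验:确认 service_plugins 中包含 firewall,且 service_provider 的驱动串格式正确(配置内容取自上文片段,仅为示意):

```python
import configparser

# 上文 neutron.conf 的最小片段(示意)
conf_text = """
[DEFAULT]
service_plugins = router,firewall

[service_providers]
service_provider = FIREWALL:Iptables:neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver:default
"""

cfg = configparser.ConfigParser()
cfg.read_string(conf_text)

plugins = [p.strip() for p in cfg["DEFAULT"]["service_plugins"].split(",")]
# service_provider 形如 <service_type>:<name>:<driver>[:default]
fields = cfg["service_providers"]["service_provider"].split(":")

print(plugins)
print(fields[0], fields[-1])
```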
注意:如果没有[service_providers]标签就在配置最后新增[fwaas]123agent_version = v1driver = iptablesenabled = True 注意:如果没有[service_providers]标签就在配置最后新增 /etc/neutron/fwaas_driver.ini[fwaas]12driver = neutron_fwaas.services.firewall.drivers.linux.iptables_fwaas.IptablesFwaasDriverenabled = True /etc/neutron/l3_agent.ini[AGENT]1extensions = fwaas 将配置加载到数据库中1neutron-db-manage --subproject neutron-fwaas upgrade head 配置完成重启neutron-server和neutron-l3-agent服务让配置生效service neutron-server restartservice neutron-l3-agent restart 添加FW策略默认不含邮rule策略12neutron firewall-policy-create myfwallpolicyneutron firewall-create myfwallpolicy --name myfw 确定fw开启成功1234567891011121314root@controller:~# neutron firewall-show myfw+--------------------+--------------------------------------+| Field | Value |+--------------------+--------------------------------------+| admin_state_up | True || description | || firewall_policy_id | 4a1e2adb-ac2c-44d2-8622-65bcc9982c05 || id | 47e900ef-6baf-4958-902a-77ba20f9791f || name | myfw || project_id | 2aba7c0d74a54b9a9acf1e2810ef996e || router_ids | 444e3a9e-1011-46fd-af5c-bcad004e236b || status | ACTIVE || tenant_id | 2aba7c0d74a54b9a9acf1e2810ef996e |+--------------------+--------------------------------------+ 说明:实验中newton版本的openstackPENDING_CREATE状态包含fw没有关联router,也是配置错误或则因为没有重启neutron-l3-agent和neutron-server 而导致的配置没有生效哎,在这里废了很大劲。 查看fwaas iptables 策略1234567891011121314151617181920212223242526272829303132333435363738root@controller:~# ip netnsqdhcp-0dbb735a-ab82-4658-a544-1e91cc7f68db (id: 3)qdhcp-e47041c9-008c-4c1b-8fe5-99120d8765b8 (id: 1)qdhcp-0ad9b398-b0ac-476a-b1bc-b1f58bf8eff4 (id: 2)qrouter-444e3a9e-1011-46fd-af5c-bcad004e236b (id: 0)root@controller:~# ip netns exec qrouter-444e3a9e-1011-46fd-af5c-bcad004e236b iptables -S-P INPUT ACCEPT-P FORWARD ACCEPT-P OUTPUT ACCEPT-N neutron-filter-top-N neutron-l3-agent-FORWARD-N neutron-l3-agent-INPUT-N neutron-l3-agent-OUTPUT-N neutron-l3-agent-fwaas-defau-N neutron-l3-agent-iv4ec95c473-N neutron-l3-agent-local-N 
neutron-l3-agent-ov4ec95c473-N neutron-l3-agent-scope-A INPUT -j neutron-l3-agent-INPUT-A FORWARD -j neutron-filter-top-A FORWARD -j neutron-l3-agent-FORWARD-A OUTPUT -j neutron-filter-top-A OUTPUT -j neutron-l3-agent-OUTPUT-A neutron-filter-top -j neutron-l3-agent-local-A neutron-l3-agent-FORWARD -j neutron-l3-agent-scope-A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-iv4ec95c473-A neutron-l3-agent-FORWARD -i qr-+ -j neutron-l3-agent-ov4ec95c473-A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-fwaas-defau-A neutron-l3-agent-FORWARD -i qr-+ -j neutron-l3-agent-fwaas-defau-A neutron-l3-agent-INPUT -m mark --mark 0x1/0xffff -j ACCEPT-A neutron-l3-agent-INPUT -p tcp -m tcp --dport 9697 -j DROP-A neutron-l3-agent-fwaas-defau -j DROP-A neutron-l3-agent-iv4ec95c473 -m state --state INVALID -j DROP-A neutron-l3-agent-iv4ec95c473 -m state --state RELATED,ESTABLISHED -j ACCEPT-A neutron-l3-agent-ov4ec95c473 -m state --state INVALID -j DROP-A neutron-l3-agent-ov4ec95c473 -m state --state RELATED,ESTABLISHED -j ACCEPT-A neutron-l3-agent-scope -o qr-4346328d-f4 -m mark ! --mark 0x4000000/0xffff0000 -j DROP-A neutron-l3-agent-scope -o qr-d879832c-1e -m mark ! 
--mark 0x4000000/0xffff0000 -j DROP 放通两个子网跨网时候icmp123456789101112131415161718192021222324root@controller:~# neutron firewall-rule-create --protocol icmp --action allow --name myruleCreated a new firewall_rule:+------------------------+--------------------------------------+| Field | Value |+------------------------+--------------------------------------+| action | allow || description | || destination_ip_address | || destination_port | || enabled | True || firewall_policy_id | || id | 77e7d0b6-d9b3-453e-a0c1-833faae5cd48 || ip_version | 4 || name | myrule || position | || project_id | 2aba7c0d74a54b9a9acf1e2810ef996e || protocol | icmp || shared | False || source_ip_address | || source_port | || tenant_id | 2aba7c0d74a54b9a9acf1e2810ef996e |+------------------------+--------------------------------------+root@controller:~# neutron firewall-policy-insert-rule myfwallpolicy myruleInserted firewall rule in firewall policy myfwallpolicy 查看放通icmp后策略1234567891011121314151617181920212223242526272829303132333435root@controller:~# ip netns exec qrouter-444e3a9e-1011-46fd-af5c-bcad004e236b iptables -S-P INPUT ACCEPT-P FORWARD ACCEPT-P OUTPUT ACCEPT-N neutron-filter-top-N neutron-l3-agent-FORWARD-N neutron-l3-agent-INPUT-N neutron-l3-agent-OUTPUT-N neutron-l3-agent-fwaas-defau-N neutron-l3-agent-iv4ec95c473-N neutron-l3-agent-local-N neutron-l3-agent-ov4ec95c473-N neutron-l3-agent-scope-A INPUT -j neutron-l3-agent-INPUT-A FORWARD -j neutron-filter-top-A FORWARD -j neutron-l3-agent-FORWARD-A OUTPUT -j neutron-filter-top-A OUTPUT -j neutron-l3-agent-OUTPUT-A neutron-filter-top -j neutron-l3-agent-local-A neutron-l3-agent-FORWARD -j neutron-l3-agent-scope-A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-iv4ec95c473-A neutron-l3-agent-FORWARD -i qr-+ -j neutron-l3-agent-ov4ec95c473-A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-fwaas-defau-A neutron-l3-agent-FORWARD -i qr-+ -j neutron-l3-agent-fwaas-defau-A neutron-l3-agent-INPUT -m mark --mark 0x1/0xffff -j 
ACCEPT-A neutron-l3-agent-INPUT -p tcp -m tcp --dport 9697 -j DROP-A neutron-l3-agent-fwaas-defau -j DROP-A neutron-l3-agent-iv4ec95c473 -m state --state INVALID -j DROP-A neutron-l3-agent-iv4ec95c473 -m state --state RELATED,ESTABLISHED -j ACCEPT-A neutron-l3-agent-iv4ec95c473 -p icmp -j ACCEPT-A neutron-l3-agent-ov4ec95c473 -m state --state INVALID -j DROP-A neutron-l3-agent-ov4ec95c473 -m state --state RELATED,ESTABLISHED -j ACCEPT-A neutron-l3-agent-ov4ec95c473 -p icmp -j ACCEPT-A neutron-l3-agent-scope -o qr-4346328d-f4 -m mark ! --mark 0x4000000/0xffff0000 -j DROP-A neutron-l3-agent-scope -o qr-d879832c-1e -m mark ! --mark 0x4000000/0xffff0000 -j DROP 12345-A FORWARD -j neutron-l3-agent-FORWARD-A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-iv4ec95c473-A neutron-l3-agent-FORWARD -i qr-+ -j neutron-l3-agent-ov4ec95c473-A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-fwaas-defau-A neutron-l3-agent-FORWARD -i qr-+ -j neutron-l3-agent-fwaas-defau router namespace转发表处理交给neutron-l3-agent-FORWARD表处理具体策略为: -A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-iv4ec95c473 -A neutron-l3-agent-FORWARD -o qr-+ -j neutron-l3-agent-fwaas-defauFORWARD最终交给neutron-l3-agent-iv4ec95c473和neutron-l3-agent-fwaas-defau处理-A neutron-l3-agent-iv4ec95c473 -m state –state INVALID -j DROP-A neutron-l3-agent-iv4ec95c473 -m state –state RELATED,ESTABLISHED -j ACCEPT-A neutron-l3-agent-fwaas-defau -j DROP 默认情况下所由经过router qr接口数据包进行转发的全部丢弃 放通规则后新增规则分析12-A neutron-l3-agent-iv4ec95c473 -p icmp -j ACCEPT-A neutron-l3-agent-ov4ec95c473 -p icmp -j ACCEPT 新增icmp放通策略,此时候跨越router ping流量可放通了 开启fw情况下如何使得vm上网 开通dns限制neutron firewall-rule-create –protocol udp –destination-port 53 –action allow –name dns 开通tcp协议限制neutron firewall-rule-create –protocol tcp –action allow –name myrul 如果在未开启fw前可以上外网,开fw后不能上外网后,经过上述策略放通就可以实现vm上外网需求 查看所有策略查看ruleneutron firewall-rule-list查看pllicyneutron firewall-policy-list查看fw墙neutron firewall-list 12345678910111213141516171819202122root@controller:~# neutron 
firewall-rule-list+------------------------------------+--------+------------------------------------+----------------------+---------+| id | name | firewall_policy_id | summary | enabled |+------------------------------------+--------+------------------------------------+----------------------+---------+| 69912a17-e462-40ce-b30e- | myrule | 7f3539bb-145c- | ICMP, | True || b547f0673dbd | | 4f59-b374-83abbb9729b0 | source: none(none), | || | | | dest: none(none), | || | | | allow | |+------------------------------------+--------+------------------------------------+----------------------+---------+root@controller:~# neutron firewall-policy-list+--------------------------------------+---------------+----------------------------------------+| id | name | firewall_rules |+--------------------------------------+---------------+----------------------------------------+| 7f3539bb-145c-4f59-b374-83abbb9729b0 | myfwallpolicy | [69912a17-e462-40ce-b30e-b547f0673dbd] |+--------------------------------------+---------------+----------------------------------------+root@controller:~# neutron firewall-list+--------------------------------------+------+--------------------------------------+| id | name | firewall_policy_id |+--------------------------------------+------+--------------------------------------+| 532f3f7a-376d-485a-a8db-2e3598fc9b86 | myfw | 7f3539bb-145c-4f59-b374-83abbb9729b0 |+--------------------------------------+------+--------------------------------------+","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/tags/openstack/"},{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"fwaas","slug":"fwaas","permalink":"http://vcpu.me/tags/fwaas/"}]},{"title":"openstack Security 
group","slug":"secritygroup","date":"2017-08-09T10:00:00.000Z","updated":"2017-08-09T06:59:00.000Z","comments":true,"path":"secritygroup/","link":"","permalink":"http://vcpu.me/secritygroup/","excerpt":"Security group(安全组)概念安全组是通过Linux IPtables实现的,安全组对象是虚拟网卡,由L2 Agent(neutron_openvswitch_agent、neutron_linuxbridfe_agent)实现,在计算节点上通过iptables规则限制虚拟网卡流量进出。其容易和FW混淆,FW主要是异常流量隔离,一般负责跨子网流量不管虚拟网卡间通讯流量。","text":"Security group(安全组)概念安全组是通过Linux IPtables实现的,安全组对象是虚拟网卡,由L2 Agent(neutron_openvswitch_agent、neutron_linuxbridfe_agent)实现,在计算节点上通过iptables规则限制虚拟网卡流量进出。其容易和FW混淆,FW主要是异常流量隔离,一般负责跨子网流量不管虚拟网卡间通讯流量。 neutron port-list 列出属于租户的所有接口123456789101112root@controller:~# . demo-openrcroot@controller:~# neutron port-list+--------------------------------------+------+-------------------+--------------------------------------------+| id | name | mac_address | fixed_ips |+--------------------------------------+------+-------------------+--------------------------------------------+| 44a3abfa-1203-48d8-b83c-0c8657853db9 | | fa:16:3e:71:34:ed | {"subnet_id": "f746bd76-5bb3-4432-8fde- || | | | 66ccf8c59da9", "ip_address": "172.16.1.7"} || d879832c-1ec1-429e-8e4f-2e11506435a4 | | fa:16:3e:2e:09:4b | {"subnet_id": "f746bd76-5bb3-4432-8fde- || | | | 66ccf8c59da9", "ip_address": "172.16.1.1"} || f804cbf3-c4e8-4f66-ade5-87661591ecca | | fa:16:3e:f2:22:5d | {"subnet_id": "f746bd76-5bb3-4432-8fde- || | | | 66ccf8c59da9", "ip_address": "172.16.1.2"} |+--------------------------------------+------+-------------------+--------------------------------------------+ openstack连接拓扑图(在此基础上研究安全组概念) 查看计算节点INPUT链总结图: 123456789root@compute:~# iptables --line-numbers -vnL INPUTChain INPUT (policy ACCEPT 36472 packets, 8475K bytes)num pkts bytes target prot opt in out source destination1 33322 7781K nova-compute-INPUT all -- * * 0.0.0.0/0 0.0.0.0/02 36472 8475K neutron-linuxbri-INPUT all -- * * 0.0.0.0/0 0.0.0.0/03 0 0 ACCEPT udp -- virbr0 * 0.0.0.0/0 0.0.0.0/0 udp dpt:534 0 0 ACCEPT tcp -- virbr0 * 0.0.0.0/0 
0.0.0.0/0 tcp dpt:535 0 0 ACCEPT udp -- virbr0 * 0.0.0.0/0 0.0.0.0/0 udp dpt:676 0 0 ACCEPT tcp -- virbr0 * 0.0.0.0/0 0.0.0.0/0 tcp dpt:67 上述3、4、5、6策略放通dns和dhcp流量使用 根据策略1继续查看 nova-compute-INPUT123root@compute:~# iptables --line-numbers -vnL nova-compute-INPUTChain nova-compute-INPUT (1 references)num pkts bytes target prot opt in out source destination 根据策略2继续查看 neutron-linuxbri-INPUT123Chain neutron-linuxbri-INPUT (1 references)num pkts bytes target prot opt in out source destination1 0 0 neutron-linuxbri-o44a3abfa-1 all -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-in tap44a3abfa-12 --physdev-is-bridged /* Direct incoming traffic from VM to the security group chain. */ linux桥和vm互联接口为tap44a3abfa-12,vm发出流量,查找INPUT链最终会查找neutron-linuxbri-o44a3abfa-1 根据策略1 继续查看neutron-linuxbri-o44a3abfa-11234567891011root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-o44a3abfa-1Chain neutron-linuxbri-o44a3abfa-1 (2 references)num pkts bytes target prot opt in out source destination1 2 648 RETURN udp -- * * 0.0.0.0 255.255.255.255 udp spt:68 dpt:67 /* Allow DHCP client traffic. */2 904 106K neutron-linuxbri-s44a3abfa-1 all -- * * 0.0.0.0/0 0.0.0.0/03 0 0 RETURN udp -- * * 0.0.0.0/0 0.0.0.0/0 udp spt:68 dpt:67 /* Allow DHCP client traffic. */4 0 0 DROP udp -- * * 0.0.0.0/0 0.0.0.0/0 udp spt:67 udp dpt:68 /* Prevent DHCP Spoofing by VM. */5 865 103K RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED /* Direct packets associated with a known session to the RETURN chain. */6 39 2573 RETURN all -- * * 0.0.0.0/0 0.0.0.0/07 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 state INVALID /* Drop packets that appear related to an existing connection (e.g. TCP ACK/FIN) but do not have an entry in conntrack. */8 0 0 neutron-linuxbri-sg-fallback all -- * * 0.0.0.0/0 0.0.0.0/0 /* Send unmatched traffic to the fallback chain. 
*/ DHCP广播地址、vm发包的ip和mac必须是vm分配的 neutron-linuxbri-s44a3abfa-1 检查流量源ip和mac是否为已分配(防止伪装地址攻击)123456root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-s44a3abfa-1Chain neutron-linuxbri-s44a3abfa-1 (1 references)num pkts bytes target prot opt in out source destination1 904 106K RETURN all -- * * 172.16.1.7 0.0.0.0/0 MAC FA:16:3E:71:34:ED /* Allow traffic from defined IP/MAC pairs. */2 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* Drop traffic without an IP/MAC allow rule. */root@compute:~# 该链表主要检查从vm发出来的数据包,是否是openstack所分配的IP和MAC,如果不匹配,禁止通过。此处是为了防止利用VM上进行一些伪装地址的攻击。 查看计算节点OUTPUT链总结图: 12345678root@compute:~# iptables --line-numbers -vnL OUTPUTChain OUTPUT (policy ACCEPT 37285 packets, 8946K bytes)num pkts bytes target prot opt in out source destination1 140K 33M nova-filter-top all -- * * 0.0.0.0/0 0.0.0.0/02 34104 8216K nova-compute-OUTPUT all -- * * 0.0.0.0/0 0.0.0.0/03 37285 8946K neutron-filter-top all -- * * 0.0.0.0/0 0.0.0.0/04 37285 8946K neutron-linuxbri-OUTPUT all -- * * 0.0.0.0/0 0.0.0.0/05 0 0 ACCEPT udp -- * virbr0 0.0.0.0/0 0.0.0.0/0 udp dpt:68 策略1从OUTPUT跳转到nova-filter-top跳转nova-compute-local12345678root@compute:~# iptables --line-numbers -vnL nova-filter-topChain nova-filter-top (2 references)num pkts bytes target prot opt in out source destination1 39921 9619K nova-compute-local all -- * * 0.0.0.0/0 0.0.0.0/0root@compute:~# iptables --line-numbers -vnL nova-compute-localChain nova-compute-local (1 references)num pkts bytes target prot opt in out source destinationnova-compute-local 目前没有什么策略 策略2从OUTPUT跳转到nova-compute-OUTPUT123root@compute:~# iptables --line-numbers -vnL nova-compute-OUTPUTChain nova-compute-OUTPUT (1 references)num pkts bytes target prot opt in out source destination nova-compute-OUTPUT没有什么策略 策略3从OUTPIUT跳转到neutron-filter-top跳转到neutron-linuxbri-local12345678root@compute:~# iptables --line-numbers -vnL neutron-filter-topChain neutron-filter-top (2 references)num pkts bytes target prot opt in out source destination1 164K 38M 
neutron-linuxbri-local all -- * * 0.0.0.0/0 0.0.0.0/0root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-localChain neutron-linuxbri-local (1 references)num pkts bytes target prot opt in out source destination neutron-linuxbri-local目前没有什么策略 策略4从OUTPUT跳转到neutron-linuxbri-OUTPUT123root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-OUTPUTChain neutron-linuxbri-OUTPUT (1 references)num pkts bytes target prot opt in out source destination neutron-linuxbri-OUTPUT目前没有什么策略 策略5桥向68端口发udp包策略默认放开 查看计算节点FORWARD链总结图: 123456789101112root@compute:~# iptables --line-numbers -vnL FORWARDChain FORWARD (policy ACCEPT 0 packets, 0 bytes)num pkts bytes target prot opt in out source destination1 0 0 neutron-filter-top all -- * * 0.0.0.0/0 0.0.0.0/02 0 0 neutron-linuxbri-FORWARD all -- * * 0.0.0.0/0 0.0.0.0/03 0 0 nova-filter-top all -- * * 0.0.0.0/0 0.0.0.0/04 0 0 nova-compute-FORWARD all -- * * 0.0.0.0/0 0.0.0.0/05 0 0 ACCEPT all -- * virbr0 0.0.0.0/0 192.168.122.0/24 ctstate RELATED,ESTABLISHED6 0 0 ACCEPT all -- virbr0 * 192.168.122.0/24 0.0.0.0/07 0 0 ACCEPT all -- virbr0 virbr0 0.0.0.0/0 0.0.0.0/08 0 0 REJECT all -- * virbr0 0.0.0.0/0 0.0.0.0/0 reject-with icmp-port-unreachable9 0 0 REJECT all -- virbr0 * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-port-unreachable 策略1从FORWARD跳转neutron-filter-top跳转到 neutron-linuxbri-local123456789root@compute:~# iptables --line-numbers -vnL neutron-filter-topChain neutron-filter-top (2 references)num pkts bytes target prot opt in out source destination1 165K 38M neutron-linuxbri-local all -- * * 0.0.0.0/0 0.0.0.0/0root@compute:~#root@compute:~#root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-localChain neutron-linuxbri-local (1 references)num pkts bytes target prot opt in out source destination nova-compute-local目前没有什么策略 策略2从FORWARD跳转到neutron-linuxbri-FORWARD123456root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-FORWARDChain neutron-linuxbri-FORWARD (1 references)num pkts bytes target prot opt in out source 
destination1 1212 112K neutron-linuxbri-sg-chain all -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-out tap44a3abfa-12 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */2 906 106K neutron-linuxbri-sg-chain all -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-in tap44a3abfa-12 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */root@compute:~# 经过接口tap44a3abfa-12流量全部交给链表neutron-linuxbri-sg-chain处理 链表neutron-linuxbri-sg-chain策略123456root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-sg-chainChain neutron-linuxbri-sg-chain (2 references)num pkts bytes target prot opt in out source destination1 1212 112K neutron-linuxbri-i44a3abfa-1 all -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-out tap44a3abfa-12 --physdev-is-bridged /* Jump to the VM specific chain. */2 906 106K neutron-linuxbri-o44a3abfa-1 all -- * * 0.0.0.0/0 0.0.0.0/0 PHYSDEV match --physdev-in tap44a3abfa-12 --physdev-is-bridged /* Jump to the VM specific chain. */3 2604 267K ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 策略1:经过tap44a3abfa-12发给vm转发流量均由neutron-linuxbri-i44a3abfa-1 处理 策略2:vm发出到直连桥流量均由neutron-linuxbri-o44a3abfa-1 处理 neutron-linuxbri-i44a3abfa-1策略12345678910root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-i44a3abfa-1Chain neutron-linuxbri-i44a3abfa-1 (1 references)num pkts bytes target prot opt in out source destination1 1194 110K RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED /* Direct packets associated with a known session to the RETURN chain. */2 2 729 RETURN udp -- * * 172.16.1.2 0.0.0.0/0 udp spt:67 udp dpt:683 8 480 RETURN tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:224 0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 match-set NIPv4d81fb7c3-1083-4203-8d6c- src5 6 504 RETURN icmp -- * * 0.0.0.0/0 0.0.0.0/06 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 state INVALID /* Drop packets that appear related to an existing connection (e.g. TCP ACK/FIN) but do not have an entry in conntrack. 
*/7 2 648 neutron-linuxbri-sg-fallback all -- * * 0.0.0.0/0 0.0.0.0/0 /* Send unmatched traffic to the fallback chain. */ DHCP、ssh、icmp、以及已经建立好会话状态的后续连接流量可发给vm,其它流量一律丢弃 neutron-linuxbri-o44a3abfa-1策略1234567891011root@compute:~# iptables --line-numbers -vnL neutron-linuxbri-o44a3abfa-1Chain neutron-linuxbri-o44a3abfa-1 (2 references)num pkts bytes target prot opt in out source destination1 2 648 RETURN udp -- * * 0.0.0.0 255.255.255.255 udp spt:68 dpt:67 /* Allow DHCP client traffic. */2 904 106K neutron-linuxbri-s44a3abfa-1 all -- * * 0.0.0.0/0 0.0.0.0/03 0 0 RETURN udp -- * * 0.0.0.0/0 0.0.0.0/0 udp spt:68 dpt:67 /* Allow DHCP client traffic. */4 0 0 DROP udp -- * * 0.0.0.0/0 0.0.0.0/0 udp spt:67 udp dpt:68 /* Prevent DHCP Spoofing by VM. */5 865 103K RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED /* Direct packets associated with a known session to the RETURN chain. */6 39 2573 RETURN all -- * * 0.0.0.0/0 0.0.0.0/07 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 state INVALID /* Drop packets that appear related to an existing connection (e.g. TCP ACK/FIN) but do not have an entry in conntrack. */8 0 0 neutron-linuxbri-sg-fallback all -- * * 0.0.0.0/0 0.0.0.0/0 /* Send unmatched traffic to the fallback chain. 
*/ DHCP返会流量、vm本机发出去流量均可以正常转发 策略3 从FORWARD跳转到neutron-filter-top跳转到nova-compute-local123456789root@compute:~# iptables --line-numbers -vnL nova-filter-topChain nova-filter-top (2 references)num pkts bytes target prot opt in out source destination1 4344 1066K nova-compute-local all -- * * 0.0.0.0/0 0.0.0.0/0root@compute:~#root@compute:~# iptables --line-numbers -vnL nova-compute-localChain nova-compute-local (1 references)num pkts bytes target prot opt in out source destinationroot@compute:~# neutron-linuxbri-local 目前没有什么策略 策略4 从FORWARD跳转交给neutron-linuxbri-FORWARD处理123456root@compute:~# iptables --line-numbers -vnL nova-compute-FORWARDChain nova-compute-FORWARD (1 references)num pkts bytes target prot opt in out source destination1 0 0 ACCEPT all -- brqe47041c9-00 * 0.0.0.0/0 0.0.0.0/02 0 0 ACCEPT all -- * brqe47041c9-00 0.0.0.0/0 0.0.0.0/03 0 0 DROP all -- * brqe47041c9-00 0.0.0.0/0 0.0.0.0/0 寻找计算节点vm连接的接口的策略1234567891011121314root@compute:~# brctl showbridge name bridge id STP enabled interfacesbrqe47041c9-00 8000.7a0a01214421 no tap44a3abfa-12 vxlan-68virbr0 8000.000000000000 yesroot@compute:~#root@compute:~#root@compute:~# iptables -S|grep tap44a3abfa-12-A neutron-linuxbri-FORWARD -m physdev --physdev-out tap44a3abfa-12 --physdev-is-bridged -m comment --comment "Direct traffic from the VM interface to the security group chain." -j neutron-linuxbri-sg-chain-A neutron-linuxbri-FORWARD -m physdev --physdev-in tap44a3abfa-12 --physdev-is-bridged -m comment --comment "Direct traffic from the VM interface to the security group chain." -j neutron-linuxbri-sg-chain-A neutron-linuxbri-INPUT -m physdev --physdev-in tap44a3abfa-12 --physdev-is-bridged -m comment --comment "Direct incoming traffic from VM to the security group chain." -j neutron-linuxbri-o44a3abfa-1-A neutron-linuxbri-sg-chain -m physdev --physdev-out tap44a3abfa-12 --physdev-is-bridged -m comment --comment "Jump to the VM specific chain." 
-j neutron-linuxbri-i44a3abfa-1-A neutron-linuxbri-sg-chain -m physdev --physdev-in tap44a3abfa-12 --physdev-is-bridged -m comment --comment "Jump to the VM specific chain." -j neutron-linuxbri-o44a3abfa-1root@compute:~# tap44a3abfa-12接口的FORWARD流量均扔给neutron-linuxbri-sg-chain处理,访问虚拟机流量扔给neutron-linuxbri-i44a3abfa-1,从虚拟机出来流量扔给neutron-linuxbri-o44a3abfa-1从vm发出流量到 tap44a3abfa-12接口的INPUT扔给neutron-linuxbri-o44a3abfa-1链处理 待续","categories":[{"name":"openstack","slug":"openstack","permalink":"http://vcpu.me/categories/openstack/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"Security group","slug":"Security-group","permalink":"http://vcpu.me/tags/Security-group/"},{"name":"安全组","slug":"安全组","permalink":"http://vcpu.me/tags/安全组/"}]},{"title":"openvSwitch 基础命令行","slug":"openvswitch4","date":"2017-08-08T10:00:00.000Z","updated":"2017-09-01T09:02:28.000Z","comments":true,"path":"openvswitch4/","link":"","permalink":"http://vcpu.me/openvswitch4/","excerpt":"创建ovs(虚拟交换机) br0ovs-vsctl add-br br0 创建一根线first_brfirst_ifip link add first_br type veth peer name first_if 启动线的两个接口ip link set first_br upip link set first_if up 将接口first_br接入虚拟交换机br0ovs-vsctl add-port br0 first_br","text":"创建ovs(虚拟交换机) br0ovs-vsctl add-br br0 创建一根线first_brfirst_ifip link add first_br type veth peer name first_if 启动线的两个接口ip link set first_br upip link set first_if up 将接口first_br接入虚拟交换机br0ovs-vsctl add-port br0 first_br 列出网桥的基本信息ovs-vsctl list Bridge123456789101112131415161718192021222324root@controller-VirtualBox:~# ovs-vsctl list Bridge_uuid : 2a47c03f-4060-4b90-ac24-131dbe051446auto_attach : []controller : []datapath_id : "00003ec0472a904b"datapath_type : ""datapath_version : "<unknown>"external_ids : {}fail_mode : []flood_vlans : []flow_tables : {}ipfix : []mcast_snooping_enable: falsemirrors : []name : "br0"netflow : []other_config : {}ports : [7cb41bb3-ae24-473d-b93a-df5786fb177c, f83add01-bac5-4c63-859c-068a2cd2535d]protocols : 
[]rstp_enable : falserstp_status : {}sflow : []status : {}stp_enable : false 将first_br划分到vlan100中ovs-vsctl set Port first_br tag=100 查看接口是否加入到vlan100中1234567891011root@controller-VirtualBox:~# ovs-vsctl show29c5cc05-2a31-4f14-a304-3a697698f96a Bridge "br0" Port "br0" Interface "br0" type: internal Port first_br tag: 100 Interface first_br ovs_version: "2.5.2"root@controller-VirtualBox 将接口移出vlanovs-vsctl clear port first_br tag 将first_br加入trunks中,并且标记tag110ovs-vsctl add Port first_br trunks 110,120 将trunks信息去除110 tag,剩余120 tagovs-vsctl remove Port first_br trunks 110 一次性完全把接口first_br清出trunkovs-vsctl clear Port first_br trunks 查看数据库Open_vSwitch内容ovsdb-client get-schema Open_vSwitch 查看ovs所有数据库内容,并格式化输出ovsdb-client dump 列出当前ovs工作数据库ovsdb-client list-dbs 12root@controller-VirtualBox:~# ovsdb-client list-dbsOpen_vSwitch 列出当前ovs工作数据库中的表格ovsdb-client list-tables Open_vSwitch12345678910111213141516171819root@controller-VirtualBox:~# ovsdb-client list-tables Open_vSwitchTable-------------------------ControllerBridgeQueueIPFIXNetFlowOpen_vSwitchQoSPortsFlowSSLFlow_Sample_Collector_SetMirrorFlow_TableInterfaceAutoAttachManager 列出Open_vSwitch中表格Port所有类型和范围信息ovsdb-client list-columns Open_vSwitch Port 12345678910111213141516171819202122232425oot@controller-VirtualBox:~# ovsdb-client list-columns Open_vSwitch PortColumn Type----------------- -----------------------------------------------------------------------------------------------------bond_downdelay "integer"name "string"statistics {"key":"string","max":"unlimited","min":0,"value":"integer"}fake_bridge "boolean"mac {"key":"string","min":0}trunks {"key":{"maxInteger":4095,"minInteger":0,"type":"integer"},"max":4096,"min":0}_uuid "uuid"tag {"key":{"maxInteger":4095,"minInteger":0,"type":"integer"},"min":0}rstp_status {"key":"string","max":"unlimited","min":0,"value":"string"}_version "uuid"bond_updelay "integer"bond_active_slave {"key":"string","min":0}external_ids 
{"key":"string","max":"unlimited","min":0,"value":"string"}other_config {"key":"string","max":"unlimited","min":0,"value":"string"}status {"key":"string","max":"unlimited","min":0,"value":"string"}qos {"key":{"refTable":"QoS","type":"uuid"},"min":0}bond_mode {"key":{"enum":["set",["active-backup","balance-slb","balance-tcp"]],"type":"string"},"min":0}bond_fake_iface "boolean"interfaces {"key":{"refTable":"Interface","type":"uuid"},"max":"unlimited"}vlan_mode {"key":{"enum":["set",["access","native-tagged","native-untagged","trunk"]],"type":"string"},"min":0}rstp_statistics {"key":"string","max":"unlimited","min":0,"value":"integer"}lacp {"key":{"enum":["set",["active","off","passive"]],"type":"string"},"min":0} 列出Port表格的数据信息ovsdb-client dump Port 查看ovs 接口信息详情ovs-vsctl list Port 1234567891011121314151617181920212223242526272829303132333435363738394041424344root@controller-VirtualBox:~# ovs-vsctl list Port_uuid : f83add01-bac5-4c63-859c-068a2cd2535dbond_active_slave : []bond_downdelay : 0bond_fake_iface : falsebond_mode : []bond_updelay : 0external_ids : {}fake_bridge : falseinterfaces : [31edb9df-45bc-45ba-90b9-d9e78211ecc6]lacp : []mac : []name : first_brother_config : {}qos : []rstp_statistics : {}rstp_status : {}statistics : {}status : {}tag : []trunks : [120]vlan_mode : []_uuid : 7cb41bb3-ae24-473d-b93a-df5786fb177cbond_active_slave : []bond_downdelay : 0bond_fake_iface : falsebond_mode : []bond_updelay : 0external_ids : {}fake_bridge : falseinterfaces : [0f86bf9b-c702-481c-b1e6-0919f1d06c2a]lacp : []mac : []name : "br0"other_config : {}qos : []rstp_statistics : {}rstp_status : {}statistics : {}status : {}tag : []trunks : []vlan_mode : [] 查看ovs交换机学习到的MAC表项123456root@compute:~# ovs-appctl fdb/show br-int port VLAN MAC Age 3 2 fa:16:3e:1f:95:2c 0 2 1 fa:16:3e:f3:31:f5 0 6 1 fa:16:3e:2c:c3:0d 0 4 2 fa:16:3e:3c:d2:56 0 
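上面 fdb/show 输出的 port/VLAN/MAC 学习表,其学习与查表逻辑可以用如下 Python 示意(纯演示模型,与 OVS 内部实现无关;表项数据取自上面的输出):

```python
# 极简的 MAC 学习表模型:key 为 (VLAN, MAC),value 为端口号
class MacTable:
    def __init__(self):
        self.entries = {}

    def learn(self, vlan, mac, port):
        # 收到数据帧时,用源 MAC + VLAN 记录其来源端口
        self.entries[(vlan, mac.lower())] = port

    def lookup(self, vlan, mac):
        # 目的 MAC 查表命中则单播转发到对应端口,未命中需在 VLAN 内泛洪
        return self.entries.get((vlan, mac.lower()))

fdb = MacTable()
fdb.learn(2, "fa:16:3e:1f:95:2c", 3)
fdb.learn(1, "fa:16:3e:f3:31:f5", 2)
fdb.learn(1, "fa:16:3e:2c:c3:0d", 6)
fdb.learn(2, "fa:16:3e:3c:d2:56", 4)

print(fdb.lookup(2, "FA:16:3E:1F:95:2C"))  # 大小写不敏感,命中端口 3
print(fdb.lookup(1, "fa:16:3e:3c:d2:56"))  # VLAN 不同,未命中
```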
此命令可运用在进行normal转发的ovs交换机上,如果ovs交换机没有进行normal转发,请通过openflow表项探究流量转发路径。","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"虚拟交换机","slug":"虚拟交换机","permalink":"http://vcpu.me/tags/虚拟交换机/"},{"name":"cmd","slug":"cmd","permalink":"http://vcpu.me/tags/cmd/"}]},{"title":"openvSwitch flow","slug":"openvswitch3","date":"2017-08-07T10:00:00.000Z","updated":"2017-08-07T07:39:02.000Z","comments":true,"path":"openvswitch3/","link":"","permalink":"http://vcpu.me/openvswitch3/","excerpt":"OpenvSwitch flowOpenFlow 是用于管理交换机流表的协议,ovs-ofctl 则是 OVS 提供的命令行工具。在没有配置 OpenFlow 控制器的模式下,可以使用 ovs-ofctl 命令通过 OpenFlow 协议去连接 OVS,创建、修改或删除 OVS 中的流表项,并对 OVS 的运行状况进行动态监控。 ovs-ofctl流表操作命令ovs-ofctl add-flow/add-flows/mod-flows “流表匹配条件,actions=[target][,target…]” 流表匹配条件在 OVS 中,流表项作为 ovs-ofctl 的参数,采用如下的格式:字段=值。如果有多个字段,可以用逗号或者空格分开","text":"OpenvSwitch flowOpenFlow 是用于管理交换机流表的协议,ovs-ofctl 则是 OVS 提供的命令行工具。在没有配置 OpenFlow 控制器的模式下,可以使用 ovs-ofctl 命令通过 OpenFlow 协议去连接 OVS,创建、修改或删除 OVS 中的流表项,并对 OVS 的运行状况进行动态监控。 ovs-ofctl流表操作命令ovs-ofctl add-flow/add-flows/mod-flows “流表匹配条件,actions=[target][,target…]” 流表匹配条件在 OVS 中,流表项作为 ovs-ofctl 的参数,采用如下的格式:字段=值。如果有多个字段,可以用逗号或者空格分开 流表执行的动作流表可以有多个执行动作,是从左向右依次执行,常用动作如下: output:port: 输出数据包到指定的端口。port 是指端口的 OpenFlow 端口编号 group:group_id 输出数据包到openflow group。group_id是指openflow group的id enqueue:port:queue 将数据包放到openflow port端口的queue号队列中 mod_vlan_vid: vlan_id 修改数据包中的 VLAN tag为vlan_id,如果数据包中无tag,则添加。如果数据包中已经是vlan_id,同时调整其vlan优先级为0 strip_vlan: 移除数据包中的 VLAN tag mod_dl_src/ mod_dl_dst: 修改源或者目标的 MAC 地址信息 mod_nw_src/mod_nw_dst: 修改源或者目标的 IPv4 地址信息 mod_tp_src/mod_tp_dst: 修改数据包的TCP/UDP/SCTP源或者目的端口 drop 将数据包丢弃 resubmit:port: 替换流表的 in_port 字段,并重新进行匹配 load:value->dst[start..end]: 写数据到指定的字段 move:src[start..end]->dst[start..end] 含义未探究清楚、待定 learn(argument[,argument]…) 含义未探究清楚、待定 normal
按照常规L2/L3处理流程处理数据包 flood 将数据包输出到除该数据包输入口外和不可被flooding端口外的所有物理端口 all 将数据包输出到除了该数据包的输入口外的所有物理口 local 将数据包输出到与bridge同名的端口 in_port 将数据包输出到其输入口 controller(key=value) 将数据包以“packet in”消息形式发给openflow控制器max_len=nbytes 将数据包的nbytes字节数据发给控制器reason=reason 指明“packet in” reason; action(默认reason)、no_match、invalid_ttlid=controller-id 指明要发送给的控制器id mod_nw_tos:tos 修改ip头的服务类型tos中的高六位(修改数值4的倍数[0,255]之间) mod_nw_ecn:ecn 修改ip头的服务类型tos中底2位 mod_nw_ttl:ttl 修改TTL[0,255] set_tunnel:id flow table使用实例实验要求 创建一个Virtual Switch br0 包含下面四个Port:first_br, truck portsecond_br, VLAN 20third_br, forth_br VLAN 30 包含五个flow table:Table 0: Admission control.Table 1: VLAN input processing.Table 2: Learn source MAC and VLAN for ingress port.Table 3: Look up learned port for destination MAC and VLAN.Table 4: Output processing 实验拓扑12345678910111213141516171819202122232425 +----+ +----+ | | first_if | | second_if +----+ +----+ | | | | | | | | | first_br | second_br +---+ +---++-------------------------------------------------------+| +---+ +---+ || || || br0 || || || +---+ +---+ |+-------------------------------------------------------+ +---+ third_br +---+ forth_br | | | | | | | third_if | forth_if +---+ +----+ +---+ +----+ 实验拓扑基础环境实现脚本1234567891011121314151617181920ovs-vsctl add-br br0ip link add first_br type veth peer name firstip link add second_br type veth peer name second_ifip link add third_br type veth peer name third_ifip link add forth_br type veth peer name forth_ifovs-vsctl add-port br0 first_br -- set Interface first_br ofport_request=1ovs-vsctl add-port br0 second_br -- set Interface second_br ofport_request=2ovs-vsctl add-port br0 third_br -- set Interface third_br ofport_request=3ovs-vsctl add-port br0 forth_br -- set Interface forth_br ofport_request=4ip link set first_if upip link set first_br upip link set second_br upip link set second_if upip link set third_if upip link set third_br upip link set forth_br upip link set forth_if up 脚本执行完成后,查看flow策略123root@controller-VirtualBox:~# ovs-ofctl 
dump-flows br0NXST_FLOW reply (xid=0x4): cookie=0x0, duration=52.681s, table=0, n_packets=0, n_bytes=0, idle_age=52, priority=0 actions=NORMA flow table 0 策略(基础策略,所有数据包都先经过我)数据包会首先进入flow table0,因此这里的flow table策略相当于总开关 多播不允许进入ovs br01ovs-ofctl add-flow br0 "table=0, dl_src=01:00:00:00:00:00/01:00:00:00:00:00, actions=drop" STP报文丢弃1ovs-ofctl add-flow br0 "table=0, dl_dst=01:80:c2:00:00:00/ff:ff:ff:ff:ff:f0, actions=drop" 如果非上述两种类数据包交给table 1策略处理1ovs-ofctl add-flow br0 "table=0, priority=0, actions=resubmit(,1)" 测试 命中 table=0 cookie=0 dl_dst=01:80:c2:00:00:00/ff:ff:ff:ff:ff:f0OpenFlow actions=drop 12345678910root@controller-VirtualBox:~# ovs-appctl ofproto/trace br0 in_port=1,dl_dst=01:80:c2:00:00:05Bridge: br0Flow: in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=01:80:c2:00:00:05,dl_type=0x0000Rule: table=0 cookie=0 dl_dst=01:80:c2:00:00:00/ff:ff:ff:ff:ff:f0OpenFlow actions=dropFinal flow: unchangedMegaflow: recirc_id=0,in_port=1,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=01:80:c2:00:00:00/ff:ff:ff:ff:ff:f0,dl_type=0x0000Datapath actions: drop 命中table=0 cookie=0 priority=0 OpenFlow actions=resubmit(,1) 123456root@controller-VirtualBox:~# ovs-appctl ofproto/trace br0 in_port=1,dl_dst=01:80:c2:00:00:10Bridge: br0Flow: in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=01:80:c2:00:00:10,dl_type=0x0000Rule: table=0 cookie=0 priority=0OpenFlow actions=resubmit(,1) flow table 1 策略(搭建含有trunk、vlan概念的基本组网信息)table 1 默认策略为数据包丢弃1ovs-ofctl add-flow br0 "table=1, priority=0, actions=drop" table 1 从first_br(in_port=1)进入的数据包继续交给table 2处理1ovs-ofctl add-flow br0 "table=1, priority=99, in_port=1, actions=resubmit(,2)" 测试:从first_br进入br0 tag为5的数据包直接交给table2处理 123456789101112root@controller-VirtualBox:~# ovs-appctl ofproto/trace br0 in_port=1,vlan_tci=5Bridge: br0Flow: in_port=1,vlan_tci=0x0005,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000Rule: table=0 cookie=0 priority=0OpenFlow actions=resubmit(,1) Resubmitted flow: 
in_port=1,vlan_tci=0x0005,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000 Resubmitted regs: reg0=0x0 reg1=0x0 reg2=0x0 reg3=0x0 reg4=0x0 reg5=0x0 reg6=0x0 reg7=0x0 Resubmitted odp: drop Resubmitted megaflow: recirc_id=0,in_port=1,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:00:00:00/ff:ff:ff:ff:ff:f0,dl_type=0x0000 Rule: table=1 cookie=0 priority=99,in_port=1 OpenFlow actions=resubmit(,2) 从其它口进入数据包,没有tag加上对应tag,有tag则不动12345ovs-ofctl add-flows br0 - <<'EOF' table=1, priority=99, in_port=2, vlan_tci=0, actions=mod_vlan_vid:20, resubmit(,2) table=1, priority=99, in_port=3, vlan_tci=0, actions=mod_vlan_vid:30, resubmit(,2) table=1, priority=99, in_port=4, vlan_tci=0, actions=mod_vlan_vid:30, resubmit(,2) EOF 执行完上述标签操作后,其拓扑图更改如下123456789101112131415161718192021222324252627 +----+ +----+ | | first_if | | second_if +----+ +----+ | | | | | | | | vlan 20trunk口 | | | first_br | second_br +---+ +---+ +-------------------------------------------------------+ | +---+ +---+ | | | | | | br0 | | | | | | +---+ +---+ | +-------------------------------------------------------+ +---+ third_br +---+ forth_br | | | vlan30 | vlan30 | | | | | third_if | forth_if +---+ +----+ +---+ +----+ 测试1:从接口second_br进入br0数据包被加上tag20标签交给了table2处理 123456789101112131415161718192021222324root@controller-VirtualBox:~# ovs-appctl ofproto/trace br0 in_port=2Bridge: br0Flow: in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000Rule: table=0 cookie=0 priority=0OpenFlow actions=resubmit(,1) Resubmitted flow: in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000 Resubmitted regs: reg0=0x0 reg1=0x0 reg2=0x0 reg3=0x0 reg4=0x0 reg5=0x0 reg6=0x0 reg7=0x0 Resubmitted odp: drop Resubmitted megaflow: recirc_id=0,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:00:00:00/ff:ff:ff:ff:ff:f0,dl_type=0x0000 Rule: table=1 cookie=0 priority=99,in_port=2,vlan_tci=0x0000 OpenFlow 
actions=mod_vlan_vid:20,resubmit(,2) Resubmitted flow: in_port=2,dl_vlan=20,dl_vlan_pcp=0,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000 Resubmitted regs: reg0=0x0 reg1=0x0 reg2=0x0 reg3=0x0 reg4=0x0 reg5=0x0 reg6=0x0 reg7=0x0 Resubmitted odp: drop Resubmitted megaflow: recirc_id=0,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:00:00:00/ff:ff:ff:ff:ff:f0,dl_type=0x0000 Rule: table=254 cookie=0 priority=0,reg0=0x2 OpenFlow actions=dropFinal flow: in_port=2,dl_vlan=20,dl_vlan_pcp=0,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000Megaflow: recirc_id=0,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:00:00:00/ff:ff:ff:ff:ff:f0,dl_type=0x0000Datapath actions: drop 测试2:tag 5的数据包从second_br进入br0,命中默认丢包策略1234567891011121314151617root@controller-VirtualBox:~# ovs-appctl ofproto/trace br0 in_port=2,vlan_tci=5Bridge: br0Flow: in_port=2,vlan_tci=0x0005,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000Rule: table=0 cookie=0 priority=0OpenFlow actions=resubmit(,1) Resubmitted flow: in_port=2,vlan_tci=0x0005,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,dl_type=0x0000 Resubmitted regs: reg0=0x0 reg1=0x0 reg2=0x0 reg3=0x0 reg4=0x0 reg5=0x0 reg6=0x0 reg7=0x0 Resubmitted odp: drop Resubmitted megaflow: recirc_id=0,in_port=2,vlan_tci=0x0001/0x0001,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:00:00:00/ff:ff:ff:ff:ff:f0,dl_type=0x0000 Rule: table=1 cookie=0 priority=0 OpenFlow actions=dropFinal flow: unchangedMegaflow: recirc_id=0,in_port=2,vlan_tci=0x0001/0x1001,dl_src=00:00:00:00:00:00/01:00:00:00:00:00,dl_dst=00:00:00:00:00:00/ff:ff:ff:ff:ff:f0,dl_type=0x0000Datapath actions: drop flow table 2 策略(让br0具有mac学习能力)学习port-mac-vlan tag 数据表将学习结果放到table10,学习后交给table3继续处理1ovs-ofctl add-flow br0 "table=2 actions=learn(table=10, NXM_OF_VLAN_TCI[0..11], NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[], load:NXM_OF_IN_PORT[]->NXM_NX_REG0[0..15]), resubmit(,3)" 
learn表示这是一个学习的action table 10,这是一个MAC learning table,学习的结果会放在这个table中。 NXM_OF_VLAN_TCI这个是VLAN Tag,在MAC Learning table中,每一个entry都是仅仅对某一个VLAN来说的,不同VLAN的learning table是分开的。在学习的结果的entry中,会标出这个entry是对于哪个VLAN的。 NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[]这个的意思是当前包里面的MAC Source Address会被放在学习结果的entry里面的dl_dst里面。这是因为每个switch都是通过Ingress包来学习,某个MAC从某个port进来,switch就应该记住以后发往这个MAC的包要从这个port出去,因而MAC source address就被放在了Mac destination address里面,因为这是为发送用的。 NXM_OF_IN_PORT[]->NXM_NX_REG0将port放入register. 一般对于学习的entry还需要有hard_timeout,这样每个学习结果都会expire,需要重新学习。 table2 转发表项学习测试产生从接口first_br进入br0 vlan20的数据包1ovs-appctl ofproto/trace br0 in_port=1,vlan_tci=20,dl_src=50:00:00:00:00:01 -generate 结果:执行 ovs-ofctl dump-flows br0时候学习到了一个table10表项1cookie=0x0, duration=3.235s, table=10, n_packets=0, n_bytes=0, idle_age=3, vlan_tci=0x0014/0x0fff,dl_dst=50:00:00:00:00:01 actions=load:0x1->NXM_NX_REG0[0..15] table 3策略(把table2学习到的转发策略运用起来)针对要发送的数据包,根据学习结果table10进行转发,不用每次flood下面命令含义是:先到table10查找,如果找不到相应策略再交给table4处理1ovs-ofctl add-flow br0 "table=3 priority=50 actions=resubmit(,10), resubmit(,4)" 查找学习表项排除多播和广播如果数据包为多播或者广播,就不需要查找table10,直接交给table4处理1ovs-ofctl add-flow br0 "table=3 priority=99 dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,4)" table 4策略(让vlan和trunk隔离工作起来)对于接口first_br,是trunk port,如果数据包携带tag,就让其带着从该接口出去1ovs-ofctl add-flow br0 "table=4 reg0=1 actions=1" 对于接口234,发出时候需要去除vlan标签12345ovs-ofctl add-flows br0 - <<'EOF' table=4 reg0=2 actions=strip_vlan,2 table=4 reg0=3 actions=strip_vlan,3 table=4 reg0=4 actions=strip_vlan,4 EOF 对于broadcast,增加策略让其在一个广播域发送数据包12345ovs-ofctl add-flows br0 - <<'EOF' table=4 reg0=0 priority=99 dl_vlan=20 actions=1,strip_vlan,2 table=4 reg0=0 priority=99 dl_vlan=30 actions=1,strip_vlan,3,4 table=4 reg0=0 priority=50 actions=1 EOF 所以对于register = 0的,也即是broadcast的,属于vlan 20的,则从port 1, 2出去,属于vlan 30的,则从port 1, 3, 4出去。 table4 广播隔离测试从first_br来的tag=30数据包广播1ovs-appctl ofproto/trace br0 in_port=1,dl_dst=ff:ff:ff:ff:ff:ff,dl_vlan=30 从接口third_br(in_port
3)进入的数据包被加上tag30标签,转发给接口1接口41ovs-appctl ofproto/trace br0 in_port=3,dl_dst=ff:ff:ff:ff:ff:ff 首次产生的数据包,因同属于tag30会被发送给接口3和接口41ovs-appctl ofproto/trace br0 in_port=1,dl_vlan=30,dl_src=10:00:00:00:00:01,dl_dst=20:00:00:00:00:01 -generate 学习了20:00:00:00:00:01 in_port=1 tag 30 到table10 数据包回复时候,仅仅从接口1发出去1ovs-appctl ofproto/trace br0 in_port=4,dl_src=20:00:00:00:00:01,dl_dst=10:00:00:00:00:01 -generate 根据上述表项进行数据包转发,不需要flood数据包只需要发给接口1并且学习到了20:00:00:00:00:01 in_port=4 tag=30 数据包回复时候,仅仅发给接口41ovs-appctl ofproto/trace br0 in_port=1,dl_vlan=30,dl_src=10:00:00:00:00:01,dl_dst=20:00:00:00:00:01 -generate 根据上述表项进行数据包转发,不需要flood数据包只需要发给接口4","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"虚拟交换机","slug":"虚拟交换机","permalink":"http://vcpu.me/tags/虚拟交换机/"},{"name":"ovs-ofctl","slug":"ovs-ofctl","permalink":"http://vcpu.me/tags/ovs-ofctl/"}]},{"title":"openvSwitch tunnel","slug":"openvswitch2","date":"2017-08-03T10:00:00.000Z","updated":"2017-08-07T09:45:44.000Z","comments":true,"path":"openvswitch2/","link":"","permalink":"http://vcpu.me/openvswitch2/","excerpt":"ovs tunnel实验拓扑 实验拓扑搭建基础脚本12345678910111213141516171819#vbox虚拟机1ip netns add leftip link add name veth1 type veth peer name sw1-p1ip link set dev veth1 netns leftip netns exec left ifconfig veth1 10.0.0.1/24 upovs-vsctl add-br sw1ovs-vsctl add-port sw1 sw1-p1ip link set sw1-p1 up#vbox虚拟机2ip netns add rightip link add name veth1 type veth peer name sw2-p1ip link set dev veth1 netns rightip netns exec right ifconfig veth1 10.0.0.2/24 upovs-vsctl add-br sw2ovs-vsctl add-port sw2 sw2-p1ip link set sw2-p1 up","text":"ovs tunnel实验拓扑 实验拓扑搭建基础脚本12345678910111213141516171819#vbox虚拟机1ip netns add leftip link add name veth1 type veth peer name sw1-p1ip link set dev veth1 netns leftip netns exec left ifconfig veth1 10.0.0.1/24 
upovs-vsctl add-br sw1ovs-vsctl add-port sw1 sw1-p1ip link set sw1-p1 up#vbox虚拟机2ip netns add rightip link add name veth1 type veth peer name sw2-p1ip link set dev veth1 netns rightip netns exec right ifconfig veth1 10.0.0.2/24 upovs-vsctl add-br sw2ovs-vsctl add-port sw2 sw2-p1ip link set sw2-p1 up GRE TUNNEL 配置 vbox虚拟机1 1ovs-vsctl add-port sw1 tun0 -- set Interface tun0 type=gre options:remote_ip=10.30.10.141 vbox虚拟机2 1ovs-vsctl add-port sw2 tun0 -- set Interface tun0 type=gre options:remote_ip=10.30.10.145 查看配置 vbox虚拟机1 12345678910111213root@controller-VirtualBox:~# ovs-vsctl show29c5cc05-2a31-4f14-a304-3a697698f96a Bridge "sw1" Port "sw1" Interface "sw1" type: internal Port "sw1-p1" Interface "sw1-p1" Port "tun0" Interface "tun0" type: gre options: {remote_ip="10.30.10.141"} ovs_version: "2.5.2" vbox虚拟机2 1234567891011121314root@controller-VirtualBox:~# ovs-vsctl show0f92aaa0-5385-48f5-a54a-644e05652f59 Bridge "sw2" Port "sw2-p1" Interface "sw2-p1" Port "sw2" Interface "sw2" type: internal Port "tun0" Interface "tun0" type: gre options: {remote_ip="10.30.10.145"} ovs_version: "2.5.2"root@controller-VirtualBox 连通性测试,在vbox虚拟机1 left ns中ping虚拟机2 right1234root@controller-VirtualBox:~# ip netns exec left ping 10.0.0.2PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.953 ms64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.609 ms VXLAN TUNNEL配置 vbox虚拟机1 1ovs-vsctl add-port sw1 tun0 -- set Interface tun0 type=vxlan options:local_ip=10.30.10.145 options:remote_ip=10.30.10.141 vbox虚拟机2 1ovs-vsctl add-port sw2 tun0 -- set Interface tun0 type=vxlan options:local_ip=10.30.10.141 options:remote_ip=10.30.10.145 查看配置 vbox虚拟机1 12345678910111213root@controller-VirtualBox:~# ovs-vsctl show29c5cc05-2a31-4f14-a304-3a697698f96a Bridge "sw1" Port "sw1" Interface "sw1" type: internal Port "sw1-p1" Interface "sw1-p1" Port "tun0" Interface "tun0" type: vxlan options: {local_ip="10.30.10.145", remote_ip="10.30.10.141"} ovs_version: "2.5.2" 
vbox虚拟机2 1234567891011121314root@controller-VirtualBox:~# ovs-vsctl show0f92aaa0-5385-48f5-a54a-644e05652f59 Bridge "sw2" Port "sw2-p1" Interface "sw2-p1" Port "sw2" Interface "sw2" type: internal Port "tun0" Interface "tun0" type: vxlan options: {local_ip="10.30.10.141", remote_ip="10.30.10.145"} ovs_version: "2.5.2"root@controller-VirtualBox:~# 连通性测试,在vbox虚拟机1 left ns中ping虚拟机2 right123456root@controller-VirtualBox:~# ip netns exec left ping 10.0.0.2PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.905 ms64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.490 ms2 packets transmitted, 2 received, 0% packet loss, time 1001msrtt min/avg/max/mdev = 0.490/0.697/0.905/0.209 ms ipsec_gre配置如果运行失败或者无法创建接口,首先安装1apt-get install openvswitch* -y vbox虚拟机1 1ovs-vsctl add-port sw1 tun0 -- set interface tun0 type=ipsec_gre options:remote_ip=10.30.10.141 options:psk=test123 vbox虚拟机2 1ovs-vsctl add-port sw2 tun0 -- set interface tun0 type=ipsec_gre options:remote_ip=10.30.10.145 options:psk=test123 在ipsec gre tunnel情况下,我的环境中并没有完成连通,而且日志中也没有报错;具体原因还需要探究。当然我已经翻阅论坛,然并卵,哎。。。,暂且未解之谜吧","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"虚拟交换机","slug":"虚拟交换机","permalink":"http://vcpu.me/tags/虚拟交换机/"},{"name":"gre","slug":"gre","permalink":"http://vcpu.me/tags/gre/"},{"name":"vxlan","slug":"vxlan","permalink":"http://vcpu.me/tags/vxlan/"},{"name":"ovx tunnel","slug":"ovx-tunnel","permalink":"http://vcpu.me/tags/ovx-tunnel/"},{"name":"ipsec gre","slug":"ipsec-gre","permalink":"http://vcpu.me/tags/ipsec-gre/"}]},{"title":"openvSwitch 
mirror、Qos、STP","slug":"openvswitch1","date":"2017-08-03T10:00:00.000Z","updated":"2017-08-03T07:44:29.000Z","comments":true,"path":"openvswitch1/","link":"","permalink":"http://vcpu.me/openvswitch1/","excerpt":"ovs mirrorovs mirror介绍mirror可以将虚拟交换机上指定接口或者vlan的流量镜像复制出来 命令中对数据包选择选项:select_all,所有的包select_dst_portselect_src_portselect_vlan 将数据包镜像的命令:output_port (SPAN Switched Port ANalyzer 本地流量)output_vlan (RSPAN Remote Switched Port ANalyzer 远程流量) 实验拓扑","text":"ovs mirrorovs mirror介绍mirror可以将虚拟交换机上指定接口或者vlan的流量镜像复制出来 命令中对数据包选择选项:select_all,所有的包select_dst_portselect_src_portselect_vlan 将数据包镜像的命令:output_port (SPAN Switched Port ANalyzer 本地流量)output_vlan (RSPAN Remote Switched Port ANalyzer 远程流量) 实验拓扑 拓扑实现脚本12345678910111213141516171819202122232425262728293031323334353637383940ip netns add ns1ip netns add ns2ip netns add ns3ip netns add ns4ip netns add ns5ovs-vsctl add-br br0ovs-vsctl add-br br1ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip addr add 1.1.1.1/24 dev tap1ip netns exec ns1 ip link set tap1 upip netns exec ns1 ip link set lo upovs-vsctl add-port br0 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip addr add 1.1.1.2/24 dev tap2ip netns exec ns2 ip link set tap2 upip netns exec ns2 ip link set lo upip link add firstbr type veth peer name firstifip link add secondbr type veth peer name secondifip link add thirdbr type veth peer name thirdifovs-vsctl add-port br0 firstbrip link set firstif netns ns3ovs-vsctl add-port br0 secondbrovs-vsctl add-port br1 secondifovs-vsctl add-port br1 thirdbrip link set thirdif netns ns4ip netns exec ns3 ip link set firstif upip netns exec ns4 ip link set thirdif upip link set firstbr upip link set secondbr upip link set secondif upip link set thirdbr up 附加拓扑清除脚本 12345678910ip link del firstbr type veth peer name firstifip link del secondbr type veth peer name secondifip link del thirdbr type veth peer name thirdifovs-vsctl del-br br0ovs-vsctl
del-br br1ip netns del ns1ip netns del ns2ip netns del ns3ip netns del ns4 拓扑环境测试ip netns exec ns1 ping -c 2 1.1.1.1ip netns exec ns1 ping -c 2 1.1.1.2 将ns1 ping ns2的双向流量镜像到ns3(本地镜像)镜像命令1ovs-vsctl -- set bridge br0 mirrors=@m -- --id=@tap1 get Port tap1 -- --id=@firstbr get Port firstbr -- --id=@m create Mirror name=mirrortap1 select-dst-port=@tap1 select-src-port=@tap1 output-port=@firstbr 实验过程记录 ns1 ping ns2 (tap1: 1.1.1.1-> tap2: 1.1.1.2) 123456root@controller-VirtualBox:~/floodlight# ip netns exec ns1 ping -c 20 1.1.1.2PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.269 ms64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.050 ms64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.075 ms64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.047 ms 在ns3可以抓到流量 123456789root@controller-VirtualBox:~# ip netns exec ns3 tcpdump -i firstiftcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on firstif, link-type EN10MB (Ethernet), capture size 262144 bytes15:01:59.883866 IP 1.1.1.1 > 1.1.1.2: ICMP echo request, id 12123, seq 1, length 6415:01:59.883935 IP 1.1.1.2 > 1.1.1.1: ICMP echo reply, id 12123, seq 1, length 6415:02:00.903931 IP 1.1.1.1 > 1.1.1.2: ICMP echo request, id 12123, seq 2, length 6415:02:00.903951 IP 1.1.1.2 > 1.1.1.1: ICMP echo reply, id 12123, seq 2, length 6415:02:01.928521 IP 1.1.1.1 > 1.1.1.2: ICMP echo request, id 12123, seq 3, length 6415:02:01.928553 IP 1.1.1.2 > 1.1.1.1: ICMP echo reply, id 12123, seq 3, length 64 实验结论经过虚拟交换机配置,经过tap1的所有流量可以额外镜像复制到ns3 将ns1 ping ns2的双向流量镜像到ns4(远程镜像)设置secondbr、secondif、thirdbr为vlan110123ovs-vsctl set Port secondbr tag=110ovs-vsctl set Port secondif tag=110ovs-vsctl set Port thirdbr tag=110 其它: ovs-vsctl clear port secondbr tag 设置虚拟交换机br0:tap1接口镜像输出到vlan1101234ovs-vsctl -- set bridge br0 mirrors=@m -- --id=@tap1 get Port tap1 -- --id=@m create Mirror name=mirrortap1 select-dst-port=@tap1 select-src-port=@tap1 
output-vlan=110#配置虚拟交换机从vlan110来的,都output到vlan110ovs-vsctl -- set bridge br1 mirrors=@m -- --id=@m create Mirror name=mirrorvlan select-vlan=110 output-vlan=110 禁止学习vlan110的macovs-vsctl set bridge br0 flood-vlans=110ovs-vsctl set bridge br1 flood-vlans=110 实验过程记录 ns1 ping ns2 123456root@controller-VirtualBox:~/floodlight# ip netns exec ns1 ping -c 200 1.1.1.2PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.103 ms64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.056 ms64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.064 ms64 bytes from 1.1.1.2: icmp_seq=4 ttl=64 time=0.083 ms ns4 tcpdump抓到了远程镜像的报文 123456789root@controller-VirtualBox:~# ip netns exec ns4 tcpdump -i thirdiftcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on thirdif, link-type EN10MB (Ethernet), capture size 262144 bytes15:19:21.064214 IP 1.1.1.1 > 1.1.1.2: ICMP echo request, id 12180, seq 3, length 6415:19:21.064245 IP 1.1.1.2 > 1.1.1.1: ICMP echo reply, id 12180, seq 3, length 6415:19:22.088654 IP 1.1.1.1 > 1.1.1.2: ICMP echo request, id 12180, seq 4, length 6415:19:22.088681 IP 1.1.1.2 > 1.1.1.1: ICMP echo reply, id 12180, seq 4, length 6415:19:23.111975 IP 1.1.1.1 > 1.1.1.2: ICMP echo request, id 12180, seq 5, length 6415:19:23.112006 IP 1.1.1.2 > 1.1.1.1: ICMP echo reply, id 12180, seq 5, length 64 br0学习到了ns1和ns2接口的mac,br1不会学习到镜像流量的mac 123456root@controller-VirtualBox:~/floodlight# ovs-appctl fdb/show br0 port VLAN MAC Age 2 0 d2:82:aa:d7:3d:f2 150 1 0 b2:3f:c1:d1:16:7d 150root@controller-VirtualBox:~/floodlight# ovs-appctl fdb/show br1 port VLAN MAC Age 实验结论通过一定配置可实现虚拟交换机的远程镜像功能 清除实验环境的镜像配置12345ovs-vsctl clear Bridge br0 mirrorsovs-vsctl clear Bridge br0 flood_vlansovs-vsctl clear Bridge br1 mirrorsovs-vsctl clear Bridge br1 flood_vlans ovs QOS实验拓扑 拓扑实现脚本1234567891011121314151617181920212223242526272829303132333435363738394041ip netns add ns1ip netns add ns2ip netns add ns3ip netns add ns4ovs-vsctl add-br br0ovs-vsctl
add-br br1ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip addr add 1.1.1.1/24 dev tap1ip netns exec ns1 ip link set tap1 upip netns exec ns1 ip link set lo upovs-vsctl add-port br0 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip addr add 1.1.1.2/24 dev tap2ip netns exec ns2 ip link set tap2 upip netns exec ns2 ip link set lo upovs-vsctl add-port br0 tap3 -- set Interface tap3 type=internalip link set tap3 netns ns3ip netns exec ns3 ip addr add 1.1.1.3/24 dev tap3ip netns exec ns3 ip link set tap3 upip netns exec ns3 ip link set lo upip link add firstbr type veth peer name firstifovs-vsctl add-port br0 firstbrovs-vsctl add-port br1 firstifip link set firstbr upip link set firstif upovs-vsctl add-port br1 tap4 -- set Interface tap4 type=internalip link set tap4 netns ns4ip netns exec ns4 ip addr add 1.1.1.4/24 dev tap4ip netns exec ns4 ip link set tap4 upip netns exec ns4 ip link set lo up 拓扑销毁方法 1234567ip link del firstbr type veth peer name firstifovs-vsctl del-br br0ovs-vsctl del-br br1ip netns del ns1ip netns del ns2ip netns del ns3ip netns del ns4 测试验证方法 在服务端 12345678#在ns4串口中netserver开启流量测试服务端root@controller-VirtualBox:~# netserverStarting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPECroot@controller-VirtualBox:~# netstat -ntlpActive Internet connections (only servers)Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program nametcp6 0 0 :::12865 :::* LISTEN 14197/netserver 在测试端 123456789第四列为速率: 36669.53(即约36Gbps)root@controller-VirtualBox:~# netperf -H 1.1.1.4 -t UDP_STREAMMIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 1.1.1.4 (1.1.1.4) port 0 AF_INET : demoSocket Message Elapsed MessagesSize Size Time Okay Errors Throughputbytes bytes secs # # 10^6bits/sec212992 65507 10.00 700111 0 36688.91212992 10.00 699741 36669.53 设置接口firstif发包速率10Mbps12ovs-vsctl set Interface firstif ingress_policing_rate=10000ovs-vsctl 
set Interface firstif ingress_policing_burst=10000 测试结论:如下结论,从ns1->ns4 104741.35 从ns4->ns1 10.27 也就是当流量走入接口firstif后速率限制为10Mbps,导致到ns4位10Mbps,因此回程流量(ns4->ns1)为10Mbps左右 12345678root@controller-VirtualBox:~# netperf -H 1.1.1.4 -t UDP_STREAMMIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 1.1.1.4 (1.1.1.4) port 0 AF_INET : demoSocket Message Elapsed MessagesSize Size Time Okay Errors Throughputbytes bytes secs # # 10^6bits/sec212992 65507 10.00 1998882 0 104741.35212992 10.00 196 10.27 清除接口速率限制方法12ovs-vsctl set Interface firstif ingress_policing_rate=0ovs-vsctl set Interface firstif ingress_policing_burst=0 ovs stp环路拓扑 组成拓扑结构的脚本构成连通脚本12345678910111213141516171819202122232425262728293031323334353637383940414243ip netns add ns1ovs-vsctl add-br br1ovs-vsctl add-port br1 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip addr add 1.1.1.1/24 dev tap1ip netns exec ns1 ip link set tap1 upip netns add ns2ovs-vsctl add-br br2ovs-vsctl add-port br2 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip addr add 1.1.1.2/24 dev tap2ip netns exec ns2 ip link set tap2 upip netns add ns3ovs-vsctl add-br br3ovs-vsctl add-port br3 tap3 -- set Interface tap3 type=internalip link set tap3 netns ns3ip netns exec ns3 ip addr add 1.1.1.3/24 dev tap3ip netns exec ns3 ip link set tap3 upip link add br1-br2 type veth peer name br2-br1ip link add br1-br3 type veth peer name br3-br1ip link add br2-br3 type veth peer name br3-br2ip link set dev br1-br2 upip link set dev br2-br1 upip link set dev br1-br3 upip link set dev br3-br1 upip link set dev br2-br3 upip link set dev br3-br2 up#连接br1 <-> br2ovs-vsctl add-port br1 br1-br2ovs-vsctl add-port br2 br2-br1#连接br1<->br3ovs-vsctl add-port br1 br1-br3ovs-vsctl add-port br3 br3-br1#连接br2<->br3 一半ovs-vsctl add-port br2 br2-br3 拓扑清理脚本123456789ip link del br1-br2 type veth peer name br2-br1ip link del br1-br3 type veth peer name br3-br1ip link del br2-br3 type veth peer 
name br3-br2ovs-vsctl del-br br1ovs-vsctl del-br br2ovs-vsctl del-br br3ip netns del ns1ip netns del ns2ip netns del ns3 上述脚本执行完成后br3 br2 未完全连接,因此还未组成环路此时测试连通性: 连通脚本的测试命令12ip netns exec ns1 ping -c 3 1.1.1.2ip netns exec ns1 ping -c 4 1.1.1.3 测试结果1234567891011121314151617181920root@controller-VirtualBox:~# ip netns exec ns1 ping -c 3 1.1.1.2PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.225 ms64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.049 ms64 bytes from 1.1.1.2: icmp_seq=3 ttl=64 time=0.051 ms--- 1.1.1.2 ping statistics ---3 packets transmitted, 3 received, 0% packet loss, time 2031msrtt min/avg/max/mdev = 0.049/0.108/0.225/0.082 msroot@controller-VirtualBox:~# ip netns exec ns1 ping -c 4 1.1.1.3PING 1.1.1.3 (1.1.1.3) 56(84) bytes of data.64 bytes from 1.1.1.3: icmp_seq=1 ttl=64 time=0.219 ms64 bytes from 1.1.1.3: icmp_seq=2 ttl=64 time=0.051 ms64 bytes from 1.1.1.3: icmp_seq=3 ttl=64 time=0.046 ms64 bytes from 1.1.1.3: icmp_seq=4 ttl=64 time=0.050 ms--- 1.1.1.3 ping statistics ---4 packets transmitted, 4 received, 0% packet loss, time 3065msrtt min/avg/max/mdev = 0.046/0.091/0.219/0.074 ms 将br2和br3连通组成二层环路1ovs-vsctl add-port br3 br3-br2 环路后进行ping连通测试结果,时通时不通123456root@controller-VirtualBox:~# ip netns exec ns1 ping -c 30 1.1.1.2PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.436 ms--- 1.1.1.2 ping statistics ---18 packets transmitted, 1 received, 94% packet loss, time 17402msrtt min/avg/max/mdev = 0.436/0.436/0.436/0.000 ms 环路情况下分析 查询接口名称和接口索引的对应关系 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172root@controller-VirtualBox:~# ovs-ofctl show br1OFPT_FEATURES_REPLY (xid=0x2): dpid:0000bee79fe0e948n_tables:254, n_buffers:0capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IPactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src 
mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 1(tap1): addr:ff:7f:00:00:36:38 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max 2(br1-br2): addr:b2:86:95:fa:4b:b9 config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max 3(br1-br3): addr:12:a5:67:28:40:f1 config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max LOCAL(br1): addr:be:e7:9f:e0:e9:48 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps maxOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0root@controller-VirtualBox:~# ovs-ofctl show br2OFPT_FEATURES_REPLY (xid=0x2): dpid:00007a10ec6b6942n_tables:254, n_buffers:0capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IPactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 1(tap2): addr:ff:7f:00:00:36:38 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max 2(br2-br1): addr:0e:dd:29:5c:73:ec config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max 3(br2-br3): addr:4a:73:b5:67:af:fe config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max LOCAL(br2): addr:7a:10:ec:6b:69:42 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps maxOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0root@controller-VirtualBox:~# ovs-ofctl show br3OFPT_FEATURES_REPLY (xid=0x2): dpid:0000462c85b99a4dn_tables:254, n_buffers:0capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IPactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 1(tap3): addr:ff:7f:00:00:36:38 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max 2(br3-br1): addr:ca:ed:37:ef:7d:21 config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max 3(br3-br2): addr:ea:55:a1:2e:d9:d1 config: 0 state: 0 current: 10GB-FD COPPER speed: 10000 Mbps now, 0 Mbps max 
LOCAL(br3): addr:46:2c:85:b9:9a:4d config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps maxOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 查看ns1的arp表项,发现已经无法学习到1.1.1.2 mac地址 12345root@controller-VirtualBox:~# ip netns exec ns1 arp -nAddress HWtype HWaddress Flags Mask Iface1.1.1.3 ether 72:e4:76:db:2e:c5 C tap11.1.1.2 (incomplete) tap1root@controller-VirtualBox:~# 分别查看br1、br2、br3的mac转发表,发现br1和br2表项已经错乱 123456789101112131415root@controller-VirtualBox:~# ovs-appctl fdb/show br1 port VLAN MAC Age 2 0 3a:52:a3:d6:67:12 0 3 0 b2:86:95:fa:4b:b9 0 3 0 02:a1:bd:ae:56:75 0root@controller-VirtualBox:~# ovs-appctl fdb/show br2 port VLAN MAC Age 3 0 3a:52:a3:d6:67:12 0 2 0 b2:86:95:fa:4b:b9 0 2 0 02:a1:bd:ae:56:75 0root@controller-VirtualBox:~# ovs-appctl fdb/show br3 port VLAN MAC Age 2 0 3a:52:a3:d6:67:12 0 3 0 b2:86:95:fa:4b:b9 0 2 0 02:a1:bd:ae:56:75 0 表项错乱说明:br1连接ns1的tap1本应该从接口1学习到,结果从接口三学习到br2连接ns2的tap2本应该从接口1学习到,结果从接口三学习到 开启ovs stp解决上述二层环路问题ovs-vsctl list bridge查看stp默认是关闭的开启stp123ovs-vsctl set Bridge br1 stp_enable=trueovs-vsctl set Bridge br2 stp_enable=trueovs-vsctl set Bridge br3 stp_enable=true 经过测试网络恢复正常123456789101112root@controller-VirtualBox:~# ip netns exec ns1 ping -c 2 1.1.1.2PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.272 ms64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.057 ms--- 1.1.1.2 ping statistics ---2 packets transmitted, 2 received, 0% packet loss, time 1023msrtt min/avg/max/mdev = 0.057/0.164/0.272/0.108 msroot@controller-VirtualBox:~# ip netns exec ns1 ping -c 3 1.1.1.3PING 1.1.1.3 (1.1.1.3) 56(84) bytes of data.64 bytes from 1.1.1.3: icmp_seq=1 ttl=64 time=0.463 ms64 bytes from 1.1.1.3: icmp_seq=2 ttl=64 time=0.068 ms64 bytes from 1.1.1.3: icmp_seq=3 ttl=64 time=0.050 ms stp研究ovs-vsctl list bridge 获取到如下信息 br3 stp_root_path_cost=”0”br2 stp_root_path_cost=”2”br1 stp_root_path_cost=”2” 在br3-br1处抓包,并且另开窗口ns1 ping ns2 窗口1: ns1 ping ns21234root@controller-VirtualBox:~# ping
1.1.1.2PING 1.1.1.2 (1.1.1.2) 56(84) bytes of data.64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=0.502 ms64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=0.078 ms 窗口2:抓取到报文流量经过了br31234tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on br3-br1, link-type EN10MB (Ethernet), capture size 262144 bytes15:08:46.538492 IP 1.1.1.1 > 1.1.1.2: ICMP echo request, id 4525, seq 3, length 6415:08:46.538524 IP 1.1.1.2 > 1.1.1.1: ICMP echo reply, id 4525, seq 3, length 64 通过测试得知上述环形网络,已经被逻辑修剪为如下拓扑图,从此解决了ovs二层环路问题 stp相关命令总结 开启br1 stp 1ovs-vsctl set bridge br1 stp_enable=true 关闭br1 stp 1ovs-vsctl set Bridge br1 stp_enable=false 设置br1 stp cost 1ovs-vsctl set Port br2-br1 other_config:stp-path-cost=100 查询br1 stp状态 1ovs-vsctl get bridge br1 stp_enable 设置br1的stp优先级 1ovs-vsctl set bridge br1 other_config:stp-priority=0x7800 移除br1 stp设置 1ovs-vsctl clear bridge br1 other_config","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"虚拟交换机","slug":"虚拟交换机","permalink":"http://vcpu.me/tags/虚拟交换机/"},{"name":"mirror","slug":"mirror","permalink":"http://vcpu.me/tags/mirror/"},{"name":"Qos","slug":"Qos","permalink":"http://vcpu.me/tags/Qos/"},{"name":"STP","slug":"STP","permalink":"http://vcpu.me/tags/STP/"}]},{"title":"openvSwitch","slug":"openvswitch","date":"2017-07-31T10:00:00.000Z","updated":"2017-08-03T07:24:09.000Z","comments":true,"path":"openvswitch/","link":"","permalink":"http://vcpu.me/openvswitch/","excerpt":"openvSwitch简介openvSwitch 是SDN常用的虚拟交换机,其将普通交换机的数据平面和控制平面相分离,SDN交换机只负责数据的转发,而控制指令则由更上一级的控制器下发 OpenvSwitch(虚拟交换机)常用模块 ovs-vswitchd 主要模块、实现交换功能,含有支持流交换的Linux内核模块,基于流交换它和上层controller通信采用OPENFLOW协议,与ovsdb-server通信采用OVSDB协议,和内核通信采用netlink通信支持多个独立datapath(网桥) 12root 2225 1 0 20:05 ?
00:00:00 ovs-vswitchd: monitoring pid 2226 (healthy)root 2226 2225 0 20:05 ? 00:00:00 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor ovsdb-server 虚拟交换机配置信息(接口、交换内容、VLAN等)存放的数据库服务,ovs-vswitchd会根据ovsdb-server中的配置工作。ovsdb-server 和ovs-vswitchd 使用OVSDB(JSON-RPC)方式通信12root 2201 1 0 20:05 ? 00:00:00 ovsdb-server: monitoring pid 2202 (healthy)root 2202 2201 0 20:05 ? 00:00:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach --monitor","text":"openvSwitch简介openvSwitch 是SDN常用的虚拟交换机,其将普通交换机的数据平面和控制平面相分离,SDN交换机只负责数据的转发,而控制指令则由更上一级的控制器下发 OpenvSwitch(虚拟交换机)常用模块 ovs-vswitchd 主要模块、实现交换功能,含有支持流交换的Linux内核模块,基于流交换它和上层controller通信采用OPENFLOW协议,与ovsdb-server通信采用OVSDB协议,和内核通信采用netlink通信支持多个独立datapatch(网桥) 12root 2225 1 0 20:05 ? 00:00:00 ovs-vswitchd: monitoring pid 2226 (healthy)root 2226 2225 0 20:05 ? 00:00:00 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor ovsdb-server 虚拟交换机配置信息(接口、交换内容、VLAN等)存放的数据库服务,ovs-vswitchd会根据ovsdb-server中的配置工作。ovsdb-server 和ovs-vswitchd 使用OVSDB(JSON-RPC)方式通信12root 2201 1 0 20:05 ? 00:00:00 ovsdb-server: monitoring pid 2202 (healthy)root 2202 2201 0 20:05 ? 
00:00:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach --monitor openvSwitch编译编译环境:ubuntu16.04123456789101112131415161718192021cat build.shapt-get install build-essential libssl-dev linux-headers-$(uname -r) apt-get install graphviz autoconf automake bzip2 debhelper dh-autoreconf libssl-dev libtool openssl procps python-all python-qt4 python-twisted-conch python-zopeinterface python-six dkms module-assistant ipsec-tools racoon libc6-dev module-init-tools netbase python-argparse uuid-runtime -ygit clone https://github.com/openvswitch/ovs.gitcd ovs./boot.sh./configure --with-linux=/lib/modules/`uname -r`/buildmakemake installmake modules_install/sbin/modprobe openvswitchovsdb-tool create /usr/local/etc/openvswitch/conf.db vswitchd/vswitch.ovsschemaovsdb-server --remote=punix:/usr/local/var/run/openvswitch/db.sock \\ --remote=db:Open_vSwitch,Open_vSwitch,manager_options \\ --private-key=db:Open_vSwitch,SSL,private_key \\ --certificate=db:Open_vSwitch,SSL,certificate \\ --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert \\ --pidfile --detachovs-vsctl --no-wait initovs-vswitchd --pidfile --detach make modules_install 错误ubuntu16.04报错如下,该错误忽略即可 123456789101112131415161718192021222324252627282930313233343536373839root@controller-VirtualBox:~/ovs# make modules_installcd datapath/linux && make modules_installmake[1]: Entering directory '/root/ovs/datapath/linux'make -C /lib/modules/4.8.0-36-generic/build M=/root/ovs/datapath/linux modules_installmake[2]: Entering directory '/usr/src/linux-headers-4.8.0-36-generic' INSTALL /root/ovs/datapath/linux/openvswitch.koAt main.c:158:- SSL error:02001002:system library:fopen:No such file or directory: 
bss_file.c:175- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178sign-file: certs/signing_key.pem: No such file or directory INSTALL /root/ovs/datapath/linux/vport-geneve.koAt main.c:158:- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178sign-file: certs/signing_key.pem: No such file or directory INSTALL /root/ovs/datapath/linux/vport-gre.koAt main.c:158:- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178sign-file: certs/signing_key.pem: No such file or directory INSTALL /root/ovs/datapath/linux/vport-lisp.koAt main.c:158:- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178sign-file: certs/signing_key.pem: No such file or directory INSTALL /root/ovs/datapath/linux/vport-stt.koAt main.c:158:- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178sign-file: certs/signing_key.pem: No such file or directory INSTALL /root/ovs/datapath/linux/vport-vxlan.koAt main.c:158:- SSL error:02001002:system library:fopen:No such file or directory: bss_file.c:175- SSL error:2006D080:BIO routines:BIO_new_file:no such file: bss_file.c:178sign-file: certs/signing_key.pem: No such file or directory DEPMOD 4.8.0-36-genericmake[2]: Leaving directory '/usr/src/linux-headers-4.8.0-36-generic'depmod `sed -n 's/#define UTS_RELEASE "\\([^"]*\\)"/\\1/p' /lib/modules/4.8.0-36-generic/build/include/generated/utsrelease.h`make[1]: Leaving directory '/root/ovs/datapath/linux' 建立ovs接口连接两个namespace组成二层网络环境搭建拓扑123456789101112131415161718192021 br0 +--------------------------------------+ +--+ +--+ +---+ | tap1 tap2| +---+ | +--+ +--+ | | | | | | 
+--------------------------------------+ | | | | | | | | |+------------------+ +-------------------+| tap1 | | tap2 ||192.168.1.102/24 | | 192.168.1.101/24 || | | || | | || | | || namespace ns1 | | namespace ns2 || | | |+------------------+ +-------------------+ 实现脚本1234567891011121314151617181920ip netns add ns1ip netns add ns2ovs-vsctl add-br br0ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip link set dev tap1 upovs-vsctl add-port br0 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip link set dev tap2 upip netns exec ns1 ip addr add 192.168.1.102/24 dev tap1ip netns exec ns2 ip addr add 192.168.1.101/24 dev tap2ip netns exec ns1 ip link set lo upip netns exec ns2 ip link set lo upip netns exec ns1 ping -c 4 192.168.1.101ip netns exec ns1 ping -c 4 192.168.1.102 建立vlan二层网络环境搭建拓扑12345678910111213141516171819202122 br0 trunk vlan tag 10,11 br1 +------------------------+ +------------------------+ | | tag10 tag10 | | | trunk_br0 +-----------------------+trunk_br1 | | +-----------------------+ | | | tag11 tag11 | | |tap1 tap2 | | tap3 | +------------------------+ +------------------------+ |tag 10 tag11| tag10| | | | | | |192.168.1.101/24 | | 192.168.1.102/24 | 192.168.1.103/24 +-------+ +-------+ +-------+ | tap1 | | tap2 | |tap3 | | | | | | | | | | | | | | | | | | | | | | | | | +-------+ +-------+ +-------+ ns1 ns2 ns3 环境实现脚本1234567891011121314151617181920212223242526272829303132333435363738ip netns add ns1ip netns add ns2ovs-vsctl add-br br0ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalovs-vsctl set Port tap1 tag=10ip link set tap1 netns ns1ip netns exec ns1 ip link set dev tap1 upovs-vsctl add-port br0 tap2 -- set Interface tap2 type=internalovs-vsctl set Port tap2 tag=11ip link set tap2 netns ns2ip netns exec ns2 ip link set dev tap2 upip netns exec ns1 ip addr add 192.168.1.101/24 dev tap1ip netns exec ns2 ip addr add 192.168.1.102/24 dev tap2ip 
netns exec ns1 ip link set lo upip netns exec ns2 ip link set lo upovs-vsctl add-br br1ovs-vsctl add-port br1 tap3 -- set Interface tap3 type=internalovs-vsctl add-port br0 trunk_br0 trunks=10,11 -- set Interface trunk_br0 type=patch options:peer=trunk_br1ovs-vsctl add-port br1 trunk_br1 trunks=10,11 -- set Interface trunk_br1 type=patch options:peer=trunk_br0ip netns add ns3ip link set tap3 netns ns3ip netns exec ns3 ip addr add 192.168.1.103/24 dev tap3ip netns exec ns3 ip link set dev tap3 upovs-vsctl set Port tap3 tag=10ip netns exec ns3 ping -c 4 192.168.1.101ip netns exec ns3 ping -c 4 192.168.1.102 说明: br0和br1两个交换机之间连接使用的是patch口,在创建时候需要指明peer(对端口)选项 12ovs-vsctl add-port br0 trunk_br0 trunks=10,11 -- set Interface trunk_br0 type=patch options:peer=trunk_br1ovs-vsctl add-port br1 trunk_br1 trunks=10,11 -- set Interface trunk_br1 type=patch options:peer=trunk_br0 br0和br1两个交换机之间连接在trunk口附加上tag10和tag11 结论 ns3:tap3:vlan10 能ping通ns1:tap1:vlan10 因为ns3和ns1属于同一个vlan;同时无法ping通ns2 ovs vlan报文转发原理探究环境搭建拓扑1234567891011121314151617181920212223242526272829303132 first_ns second_ns third_ns +-----------+ +-----------+ +-----------+ | | | | | | | | | | | | | | | | | | | first_br | second_br | third_br| +-----------+ +-----------+ +-----------+10.0.0.4/24 10.0.0.5/24 | 10.0.0.6/24 | | | | | | |tag 10 | 无 tag | trunk 11,12 +------------------------------------------+ | first_br second_br third_br | | | br0 | | | +------------------------------------------+ | tag 10 | | | | |10.0.0.1/24 +------------+ | | | | | | | | +------------+ ns1 搭建网络脚本1234567891011121314151617181920212223242526272829303132 ovs-vsctl add-br br0ovs-vsctl add-port br0 first_br -- set Interface first_br type=internalovs-vsctl set Port first_br tag=10 ip netns add firstip link set first_br netns firstip netns exec first ip addr add 10.0.0.4/24 dev first_brip netns exec first ip link set dev first_br upip netns add ns1ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalovs-vsctl set Port tap1 tag=10ip 
link set tap1 netns ns1ip netns exec ns1 ip link set lo upip netns exec ns1 ip link set dev tap1 upip netns exec ns1 ip addr add 10.0.0.1/24 dev tap1ovs-vsctl add-port br0 second_br -- set Interface second_br type=internalip netns add secondip link set second_br netns secondip netns exec second ip addr add 10.0.0.5/24 dev second_brip netns exec second ip link set dev second_br upovs-vsctl add-port br0 third_br trunks=11,12 -- set Interface third_br type=internalip netns add thirdip link set third_br netns thirdip netns exec third ip addr add 10.0.0.6/24 dev third_brip netns exec third ip link set dev third_br up 实验过程:进入netns1,一直ping 10.0.0.4,在netns first、second、third分别抓包 实验记录 first抓取报文12345root@controller-VirtualBox:~# ip netns exec first tcpdump -n -e -i first_br arptcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on first_br, link-type EN10MB (Ethernet), capture size 262144 bytes15:47:54.636790 9a:03:f1:61:48:9d > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.0.0.4 tell 10.0.0.1, length 2815:47:54.636808 4e:cc:d6:5a:53:f4 > 9a:03:f1:61:48:9d, ethertype ARP (0x0806), length 42: Reply 10.0.0.4 is-at 4e:cc:d6:5a:53:f4, length 28 抓到arp广播包 second抓取报文1234root@controller-VirtualBox:~# ip netns exec second tcpdump -n -e -i second_br arptcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on second_br, link-type EN10MB (Ethernet), capture size 262144 bytes15:49:40.345271 9a:03:f1:61:48:9d > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 10, p 0, ethertype ARP, Request who-has 10.0.0.4 tell 10.0.0.1, length 28 抓到arp广播包 third抓取报文123root@controller-VirtualBox:~# ip netns exec third tcpdump -n -e -i third_br arptcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on third_br, link-type EN10MB (Ethernet), capture size 262144 bytes 没有抓到arp广播包 结论 trunk port(1)这个port不配置tag,配置trunks,如果trunks为空,则所有的VLAN都trunk,也就意味着对于所有的VLAN的包,本身带什么VLAN 
ID,就是携带着什么VLAN ID,(2)如果没有设置VLAN,就属于VLAN 0,全部允许通过。(3)如果trunks不为空,则仅仅带着这些VLAN ID的包通过。 access port(1)这个port配置tag,从这个port进来的包会被打上这个tag,(2)从其他的trunk port中进来的本身就带有VLAN ID的包,如果VLAN ID等于tag,则会从这个port发出,(3)从其他的access port上来的包,如果tag相同,也会被forward到这个port。(4)从access port发出的包不带VLAN ID。(5)如果一个本身带VLAN ID的包到达access port,即便VLAN ID等于tag,也会被抛弃。 ovs bonding链路冗余实验拓扑123456789101112131415161718192021 192.168.1.101/24+---------+ +-------------------------------+| | | || tap0+------------+tap0 br0 || | ++ || | | +-------------+ || | +---------|bond0|---------------++---------+ +-------------+ ns1 br0_tap0 | | br0_tap1 | | | | br1_tap0 | | br1_tap1 192.168.1.102/24 +--------------++---------+ +---------|bond1|---------------+| +------------+ +--------------+ || | | || | | br1 || | | || | +-------------------------------++---------+ ns2 环境搭建脚本1234567891011121314151617181920212223242526272829ovs-vsctl add-br br0ovs-vsctl add-br br1ip link add br0_tap0 type veth peer name br1_tap0ip link add br0_tap1 type veth peer name br1_tap1ip link set br0_tap0 upip link set br0_tap1 upip link set br1_tap0 upip link set br1_tap1 upovs-vsctl add-bond br0 bond0 br0_tap0 br0_tap1 ovs-vsctl add-bond br1 bond1 br1_tap0 br1_tap1ip netns add ns1ip netns add ns2ovs-vsctl add-port br0 tap1 -- set Interface tap1 type=internalip link set tap1 netns ns1ip netns exec ns1 ip link set dev tap1 upip netns exec ns1 ip addr add 192.168.1.101/24 dev tap1ovs-vsctl add-port br1 tap2 -- set Interface tap2 type=internalip link set tap2 netns ns2ip netns exec ns2 ip link set dev tap2 upip netns exec ns2 ip addr add 192.168.1.102/24 dev tap2ovs-vsctl set Port bond0 lacp=active ovs-vsctl set Port bond1 lacp=active 环境搭建完成后查看bond12345678910111213141516171819202122232425262728293031323334root@controller-VirtualBox:~# ovs-appctl bond/show---- bond1 ----bond_mode: active-backupbond may use recirculation: no, Recirc-ID : -1bond-hash-basis: 0updelay: 0 msdowndelay: 0 mslacp_status: negotiatedlacp_fallback_ab: falseactive slave mac: 
9e:d9:94:98:26:85(br1_tap0)slave br1_tap0: enabled active slave may_enable: trueslave br1_tap1: enabled may_enable: true---- bond0 ----bond_mode: active-backupbond may use recirculation: no, Recirc-ID : -1bond-hash-basis: 0updelay: 0 msdowndelay: 0 mslacp_status: negotiatedlacp_fallback_ab: falseactive slave mac: 6a:d8:f8:f3:2b:6d(br0_tap0)slave br0_tap0: enabled active slave may_enable: trueslave br0_tap1: enabled may_enable: true 查看结论: 默认建立的bond为主备模式(bond_mode: active-backup)br1_tap0和br0_tap0 流量走这对口 抓包结果123456789101112131415161718抓取 br0_tap0root@controller-VirtualBox:~# tcpdump -n -e -i br0_tap0tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on br0_tap0, link-type EN10MB (Ethernet), capture size 262144 bytes17:12:45.022854 6a:d8:f8:f3:2b:6d > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:12:45.023409 9e:d9:94:98:26:85 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:13:15.024627 6a:d8:f8:f3:2b:6d > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:13:15.025299 9e:d9:94:98:26:85 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:13:45.025411 6a:d8:f8:f3:2b:6d > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:13:45.025724 9e:d9:94:98:26:85 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110在ns1 ping ns2,流量只走在主机接口上(br1_tap0和br0_tap0)root@controller-VirtualBox:~# tcpdump -n -e -i br0_tap0tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on br0_tap0, link-type EN10MB (Ethernet), capture size 262144 bytes17:15:29.980585 d2:25:e0:f4:59:83 > 3a:df:e7:d3:68:84, ethertype IPv4 (0x0800), length 98: 192.168.1.101 > 192.168.1.102: ICMP echo request, id 6006, seq 4, length 6417:15:29.980608 3a:df:e7:d3:68:84 > d2:25:e0:f4:59:83, ethertype IPv4 (0x0800), length 98: 192.168.1.102 > 
192.168.1.101: ICMP echo reply, id 6006, seq 4, length 6417:15:31.003922 d2:25:e0:f4:59:83 > 3a:df:e7:d3:68:84, ethertype IPv4 (0x0800), length 98: 192.168.1.101 > 192.168.1.102: ICMP echo request, id 6006, seq 5, length 64 抓取br1_tap01234567root@controller-VirtualBox:~# tcpdump -n -e -i br1_tap0tcpdump: verbose output suppressed, use -v or -vv for full protocol decodelistening on br1_tap0, link-type EN10MB (Ethernet), capture size 262144 bytes17:14:15.027630 6a:d8:f8:f3:2b:6d > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:14:15.028468 9e:d9:94:98:26:85 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:14:45.028801 6a:d8:f8:f3:2b:6d > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 11017:14:45.029541 9e:d9:94:98:26:85 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110 另外补充 查看lacpovs-appctl lacp/show active-backup 主-备 无法提升吞吐 balance-slb, 根据包的 source MAC + vlan tag来均衡流量 balance-tcp, 根据包的 L2/L3/L4 header来均衡流量 balance-tcp必须让硬件交换机设置802.3ad,balance-slb则设不设均可,设了流量提高比较大。 ovs-vsctl set Port bond0 bond_mode=balance-slb 观察流量命令 cat /proc/net/dev LACP是链路汇聚控制协议(具体研究待定) openvswitch概念补充几个重要的概念 Bridge: Bridge 代表一个以太网交换机(Switch),一个主机中可以创建一个或者多个 Bridge 设备。 Port: 端口与物理交换机的端口概念类似,每个 Port 都隶属于一个 Bridge。 Interface: 连接到 Port 的网络接口设备。在通常情况下,Port 和 Interface 是一对一的关系, 只有在配置 Port 为 bond 模式后,Port 和 Interface 是一对多的关系。 Controller: OpenFlow 控制器。OVS 可以同时接受一个或者多个 OpenFlow 控制器的管理。 datapath: 在 OVS 中,datapath 负责执行数据交换,也就是把从接收端口收到的数据包在流表中进行匹配,并执行匹配到的动作。 Flow table: 每个 datapath 都和一个“flow table”关联,当 datapath 接收到数据之后, OVS 会在 flow table 中查找可以匹配的 flow,执行对应的操作, 例如转发数据到另外的端口。 运行原理内核模块实现了多个“数据路径(DataPath)”(类似于网桥),每个都可以有多个“vports”(类似于桥内的端口)。每个数据路径也通过关联流表(flow 
table)来设置操作,而这些流表中的流都是用户空间在报文头和元数据的基础上映射的关键信息,一般的操作都是将数据包转发到另一个vport。当一个数据包到达一个vport,内核模块所做的处理是提取其流的关键信息并在流表中查找这些关键信息。当有一个匹配的流时它执行对应的操作。如果没有匹配,它会将数据包送到用户空间的处理队列中(作为处理的一部分,用户空间可能会设置一个流用于以后遇到相同类型的数据包可以在内核中执行操作)。细节如下图所示: ovs管理组件 ovs-dpctl:一个工具,用来配置交换机内核模块,可以控制转发规则。 ovs-vsctl:主要是获取或者更改ovs-vswitchd的配置信息,此工具操作的时候会更新ovsdb-server中的数据库。 ovs-appctl:主要是向OVS守护进程发送命令的,一般用不上。 a utility that sends commands to running Open vSwitch daemons (ovs-vswitchd) ovsdbmonitor:GUI工具来显示ovsdb-server中数据信息。(Ubuntu下是可以使用apt-get安装,可以远程获取OVS数据库和OpenFlow的流表) ovs-controller:一个简单的OpenFlow控制器 ovs-ofctl:用来控制OVS作为OpenFlow交换机工作时候的流表内容。 ovs-pki:OpenFlow交换机创建和管理公钥框架; ovs-tcpundump:tcpdump的补丁,解析OpenFlow的消息; brocompat.ko : Linux bridge compatibility module openvswitch.ko : Open vSwitch switching datapath ovs流表实验网络拓扑 拓扑实现脚本12345678910111213141516171819202122232425ovs-vsctl add-br ovs-switchovs-vsctl add-port ovs-switch p0 -- set Interface p0 ofport_request=100ovs-vsctl set Interface p0 type=internalip netns add ns0ip link set p0 netns ns0 ip netns exec ns0 ip addr add 192.168.1.100/24 dev p0ip netns exec ns0 ifconfig p0 promisc upovs-vsctl add-port ovs-switch p1 -- set Interface p1 ofport_request=101ovs-vsctl set Interface p1 type=internalip netns add ns1ip link set p1 netns ns1ip netns exec ns1 ip addr add 192.168.1.101/24 dev p1 ip netns exec ns1 ifconfig p1 promisc upovs-vsctl add-port ovs-switch p2 -- set Interface p2 ofport_request=102ovs-vsctl set Interface p2 type=internalip netns add ns2ip link set p2 netns ns2ip netns exec ns2 ip addr add 192.168.1.102/24 dev p2ip netns exec ns2 ifconfig p2 promisc up ovs-ofctl show ovs-switch 显示虚拟交换机流表的统计信息12345678910111213141516171819202122root@controller-VirtualBox:~# ovs-ofctl show ovs-switchOFPT_FEATURES_REPLY (xid=0x2): dpid:0000aaa801224e40n_tables:254, n_buffers:0capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IPactions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src 
mod_tp_dst 100(p0): addr:fc:7f:00:00:36:b8 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max 101(p1): addr:56:a7:a1:27:a9:b8 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max 102(p2): addr:d2:af:d5:d2:c9:52 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps max LOCAL(ovs-switch): addr:aa:a8:01:22:4e:40 config: PORT_DOWN state: LINK_DOWN speed: 0 Mbps now, 0 Mbps maxOFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 ovs-dpctl show 查看内核接口状态12345678910root@controller-VirtualBox:~# ovs-dpctl showsystem@ovs-system: lookups: hit:334 missed:22 lost:0 flows: 0 masks: hit:371 total:0 hit/pkt:1.04 port 0: ovs-system (internal) port 1: ovs-switch (internal) port 2: p0 (internal) port 3: p1 (internal) port 4: p2 (internal) ovs-ofctl dump-flows ovs-switch查看ovs-switch的流表接口p1发出去的icmp发包源地址变成10.10.10.10ovs-ofctl del-flows ovs-switch “in_port=100”删除in_port=100的流表接口p1发出去的icmp报文源地址均变成10.10.10.101ovs-ofctl add-flow ovs-switch "priority=1 idle_timeout=0,in_port=100,actions=mod_nw_src:10.10.10.10,normal" 所有的icmp定向到p2(in_port=102)1ovs-ofctl add-flow ovs-switch idle_timeout=0,dl_type=0x0800,nw_proto=1,actions=output:102 在该组网下数据包转发过程测试产生数据包源 p0(06:fc:2c:fd:84:f5) -> p1(56:a7:a1:27:a9:b8)123456789101112root@controller-VirtualBox:~# ovs-appctl ofproto/trace ovs-switch in_port=100,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8 -generateFlow: in_port=100,vlan_tci=0x0000,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8,dl_type=0x0000bridge("ovs-switch")-------------------- 0. 
priority 0 NORMAL -> no learned MAC for destination, floodingFinal flow: unchangedMegaflow: recirc_id=0,in_port=100,vlan_tci=0x0000/0x1fff,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8,dl_type=0x0000Datapath actions: 1,3,4 数据包转发过程:上述产生的数据包最终被广播到接口1,3,4 port 1: ovs-switch (internal) port 2: p0 (internal) port 3: p1 (internal) port 4: p2 (internal) 配置接口p1为tag101后 产生数据包源 p0(06:fc:2c:fd:84:f5) -> p1(56:a7:a1:27:a9:b8)12345678910111213root@controller-VirtualBox:~# ovs-vsctl set Port p1 tag=101root@controller-VirtualBox:~# ovs-appctl ofproto/trace ovs-switch in_port=100,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8 -generateFlow: in_port=100,vlan_tci=0x0000,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8,dl_type=0x0000bridge("ovs-switch")-------------------- 0. priority 0 NORMAL -> no learned MAC for destination, floodingFinal flow: unchangedMegaflow: recirc_id=0,in_port=100,vlan_tci=0x0000/0x1fff,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8,dl_type=0x0000 数据包转发过程:上述产生数据包最终被广播到接口1,4 因为接口2已经配置了tag101和产生数据包不在同一个广播域 修改从p0口发出的数据包强制加上vlantag 101, 产生数据包p0(06:fc:2c:fd:84:f5) -> p1(56:a7:a1:27:a9:b8)12345678910111213141516ovs-ofctl add-flow ovs-switch "priority=3,in_port=100,dl_vlan=0xffff,actions=mod_vlan_vid:101,normal"root@controller-VirtualBox:~# ovs-appctl ofproto/trace ovs-switch in_port=100,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8 -generateFlow: in_port=100,vlan_tci=0x0000,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8,dl_type=0x0000bridge("ovs-switch")-------------------- 0. 
in_port=100,vlan_tci=0x0000, priority 3 mod_vlan_vid:101 NORMAL -> learned that 06:fc:2c:fd:84:f5 is on port p0 in VLAN 101 -> forwarding to learned portFinal flow: in_port=100,dl_vlan=101,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8,dl_type=0x0000Megaflow: recirc_id=0,in_port=100,vlan_tci=0x0000,dl_src=06:fc:2c:fd:84:f5,dl_dst=56:a7:a1:27:a9:b8,dl_type=0x0000Datapath actions: 3 数据包转发过程 产生的流从p0->p1 ,数据包被完整转发到port 3 ,也就是说明该数据包被完整添加上了tag101 产生一条p1->p0流量,该流量携带tag101,发送给接口p0123456789101112root@controller-VirtualBox:~# ovs-appctl ofproto/trace ovs-switch in_port=101,dl_dst=06:fc:2c:fd:84:f5,dl_src=56:a7:a1:27:a9:b8 -generateFlow: in_port=101,vlan_tci=0x0000,dl_src=56:a7:a1:27:a9:b8,dl_dst=06:fc:2c:fd:84:f5,dl_type=0x0000bridge("ovs-switch")-------------------- 0. priority 0 NORMAL -> forwarding to learned portFinal flow: unchangedMegaflow: recirc_id=0,in_port=101,vlan_tci=0x0000/0x1fff,dl_src=56:a7:a1:27:a9:b8,dl_dst=06:fc:2c:fd:84:f5,dl_type=0x0000Datapath actions: push_vlan(vid=101,pcp=0),2 ovs配合使用Controller之一 Floodlight操作系统环境:ubuntu 16.04ovs版本信息:(Open vSwitch) 2.5.210.30.10.145/24 依赖包安装12apt-get install build-essential ant maven python-devapt-get install build-essential openjdk-8-jdk ant maven python-dev 下载floodlight12345678git clone git://github.com/floodlight/floodlight.gitcd floodlight/git submodule initgit submodule updateantjava -jar target/floodlight.jar# nohup java -jar target/floodlight.jar > floodlight.log 2>&1 & 在被管理的ovs上虚拟机上执行加入命令ovs-vsctl set-controller ovs-switch tcp:10.30.10.145:6653ovs-vsctl set Bridge ovs-switch fail-mode=secure 其它说明:(1)10.30.10.145为floodlight controller的安装服务地址(2)一堆教程说端口是6633,新版本发生了变化更改为了6653,具体变化版本号未研究(3)采用的是secure,不停的连接Controller 当 OVS 交换机连接到 Floodlight 控制器后,理论上所有的流表规则应该交给控制器来建立。由于 OVS 交换机和控制器之间是通过网络通讯来传递数据的,所以网络连接失败会影响到 Flow 的建立,为了处理该问题提供两种模式: standalone: 默认模式。如果 OVS 交换机超过三次无法正常连接到 OpenFlow 控制器,OVS 交换机自己会负责建立流表。在这种模式下,OVS 和常见的 L2 交换机相似。与此同时,OVS 也会继续尝试连接控制器,一旦网络连接恢复,OVS 会再次切换到使用控制器进行流表管理。 secure: 
在 secure 模式下,如果 OVS 无法正常连接到 OpenFlow 控制器,OVS 会不停的尝试与控制器重新建立连接,而不会自己负责建立流表。 访问测试http://10.30.10.145:8080/ui/pages/index.html 备注说明博文的篇幅太长有碍观瞻,深入研究内容请见后续博文openvSwitch XX。","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"ubuntun16.04","slug":"ubuntun16-04","permalink":"http://vcpu.me/tags/ubuntun16-04/"},{"name":"ovs","slug":"ovs","permalink":"http://vcpu.me/tags/ovs/"},{"name":"虚拟交换机","slug":"虚拟交换机","permalink":"http://vcpu.me/tags/虚拟交换机/"}]},{"title":"iptables","slug":"iptables","date":"2017-07-27T10:00:00.000Z","updated":"2017-07-27T03:24:52.000Z","comments":true,"path":"iptables/","link":"","permalink":"http://vcpu.me/iptables/","excerpt":"filter表根据已配置好的规则操作本机的包(INPUT)、转发的包(FORWARD)、本机发出的包(OUTPUT);最终决定是否放行(ACCEPT)、丢弃(DROP)、拒绝(REJECT)、产生告警日志(LOG)。 nat表根据配置好的规则,在包刚刚到达FW时修改目的地址(PREROUTING)、本机产生包修改目的地址(OUTPUT)、离开转发系统前修改源地址(POSTROUTING);最终实现网络SNAT、REDIRECT、DNAT mangle表主要用于修改数据包的TOS(Type Of Service,服务类型)、TTL(Time To Live,生存周期)值以及为数据包设置Mark标记,以实现Qos(Quality Of Service,服务质量)调整以及策略路由等应用,由于需要相应的路由设备支持,因此应用并不广泛。包含五个规则链——PREROUTING,POSTROUTING,INPUT,OUTPUT,FORWARD。","text":"filter表根据已配置好的规则操作本机的包(INPUT)、转发的包(FORWARD)、本机发出的包(OUTPUT);最终决定是否放行(ACCEPT)、丢弃(DROP)、拒绝(REJECT)、产生告警日志(LOG)。 nat表根据配置好的规则,在包刚刚到达FW时修改目的地址(PREROUTING)、本机产生包修改目的地址(OUTPUT)、离开转发系统前修改源地址(POSTROUTING);最终实现网络SNAT、REDIRECT、DNAT mangle表主要用于修改数据包的TOS(Type Of Service,服务类型)、TTL(Time To Live,生存周期)值以及为数据包设置Mark标记,以实现Qos(Quality Of Service,服务质量)调整以及策略路由等应用,由于需要相应的路由设备支持,因此应用并不广泛。包含五个规则链——PREROUTING,POSTROUTING,INPUT,OUTPUT,FORWARD。 iptables工作的netfilter框架挂载点转发流程 1.数据包从网络A经过网卡中断送入内核协议栈,首先进入prerouting挂载点,路由前的操作可加入到此挂载点完成,如图的对目的地址的修改,修改后的地址会继续进入转发路由查找,但是查找路由的目的地址变成了目的NAT后的地址 2.经过转发路由查找,如果是发给本机的报文,交给本机的应用程序处理;在具体的应用程序处理之前会经过input挂载点,如图你可以对上本机的报文进行限制,直接在input挂载点丢弃报文,使其无法到达具体的应用程序。 2-1.应用程序在处理完成后,根据本机路由表将数据发出,在本机报文发出前,会经过output挂载点,此处你的策略只针对于本机发出报文有效。如果你想修改本机发出报文的源地址,你需要在此实现。 3.经过转发路由查找,如果是转发的报文,在具体的转发报文之前,你可以通过在forward挂载点操作进行过滤等动作。 
4.无论是转发报文还是本机发出的报文,它们最终会经过postrouting挂载点,送到最终物理发包流程;也就是说postrouting是数据包离开本netfilter协议栈的最后一个流程,你可以在这个流程中修改源IP地址。 iptables基本命令 [-t 表名]:该规则所操作的哪个表,可以使用filter、nat等,如果没有指定则默认为filter -A:新增一条规则,到该规则链列表的最后一行 -I:插入一条规则,原本该位置上的规则会往后顺序移动,没有指定编号则为1 -D:从规则链中删除一条规则,要么输入完整的规则,或者指定规则编号加以删除 -R:替换某条规则,规则替换不会改变顺序,而且必须指定编号。 -P:设置某条规则链的默认动作 -nL:-L、-n,查看当前运行的防火墙规则列表 chain名:指定规则表的哪个链,如INPUT、OUTPUT、FORWARD、PREROUTING等 [规则编号]:插入、删除、替换规则时用,--line-numbers显示号码 [-i|o 网卡名称]:i是指定数据包从哪块网卡进入,o是指定数据包从哪块网卡输出 [-p 协议类型]:可以指定规则应用的协议,包含tcp、udp和icmp等 [-s 源IP地址]:源主机的IP地址或子网地址 [--sport 源端口号]:数据包的IP的源端口号 [-d目标IP地址]:目标主机的IP地址或子网地址 [--dport目标端口号]:数据包的IP的目标端口号 -m:extend matches,这个选项用于提供更多的匹配参数,如: -m state --state ESTABLISHED,RELATED -m tcp --dport 22 -m multiport --dports 80,8080 -m icmp --icmp-type 8 [-j 动作]:处理数据包的动作,包括ACCEPT、DROP、REJECT等iptables命令 iptables -nL 查看当前iptables filter规则 iptables -nL 等价于iptables -nL -t filter iptables -nL -t nat查看nat规则iptables -nL --line-numbers 显示规则链编号iptables -D FORWARD 2删除FORWARD链第二条规则iptables -A INPUT -j REJECT --reject-with icmp-host-prohibited拒绝所有报文,并回应主机已被封锁--reject-with icmp-host-prohibited 拒绝数据包回应一条icmp告知主机被封锁--reject-with tcp-reset 拒绝数据包 tcp回应rst--reject-with icmp-net-prohibited 拒绝数据包回应一条icmp告知网络禁止 实现在外界和本机访问主机的80端口就相当于访问8080端口12iptables -t nat -A PREROUTING -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 8080iptables -t nat -A OUTPUT -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 8080 设置input默认丢包,forward和output默认放通123iptables -P INPUT DROPiptables -P FORWARD ACCEPTiptables -P OUTPUT ACCEPT iptables -m state --state NEW,ESTABLISHED1iptables -I INPUT 3 -p icmp -m state --state NEW -j ACCEPT 背景:策略放在主机A, 默认策略是INPUT丢包,没有安全策略情况下,主机A和主机B可以相互ping通测试结果:执行上述命令后效果为:主机A ping主机B ping不通;主机B ping主机A能通一个包 1iptables -I INPUT 3 -p icmp -m state --state ESTABLISHED -j ACCEPT 背景:策略放在主机A, 默认策略是INPUT丢包测试结果:主机A ping主机B能ping通,主机B ping主机A并不能ping通 1iptables -I INPUT 3 -p icmp -m state --state RELATED -j ACCEPT 背景:策略放在主机A, 默认策略是INPUT丢包测试结果:主机A和主机B均不通 1iptables -I INPUT 3 -s 10.30.10.141 -p tcp -m state --state NEW -j ACCEPT 
背景:策略放在主机A, 默认策略是INPUT丢包,没有安全策略情况下,主机A和主机B可以相互ssh连接测试结果:主机A无法 telnet 主机B(10.30.10.141) 22端口, 主机B telnet 主机A 可完成三次握手 1iptables -I INPUT 3 -s 10.30.10.141 -p tcp -m state --state ESTABLISHED -j ACCEPT 背景:策略放在主机A, 默认策略是INPUT丢包,没有安全策略情况下,主机A和主机B可以相互ssh连接测试结果:主机A可正常ssh主机B,主机B无法ssh主机A 1iptables -I INPUT 3 -s 10.30.10.141 -p tcp -m state --state NEW,ESTABLISHED -j ACCEPT 背景:策略放在主机A, 默认策略是INPUT丢包,没有安全策略情况下,主机A和主机B可以相互ssh连接测试结果:主机A和主机B可以相互ssh连接 其它说明指定要匹配包的状态,当前有4种状态可用:INVALID,ESTABLISHED,NEW和RELATED。 INVALID意味着这个包没有已知的流或连接与之关联,也可能是它包含的数据或包头有问题。 ESTABLISHED意思是包是完全有效的,而且属于一个已建立的连接,这个连接的两端都已经有数据发送 RELATED说明包正在建立一个新的连接,这个连接是和一个已建立的连接相关的。比如,FTP data transfer,ICMP error 和一个TCP或UDP连接相关。 NEW状态可以理解成允许首包建立会话,这里首包不是一个包的意思(icmp是第一个包的意思、TCP是三次握手的意思)。 10.30.0.0/16 网段允许访问本机的80和443端口1iptables -A INPUT -s 10.30.0.0/16 -p tcp -m multiport --dports 80,443 -j ACCEPT 10.30.10.0/24网段所有报文均不允许访问本机tcp的80端口,直接丢弃数据包1iptables -A INPUT -s 10.30.10.0/24 -p tcp -m tcp --dport 80 -j DROP 完全信任主机10.30.10.301iptables -A INPUT -s 10.30.10.30 -j ACCEPT 放行lo口所有规则并且插入到规则2位置1iptables -I INPUT 2 -i lo -j ACCEPT 当有人连接22端口时候产生日志,并对日志写入进行限制1iptables -R INPUT 1 -p tcp --dport 22 -m limit --limit 3/minute --limit-burst 8 -j LOG iptables -t nat -A PREROUTING -i eth0 -d 202.110.123.100 -j DNAT --to 10.30.10.30目的NAT,访问202.110.123.100,最终相当于访问10.30.10.30 iptables -t nat -A POSTROUTING -o eth0 -s 1.1.1.1 -j SNAT --to 202.110.123.200源NAT,1.1.1.1内网机器上外网,其报文源地址全部变成 202.110.123.200 iptables -t nat -A POSTROUTING -o eth0 -s 192.168.1.200 -j MASQUERADE源NAT,192.168.1.200内网机器上外网,其报文源地址根据出接口情况进行变更 ubuntu iptables规则开机启动步骤1:将规则保存到1iptables-save > /etc/iptables.up.rules 步骤2:系统启动时候加载这些配置,vim /etc/network/interfaces;在最后一行加入12345root@controller:~# cat /etc/network/interfaces# interfaces(5) file used by ifup(8) and ifdown(8)auto loiface lo inet loopbackpre-up iptables-restore < 
/etc/iptables.up.rules","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"iptables","slug":"iptables","permalink":"http://vcpu.me/tags/iptables/"},{"name":"netfilter框架原理","slug":"netfilter框架原理","permalink":"http://vcpu.me/tags/netfilter框架原理/"}]},{"title":" vxlan","slug":"vxlan1","date":"2017-07-19T10:00:00.000Z","updated":"2017-07-19T05:43:20.000Z","comments":true,"path":"vxlan1/","link":"","permalink":"http://vcpu.me/vxlan1/","excerpt":"云计算网络要求虚拟机迁移(1)二层网络且链路冗余可靠虚拟机迁移从一个物理机迁移到另一个物理机,要求虚拟机不间断业务;因此虚拟机迁移时候必须要保证IP和MAC等网络参数维持不变,此要求只有在迁移前后的两台物理机处于同一个二层网络时才能满足;而且要求网络本身能多链路冗余和可靠。(2)二层网络下STP(网络生成树)协议复杂,且搞不定大规模网络(3)各个厂商虚拟化网络技术,虽然可以简化拓扑,但网络规模和灵活性上欠缺,适合小规模网络构建,例如数据中心内部网络(4)大规模网络扩展TRILL/SPB/FabricPath/VPLS等技术,可解决上述问题,但是对网络硬件有要求,部署成本高 虚拟机规模受网络规格限制大二层网络环境下,所有交换机均需要记录下所有地址和接口映射;一般场景下核心交换机可满足此需求,但是针对于接入交换机却搞不定这件事情;当然你也可以提升所有交换机档次,那么网络建设成本也相应提升。 网络广播隔离(1)主流的vlan技术,会限制vlan数目为4094个(2)vlan技术的静态配置型技术,这样使得数据中心网络几乎为所有vlan被允许通过,导致任何一个vlan 的未知广播数据会在整网泛滥,无节制消耗网络交换能力和带宽。(3)对于小规模的云计算虚拟化环境,现有的网络技术如虚拟机接入感知(VEPA/802.1Qbg)、数据中心二层网络扩展(IRF/vPC/TRILL/FabricPath)、数据中心间二层技术(OTV/EVI/TRILL)等可以很好的满足业务需求,上述限制不成为瓶颈。然而,完全依赖于物理网络设备本身的技术改良,目前看来并不能完全解决大规模云计算环境下的问题。","text":"云计算网络要求虚拟机迁移(1)二层网络且链路冗余可靠虚拟机迁移从一个物理机迁移到另一个物理机,要求虚拟机不间断业务;因此虚拟机迁移时候必须要保证IP和MAC等网络参数维持不变,此要求只有在迁移前后的两台物理机处于同一个二层网络时才能满足;而且要求网络本身能多链路冗余和可靠。(2)二层网络下STP(网络生成树)协议复杂,且搞不定大规模网络(3)各个厂商虚拟化网络技术,虽然可以简化拓扑,但网络规模和灵活性上欠缺,适合小规模网络构建,例如数据中心内部网络(4)大规模网络扩展TRILL/SPB/FabricPath/VPLS等技术,可解决上述问题,但是对网络硬件有要求,部署成本高 虚拟机规模受网络规格限制大二层网络环境下,所有交换机均需要记录下所有地址和接口映射;一般场景下核心交换机可满足此需求,但是针对于接入交换机却搞不定这件事情;当然你也可以提升所有交换机档次,那么网络建设成本也相应提升。 网络广播隔离(1)主流的vlan技术,会限制vlan数目为4094个(2)vlan技术的静态配置型技术,这样使得数据中心网络几乎为所有vlan被允许通过,导致任何一个vlan 的未知广播数据会在整网泛滥,无节制消耗网络交换能力和带宽。(3)对于小规模的云计算虚拟化环境,现有的网络技术如虚拟机接入感知(VEPA/802.1Qbg)、数据中心二层网络扩展(IRF/vPC/TRILL/FabricPath)、数据中心间二层技术(OTV/EVI/TRILL)等可以很好的满足业务需求,上述限制不成为瓶颈。然而,完全依赖于物理网络设备本身的技术改良,目前看来并不能完全解决大规模云计算环境下的问题。 
so you can use vxlan云计算,虚拟化服务器迁移不改变IP地址,也不用修改主机路由等;这种使用场景二层网络可满足,传统网络通常采用VLAN进行通信隔离和广播隔离实现2层网络;Vxlan是实现如何在三层网络中进行二层传输的overlay技术。Vxlan技术是SDN解决方案中最流行的技术。 vxlan & openflow different?overlay只是一组组网方案,openflow是一个控制协议;实际使用网络中是通过openflow协议控制vSwitch构建overlay网络。 why vxlan(overlay)?云计算需要弹性的二层网络,这种弹性overlay网络可满足overlay网络的本质是在三层网络中实现二层网络的扩展。其也就具备了三层网络的优点。三层网络优点如下:(1)三层网络路由方式转发,突破网络结构限制(2)具有良性大规模扩展能力(3)对网络设备本身无要求(4)故障自愈能力强(5)负责均衡能力强 overlay组网,设备互联采用三层,ARP表不需要泛红到全网,tor交换机维护mac地址表页也会很小vxlan可以解决度租户问题overlay组网避免二层带来的广播、组播、单播问题vlan报文头 & vxlan报文头vlan报文结构 vxlan报文结构 vxlan封装端口默认是4798支持多达16M((2^24-1)/1024^2)租户vxlan数据平面 & 控制平面(1)数据平面-隧道机制已经知道,VTEP为虚拟机的数据包加上了层包头,这些新的报头之有在数据到达目的VTEP后才会被去掉。中间路径的网络设备只会根据外层包头内的目的地址进行数据转发,对于转发路径上的网络来说,一个Vxlan数据包跟一个普通IP包相比,除了个头大一点外没有区别。 由于VXLAN的数据包在整个转发过程中保持了内部数据的完整,因此VXLAN的数据平面是一个基于隧道的数据平面。 (2) 控制平面—-改进的二层协议 VXLAN不会在虚拟机之间维持一个长连接,所以VXLAN需要一个控制平面来记录对端地址可达情况。控制平面的表为(VNI,内层MAC,外层vtep_ip)。Vxlan学习地址的时候仍然保存着二层协议的特征,节点之间不会周期性的交换各自的路由表,对于不认识的MAC地址,VXLAN依靠组播来获取路径信息(如果有SDN Controller,可以向SDN单播获取)。 另一方面,VXLAN还有自学习的功能,当VTEP收到一个UDP数据报后,会检查自己是否收到过这个虚拟机的数据,如果没有,VTEP就会记录源vni/源外层ip/源内层mac对应关系,避免组播学习 vxlan实验同网段ping主机虚拟主机1123456789101112root@controller-VirtualBox:~# cat vxlan1.sh ovs-vsctl add-br br0ovs-vsctl add-br br1ovs-vsctl add-port br0 enp0s8ifconfig enp0s8 0 ifconfig br0 192.168.55.151/24 uproute add default gw 192.168.55.254 br0ovs-vsctl add-port br1 vx1 -- set interface vx1 type=vxlan options:remote_ip=192.168.55.188ifconfig br1 10.0.0.1/24 up 虚拟主机2 1234567891011root@controller-VirtualBox:~# cat vxlan1.sh ovs-vsctl add-br br0ovs-vsctl add-br br1ovs-vsctl add-port br0 enp0s8ifconfig enp0s8 0 ifconfig br0 192.168.55.188/24 uproute add default gw 192.168.55.254 br0ovs-vsctl add-port br1 vx1 -- set interface vx1 type=vxlan options:remote_ip=192.168.55.151ifconfig br1 10.0.0.2/24 up 此实验下vxlan ARP数据包处理过程步骤1:虚拟机1中ping 
虚拟机2的10.0.0.2,arp表中无10.0.0.2的MAC地址,虚拟机1广播ARP请求步骤2:br1即VTEP1会将报文封装起来。封装时候目的IP是确定的,源ip会经过路由抉择进行选择(VNI、源地址为VTEP1地址192.168.55.151、目的地址为VTEP2地址192.168.55.188)步骤3:被封装流量经过物理网络到达VTEP2步骤4:VTEP2接受到封装后报文,获取记录(VNI、内层源MAC、内层源IP),解除封装并在VNI中广播步骤5:虚拟机2收到广播流量后,br1回复ARP请求,br1即VETP2会将ARP回复报文进行封装,因为回复数据要发给192.168.55.151,经过路由抉择发送源ip为192.168.55.188步骤6:虚拟机1VTEP1收到报文后,学习收到报文的内层源MAC、内层源IP,解封将流量发给虚拟机1的br1步骤7:br1获取MAC地址,ARP交互结束 vxlan实验不同网段ping主机不同网断vxlan只需要将网关设置为开启vxlan的接口br1虚拟主机112345678910111213root@controller-VirtualBox:~# cat vxlan1.sh ovs-vsctl add-br br0ovs-vsctl add-br br1ovs-vsctl add-port br0 enp0s8ifconfig enp0s8 0 ifconfig br0 192.168.55.151/24 uproute add default gw 192.168.55.254 br0ovs-vsctl add-port br1 vx1 -- set interface vx1 type=vxlan options:remote_ip=192.168.55.188ifconfig br1 10.0.0.1/24 uproute add -net 10.0.1.0 netmask 255.255.255.0 gw 10.0.0.1 dev br1 虚拟主机2123456789101112root@controller-VirtualBox:~# cat vxlan1.sh ovs-vsctl add-br br0ovs-vsctl add-br br1ovs-vsctl add-port br0 enp0s8ifconfig enp0s8 0 ifconfig br0 192.168.55.188/24 uproute add default gw 192.168.55.254 br0ovs-vsctl add-port br1 vx1 -- set interface vx1 type=vxlan options:remote_ip=192.168.55.151ifconfig br1 10.0.1.1/24 uproute add -net 10.0.0.0 netmask 255.255.255.0 gw 10.0.1.1 dev br1 来自牛人的总结总要有结尾,此总结来自 http://www.jianshu.com/p/d8351b6bf41b 1、Overlay网络的优点简单说就两句话:一是在三层网络中利用封装技术提高二层网络扩展性的同时规避了传统二层网络的种种弊端;二是通过引入VNID的概念,满足了云计算多租户vlan不够,以及租户间网络隔离的问题。 2、Overlay技术的诞生实际就是为了解决云计算大环境下传统网络的种种问题,从它的技术构想到落地方案都是按照云计算的思路来的,所以对于一个规模较小且相对稳定的网络环境中是没有必要用overlay技术的,用了反而把网络搞复杂了。 3、Overlay的所有技术路线中,VXLAN的运用最广泛,得到了最多的主流网络厂商的支持,未来很长一段时间里很可能成为overlay技术的代名词; 4、对于overlay的软硬件模式之争,是仁者见仁、智者见智的,我觉得目前来看软件模式更加容易落地,如果虚拟化平台是开源的需要自己去研发vSwitch是有一些落地难度的,但是vmware NSX相对还是比较成熟的,但是也需要谨慎,最好在一些测试环境或者非重要的环境中先试用。而硬件的overlay方式存在一个所有硬件交换机都需要更换的问题,成本比较高,设备的更换的风险也高,而且硬件overlay方案的捆绑太厉害,需要谨慎考虑。不过我在测试了cisco的ACI后感觉还是很有特点的,后面会单独把ACI拿出来介绍一下,如果是一个完全全新的数据中心倒是可以考虑尝试硬件的overlay模式。 
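上面的 VXLAN 封装过程,本质上是在 UDP(默认目的端口 4789)之上再加 8 字节的 VXLAN 头,其中 24 bit 的 VNI 标识租户。下面用一个简化的 Python 片段示意 VXLAN 头的封装与解析(仅为示意,不涉及真实网卡与 UDP 收发,字段布局按 RFC 7348):

```python
import struct

VXLAN_PORT = 4789        # VXLAN 默认 UDP 目的端口
VXLAN_FLAG_VNI = 0x08    # flags 字节中的 I 位,表示 VNI 有效

def vxlan_encap(vni, inner_frame):
    """在内层以太网帧前加 8 字节 VXLAN 头: flags(1) + 保留(3) + VNI(3) + 保留(1)"""
    assert 0 <= vni < 1 << 24          # VNI 只有 24 bit
    header = struct.pack("!B3s3sB", VXLAN_FLAG_VNI, b"\x00" * 3,
                         vni.to_bytes(3, "big"), 0)
    return header + inner_frame

def vxlan_decap(packet):
    """解析 VXLAN 头,返回 (vni, 内层帧)"""
    assert packet[0] & VXLAN_FLAG_VNI, "I 位未置位,VNI 无效"
    vni = int.from_bytes(packet[4:7], "big")
    return vni, packet[8:]

pkt = vxlan_encap(100, b"inner-ethernet-frame")
vni, inner = vxlan_decap(pkt)
print(vni, inner)   # 100 b'inner-ethernet-frame'
```

可以看到,对中间网络设备而言,这个 8 字节头加内层帧只是一段普通 UDP 载荷,这正是"数据平面是隧道"的含义。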
5、在网络的世界里并不是说一个技术叫好就会叫座,比如当年的ATM和Ethernet之争,会有很多客观因素来决定的,比如用户使用习惯,技术落地难度等等,所以对于overlay的技术走向或者SDN的技术走向,还是有很多不确定因素的,各个流派间的博弈也是愈演愈烈,最终鹿死谁手真的很难说,因此这也是对我们用户来说比较头痛的事情,一个正在发展且没有什么标准化的技术,落地的风险比较大,只能抱着边尝试边观望的态度,不轻易落地也不能轻易掉队。","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"overlay","slug":"overlay","permalink":"http://vcpu.me/tags/overlay/"},{"name":"云计算组网","slug":"云计算组网","permalink":"http://vcpu.me/tags/云计算组网/"},{"name":"vxlan原理","slug":"vxlan原理","permalink":"http://vcpu.me/tags/vxlan原理/"},{"name":"vxlan实验","slug":"vxlan实验","permalink":"http://vcpu.me/tags/vxlan实验/"}]},{"title":" 虚拟网络","slug":"network1","date":"2017-07-14T10:00:00.000Z","updated":"2017-07-14T07:31:24.000Z","comments":true,"path":"network1/","link":"","permalink":"http://vcpu.me/network1/","excerpt":"network namespace是什么?当前系统的网络空间,拥有单独的网络协议栈,网络设备、网卡、路由表、转发表、ARP表、ip地址表、iptables、socket等与网络有关的组建均独立,就像进入了另一个网络环境且该网络空间可以实现交换机、路由器、防火墙等功能 使用netns在linux系统上搭建网络网络描述: 该网络由四部分组成:外网、虚拟网络空间net0、虚拟网络空间net1、虚拟网络空间net2 net0和net2分别为网段10.0.1.0/24 10.0.2.0/24的网络空间,并且接入网桥 net1为网络空间,分别以接口eth0和eth1接入虚拟网桥,分别和net0和net2同一个网段 net1为网络空间,以eth2接入网桥,和物理接口enp0s8同一个网段 enp0s8为虚拟机的桥接物理网卡(虚拟机接口混杂模式开启) 192.168.55.165物理机器,接入虚拟机虚拟bridge另外一端均属于ubuntun虚拟机上搭建虚拟网络 淡黄色部分属于虚拟网桥 net0和net2的默认网关是net1 net1的默认网关是办公网络交换机192.168.55.254 经过net1出去流量做snat 搭建网络最终效果: net0和net1和net2 网络空间中均可以上外网","text":"network namespace是什么?当前系统的网络空间,拥有单独的网络协议栈,网络设备、网卡、路由表、转发表、ARP表、ip地址表、iptables、socket等与网络有关的组建均独立,就像进入了另一个网络环境且该网络空间可以实现交换机、路由器、防火墙等功能 使用netns在linux系统上搭建网络网络描述: 该网络由四部分组成:外网、虚拟网络空间net0、虚拟网络空间net1、虚拟网络空间net2 net0和net2分别为网段10.0.1.0/24 10.0.2.0/24的网络空间,并且接入网桥 net1为网络空间,分别以接口eth0和eth1接入虚拟网桥,分别和net0和net2同一个网段 net1为网络空间,以eth2接入网桥,和物理接口enp0s8同一个网段 enp0s8为虚拟机的桥接物理网卡(虚拟机接口混杂模式开启) 192.168.55.165物理机器,接入虚拟机虚拟bridge另外一端均属于ubuntun虚拟机上搭建虚拟网络 淡黄色部分属于虚拟网桥 net0和net2的默认网关是net1 net1的默认网关是办公网络交换机192.168.55.254 经过net1出去流量做snat 搭建网络最终效果: net0和net1和net2 网络空间中均可以上外网 
搭建命令123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081root@controller-VirtualBox:/home/controller# cat br2.sh#new bridge & startip addr flush dev enp0s8brctl addbr br0brctl addif br0 enp0s8ifconfig br0 192.168.55.244/24 upip route add default via 192.168.55.254# ip link add br0 type bridge# ip link set dev br0 up#add net0:eth0 <-> br0:tap0 ip link add net0_eth0 type veth peer name tap0 ip netns add net0 ip link set dev net0_eth0 netns net0 ip netns exec net0 ip link set dev net0_eth0 name eth0 ip netns exec net0 ip addr add 10.0.1.1/24 dev eth0 ip netns exec net0 ip link set dev eth0 up ip link set dev tap0 master br0 ip link set dev tap0 up#add net1:eth0 <-> br0:tap1 ip link add net1_eth0 type veth peer name tap1 ip netns add net1 ip link set dev net1_eth0 netns net1 ip netns exec net1 ip link set dev net1_eth0 name eth0 ip netns exec net1 ip addr add 10.0.1.2/24 dev eth0 ip netns exec net1 ip link set dev eth0 up ip link set dev tap1 master br0 ip link set dev tap1 up#add net2:eth0 <-> br0:tap2 ip link add net2_eth0 type veth peer name tap2 ip netns add net2 ip link set dev net2_eth0 netns net2 ip netns exec net2 ip link set dev net2_eth0 name eth0 ip netns exec net2 ip addr add 10.0.2.1/24 dev eth0 ip netns exec net2 ip link set dev eth0 up ip link set dev tap2 master br0 ip link set dev tap2 up#connect net2:eth1 <-> br0:tap3 ip link add net2_eth1 type veth peer name tap3 ip link set dev net2_eth1 netns net1 ip netns exec net1 ip link set dev net2_eth1 name eth1 ip netns exec net1 ip addr add 10.0.2.2/24 dev eth1 ip netns exec net1 ip link set dev eth1 up ip link set dev tap3 master br0 ip link set dev tap3 up#add route#ip netns exec net2 ip route add 10.0.1.0/24 via 10.0.2.2 dev eth0#ip netns exec net0 ip route add 10.0.2.0/24 via 10.0.1.2 dev eth0ip netns exec net2 route add default gw 10.0.2.2ip netns exec net0 route add default gw 10.0.1.2#open gateway 
forwardip netns exec net1 sysctl net.ipv4.ip_forward=1#connect net1:eth2 <-> br0:tap4 ip link add net1_eth2 type veth peer name tap4 ip link set dev net1_eth2 netns net1 ip netns exec net1 ip link set dev net1_eth2 name eth2 ip netns exec net1 ip addr add 192.168.55.233/24 dev eth2 ip netns exec net1 ip link set dev eth2 up ip link set dev tap4 master br0 ip link set dev tap4 up#add net1 gateway 192.168.55.254ip netns exec net1 route add default gw 192.168.55.254#add snatip netns exec net1 iptables -t nat -A POSTROUTING -s 10.0.1.0/24 -o eth2 -j MASQUERADEip netns exec net1 iptables -t nat -A POSTROUTING -s 10.0.2.0/24 -o eth2 -j MASQUERADEip netns exec net0 ping -c 3 8.8.8.8ip netns exec net2 ping -c 3 8.8.8.8 Linux虚拟网络设备tun/tap网络设备概念?网络设备工作在驱动和协议栈之间,负责衔接它们之间的交互。它帮助驱动和协议栈只关注本身事情。 虚拟网络设备和物理物理设备区别 ?物理网卡其实就是物理设备,比如物理网卡eth0,它分别连接内核协议栈河外面的物理网络,从物理网络收到的数据包会通过接口转发给内核协议栈,从内核协议栈发出包也会通过物理设备转发最终通过物理网络发出去 虚拟设备和物理设备对于内核网络设备管理模块来讲地位一致且无区别。只不过物理设备往往把数据包送到外网,虚拟设备要看具体实现了。 tun/tap ?tun/tap连接的应用程序,可以理解其为运行的另一台服务器,这台服务器可用于加密、隧道等数据加工;处理完成后从新借用一个地址将处理完后数据包封装,发出。 1234567891011121314151617181920212223242526272829+----------------------------------------------------------------+| || +--------------------+ +--------------------+ || | User Application A | | User Application B |<-----+ || +--------------------+ +--------------------+ | || | 1 | 5 | ||...............|......................|...................|.....|| ↓ ↓ | || +----------+ +----------+ | || | socket A | | socket B | | || +----------+ +----------+ | || | 2 | 6 | ||.................|.................|......................|.....|| ↓ ↓ | || +------------------------+ 4 | || | Newwork Protocol Stack | | || +------------------------+ | || | 7 | 3 | ||................|...................|.....................|.....|| ↓ ↓ | || +----------------+ +----------------+ | || | eth0 | | tun0 | | || +----------------+ +----------------+ | ||192.168.55.188 | | 10.0.1.1 | || | 8 +---------------------+ || | 
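net1 里那两条 MASQUERADE 规则做的事情,概念上就是维护一张 (内网源地址, 源端口) → (出口地址, 新端口) 的转换表:出方向改写源并记录映射,回包按表还原。下面是一个极简的 Python 示意(端口从 30000 起顺序分配只是假设的分配方式,真实内核实现要复杂得多):

```python
class SimpleSnat:
    """极简 SNAT 示意: 出方向改写源地址/端口并记录映射,回包按映射还原。"""
    def __init__(self, public_ip, port_base=30000):
        self.public_ip = public_ip
        self.next_port = port_base
        self.out_map = {}   # (私网ip, 私网端口) -> 公网端口
        self.in_map = {}    # 公网端口 -> (私网ip, 私网端口)

    def outbound(self, src_ip, src_port):
        key = (src_ip, src_port)
        if key not in self.out_map:          # 新连接: 分配一个公网端口
            self.out_map[key] = self.next_port
            self.in_map[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out_map[key]

    def inbound(self, dst_port):
        return self.in_map[dst_port]         # 回包: 按端口还原内网地址

nat = SimpleSnat("192.168.55.233")
print(nat.outbound("10.0.1.1", 5000))   # ('192.168.55.233', 30000)
print(nat.inbound(30000))               # ('10.0.1.1', 5000)
```

这也解释了为什么 net0、net2 里的主机不需要公网地址也能 ping 通 8.8.8.8:外部只看到 net1 的 eth2 地址。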
|+----------------|-----------------------------------------------+ ↓ Physical Network 上述图表述的应用场景是VPN场景:发到10.0.1.0/24 网络数据通过应用程序B这个隧道,利用192.168.55.188发出到远端。 tun/tap场景下数据包流程1.应用程序A是一个普通的程序,通过socket A发送了一个数据包,假设这个数据包的目的IP地址是10.0.1.22.socket将这个数据包丢给协议栈3.协议栈根据数据包的目的IP地址,匹配本地路由规则,知道这个数据包应该由tun0出去,于是将数据包交给tun04.tun0收到数据包之后,发现另一端被进程B打开了,于是将数据包丢给了进程B5.进程B收到数据包之后,做一些跟业务相关的处理,然后构造一个新的数据包,将原来的数据包嵌入在新的数据包中,最后通过socket B将数据包转发出去,这时候新数据包的源地址变成了eth0的地址,而目的IP地址变成了一个其它的地址,比如是192.168.55.2546.socket B将数据包丢给协议栈7.协议栈根据本地路由,发现这个数据包应该要通过eth0发送出去,于是将数据包交给eth08.eth0通过物理网络将数据包发送出去 192.168.55.254收到数据包后,打开数据包,取出原始数据,转发给10.0.1.2,收到10.0.1.2应答后,从新构造应答包并讲原始报文封装里面;走路由返回给程序B,应用程序B解封装,最终将数据包回复给应用程序A 至此一个完整的隧道交互完成了,tun/tap承担了奖协议栈数据包从新交付到应用程序作用,使得数据包有机会在用户态进行封装。 tun另一端是一个用户程序tun0是一个Tun/Tap虚拟设备,从上图中可以看出它和物理设备eth0的差别,它们的一端虽然都连着协议栈,但另一端不一样,eth0的另一端是物理网络,这个物理网络可能就是一个交换机,而tun0的另一端是一个用户层的程序,协议栈发给tun0的数据包能被这个应用程序读取到,并且应用程序能直接向tun0写数据。 tun和tap区别用户层程序通过tun设备只能读写IP数据包,而通过tap设备能读写链路层数据包,类似于普通socket和raw socket的差别一样,处理数据包的格式不一样。 实例解析tun/tap可用于linux用户态从内核查路由重新收到数据包,处理完成后再发出12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364#include <net/if.h>#include <sys/ioctl.h>#include <sys/stat.h>#include <fcntl.h>#include <string.h>#include <sys/types.h>#include <linux/if_tun.h>#include<stdlib.h>#include<stdio.h>#include <unistd.h>int tun_alloc(int flags){ struct ifreq ifr; int fd, err; char *clonedev = \"/dev/net/tun\"; if ((fd = open(clonedev, O_RDWR)) < 0) { return fd; } memset(&ifr, 0, sizeof(ifr)); ifr.ifr_flags = flags; if ((err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0) { close(fd); return err; } printf(\"Open tun/tap device: %s for reading...\\n\", ifr.ifr_name); return fd;}int main(){ int tun_fd, nread; char buffer[1500]; /* Flags: IFF_TUN - TUN device (no Ethernet headers) * IFF_TAP - TAP device * IFF_NO_PI - Do not provide packet information */ tun_fd = tun_alloc(IFF_TUN | IFF_NO_PI); if (tun_fd < 0) { perror(\"Allocating interface\"); exit(1); } 
while (1) { nread = read(tun_fd, buffer, sizeof(buffer)); if (nread < 0) { perror(\"Reading from interface\"); close(tun_fd); exit(1); } printf(\"Read %d bytes from tun/tap device\\n\", nread); } return 0;} 执行步骤如下:(1)linux运行窗口1编译运行tun12root@controller-VirtualBox:/home/controller# gcc tun.c -o tunroot@controller-VirtualBox:/home/controller# ./tun (2)linux运行窗口2 查看建立的网络设备tun0123456789101112131415root@controller-VirtualBox:/home/controller# ip addr1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever2: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000 link/ether 08:00:27:5f:1b:a9 brd ff:ff:ff:ff:ff:ff inet 192.168.55.188/24 brd 192.168.55.255 scope global dynamic enp0s8 valid_lft 9908sec preferred_lft 9908sec inet6 fe80::174:582b:9b7c:3df4/64 scope link valid_lft forever preferred_lft forever4: tun0: <POINTOPOINT,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 500 link/none (3)linux运行窗口2给tun0配置地址并激活1234root@controller-VirtualBox:/home/controller# ip addr add 10.0.1.1/24 dev tun0root@controller-VirtualBox:/home/controller#root@controller-VirtualBox:/home/controller#root@controller-VirtualBox:/home/controller# ip link set tun0 up (4)linux运行窗口2 ping 10.0.1.2 (5)linux运行窗口1 查看出现下面效果,用户态socket已经通过tun0接收到ping包了1234567891011root@controller-VirtualBox:/home/controller# ./tunOpen tun/tap device: tun0 for reading...Read 48 bytes from tun/tap deviceRead 48 bytes from tun/tap deviceRead 48 bytes from tun/tap deviceRead 84 bytes from tun/tap deviceRead 84 bytes from tun/tap deviceRead 84 bytes from tun/tap deviceRead 84 bytes from tun/tap deviceRead 84 bytes from tun/tap deviceRead 84 bytes from tun/tap 
device","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"network namespace","slug":"network-namespace","permalink":"http://vcpu.me/tags/network-namespace/"},{"name":"网络设备tun/tap","slug":"网络设备tun-tap","permalink":"http://vcpu.me/tags/网络设备tun-tap/"}]},{"title":" 网络基本命令","slug":"cmd1","date":"2017-07-13T10:00:00.000Z","updated":"2017-07-13T10:59:59.000Z","comments":true,"path":"cmd1/","link":"","permalink":"http://vcpu.me/cmd1/","excerpt":"ARP相关命令ip n显示ARP表全部清除linux arp表项1arp -n|awk '/^[1-9]/{system("arp -d "$1)}' 清除接口eth0所有的mac1ip neigh flush dev eth0 arp -v 额外显示接口类型和arp表项统计信息1234root@controller-VirtualBox:/home/controller# arp -vAddress HWtype HWaddress Flags Mask Iface10.0.1.2 ether 22:dc:c1:9c:8c:b4 C eth0Entries: 1 Skipped: 0 Found: 1 arp -a 显示全部ARP缓存12root@controller-VirtualBox:/home/controller# arp -a? (10.0.1.2) at 22:dc:c1:9c:8c:b4 [ether] on eth0 arp -n 可快速查询ARP缓存1234root@controller-VirtualBox:/home/controller# arp -nAddress HWtype HWaddress Flags Mask Iface10.0.1.2 ether 22:dc:c1:9c:8c:b4 C eth0root@controller-VirtualBox:/home/controller#","text":"ARP相关命令ip n显示ARP表全部清除linux arp表项1arp -n|awk '/^[1-9]/{system("arp -d "$1)}' 清除接口eth0所有的mac1ip neigh flush dev eth0 arp -v 额外显示接口类型和arp表项统计信息1234root@controller-VirtualBox:/home/controller# arp -vAddress HWtype HWaddress Flags Mask Iface10.0.1.2 ether 22:dc:c1:9c:8c:b4 C eth0Entries: 1 Skipped: 0 Found: 1 arp -a 显示全部ARP缓存12root@controller-VirtualBox:/home/controller# arp -a? 
(10.0.1.2) at 22:dc:c1:9c:8c:b4 [ether] on eth0 arp -n 可快速查询ARP缓存1234root@controller-VirtualBox:/home/controller# arp -nAddress HWtype HWaddress Flags Mask Iface10.0.1.2 ether 22:dc:c1:9c:8c:b4 C eth0root@controller-VirtualBox:/home/controller# arptables增加一条规则,限制10.0.10.220 ARP packet进入到环境中1arptables -A INPUT -s 10.0.10.220 -j DROP 删除arotables规则1arptables -D INPUT -s 10.0.10.220 -j DROP 查看arptables 规则1arptables -L arpwatcharpwatch命令用来监听网络上arp的记录。-d:启动排错模式;-f<记录文件>:设置存储ARP记录的文件,预设为/var/lib/arpwatch/arp.dat;-i<接口>:指定监听ARP的接口,预设的接口为eth0;-r<记录文件>:从指定的文件中读取ARP记录,而不是从网络上监听。 读出监听的ARP地址对应表 必须要杀死arpwatch进程后才才能读出下面数据123456789101112131415161718192021root@controller-VirtualBox:/home/controller# cat /var/lib/arpwatch/arp.dat3c:97:0e:3c:8c:53 192.168.55.155 1499942089 br0e0:69:95:4c:d4:ac 192.168.55.169 1499942233 br028:d2:44:86:46:be 192.168.55.193 1499942137 br068:f7:28:89:49:bd 192.168.55.199 1499942147 br03c:97:0e:3d:22:0c 192.168.55.137 1499942189 br068:f7:28:63:19:4f 192.168.55.192 1499942097 br010:c3:7b:6e:e6:95 192.168.55.174 1499942112 br020:dc:e6:fe:d7:fe 192.168.55.253 1499942035 br0b0:83:fe:6c:ec:a9 192.168.55.135 1499942042 br008:00:27:5f:1b:a9 192.168.55.244 1499942227 br000:15:c6:26:4d:48 192.168.55.254 1499942227 br090:e6:ba:ea:81:93 192.168.55.98 1499942212 br08c:a6:df:98:94:50 192.168.55.197 1499942083 br074:27:ea:54:8b:61 192.168.55.198 1499942096 br03c:97:0e:a7:a8:4f 192.168.55.116 1499942182 br050:7b:9d:f1:6e:47 192.168.55.158 1499942145 br03c:97:0e:3c:8d:fd 192.168.55.143 1499942234 br028:d2:44:bd:02:89 192.168.55.113 1499942210 br0d4:61:fe:2b:73:13 192.168.55.220 1499942214 br068:f7:28:06:1a:02 192.168.55.132 1499942226 br0 路由相关命令route -n12345root@controller-VirtualBox:/home/controller# route -nKernel IP routing tableDestination Gateway Genmask Flags Metric Ref Use Iface0.0.0.0 10.0.1.2 0.0.0.0 UG 0 0 0 eth010.0.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0 ip r 查看路由1234root@controller-VirtualBox:/home/controller# ip rdefault via 192.168.55.254 dev 
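上面 arp -n 的表格输出很适合脚本处理。下面的 Python 片段把它解析成 {ip: mac} 字典(示例输入取自上文输出,仅为示意):

```python
def parse_arp(text):
    """把 `arp -n` 的输出解析为 {ip: mac}; 跳过表头和格式不符的行。"""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # 典型行: 10.0.1.2  ether  22:dc:c1:9c:8c:b4  C  eth0
        if len(fields) >= 3 and fields[0][0].isdigit() and fields[1] == "ether":
            table[fields[0]] = fields[2]
    return table

sample = """Address HWtype HWaddress Flags Mask Iface
10.0.1.2 ether 22:dc:c1:9c:8c:b4 C eth0"""
print(parse_arp(sample))   # {'10.0.1.2': '22:dc:c1:9c:8c:b4'}
```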
br0169.254.0.0/16 dev br0 scope link metric 1000192.168.55.0/24 dev br0 proto kernel scope link src 192.168.55.244 ip r get xx.xx.xx.xx123root@controller-VirtualBox:/home/controller# ip r get 8.8.8.88.8.8.8 via 10.0.1.2 dev eth0 src 10.0.1.1 cache 增加默认路由route add default gw 10.0.1.2ip route add default via 10.0.1.2 ip -s link 查看网络统计数据12345678910111213root@controller-VirtualBox:/home/controller# ip -s link1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 RX: bytes packets errors dropped overrun mcast 0 0 0 0 0 0 TX: bytes packets errors dropped carrier collsns 0 0 0 0 0 05: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 8e:90:7e:35:80:8e brd ff:ff:ff:ff:ff:ff link-netnsid 0 RX: bytes packets errors dropped overrun mcast 4621478 12933 0 17 0 0 TX: bytes packets errors dropped carrier collsns 2720 32 0 0 0 0 ip -s -s link ls eth0 显示具体接口的详细信息12345678910115: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 8e:90:7e:35:80:8e brd ff:ff:ff:ff:ff:ff link-netnsid 0 RX: bytes packets errors dropped overrun mcast 5197968 14512 0 20 0 0 RX errors: length crc frame fifo missed 0 0 0 0 0 TX: bytes packets errors dropped carrier collsns 9034 97 0 0 0 0 TX errors: aborted fifo window heartbeat transns 0 0 0 0 2root@controller-VirtualBox:/home/controller# 地址配置相关命令配置地址ip addr add 10.0.1.1/24 dev eth0 ip addr 显示所有接口信息ip addr show eth0 显示eth0接口信息1234567root@controller-VirtualBox:/home/controller# ip addr show eth05: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 8e:90:7e:35:80:8e brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.0.1.1/24 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::8c90:7eff:fe35:808e/64 scope link valid_lft forever preferred_lft forever ip addr del 
x.x.x.x/mask dev eth0 删除接口eth0地址12345678root@controller-VirtualBox:/home/controller# ip addr del 10.0.1.1/24 dev eth0root@controller-VirtualBox:/home/controller# ip addr1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:005: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 8e:90:7e:35:80:8e brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet6 fe80::8c90:7eff:fe35:808e/64 scope link valid_lft forever preferred_lft forever 关闭和开启接口ip link set eth0 up/down linux网桥相关brctl show 查看网桥配置12345678910root@controller-VirtualBox:/home/controller# brctl showbridge name bridge id STP enabled interfacesbr0 8000.0800275f1ba9 no enp0s8 tap0 tap1 tap2 tap3 tap4### brctl showmacs br0 查看网桥学习的mac表项 brctl showmacs br0查看br0的mac表项12345678root@controller-VirtualBox:/home/controller# brctl showmacs br0port no mac addr is local? ageing timer 1 00:0c:29:1f:73:38 no 202.66 1 00:0c:29:f1:60:5d no 248.12 1 00:0e:c6:cf:da:c7 no 0.00 1 00:15:c6:26:4d:22 no 1.22 1 00:15:c6:26:4d:48 no 0.73 1 00:50:56:9f:cc:e2 no 161.74 brctl showstp bbr0查看接口信息123456789101112root@controller-VirtualBox:/home/controller# brctl showstp br0br0 bridge id 8000.0800275f1ba9 designated root 8000.0800275f1ba9 root port 0 path cost 0 max age 20.00 bridge max age 20.00 hello time 2.00 bridge hello time 2.00 forward delay 15.00 bridge forward delay 15.00 ageing time 300.00 hello timer 0.00 tcn timer 0.00 topology change timer 0.00 gc timer 14.22 flags 虚拟命名空间相关命令增加虚拟网络命名空间ip netns add net0 显示所有的虚拟网络命名空间EULER:~ # ip netns listnet0也可通过查看/var/run/netns目录下的文件来listEULER:~ # ls /var/run/netns/net0 进入虚拟机网络环境ip netns exec net0 command 如EULER:~ # ip netns exec net0 bash #打开虚拟网络环境net0的bash窗口EULER:~ # ip addr #显示所有虚拟网络环境的设备EULER:~ # exit #退出该网络虚拟环境exit 增加一对veth虚拟网卡EULER:~ # ip link add type veth 将veth0添加到net0虚拟网络环境ip link set veth0 netns net0 将虚拟网卡veth1改名并添加到net1虚拟网络环境中ip link set dev veth1 name net1-bridge netns net1 
设置虚拟网络环境net0的veth0设备处于激活状态ip netns exec net0 ip link set veth0 up 为虚拟网络环境net0的veth0设备增加IP地址ip netns exec net0 ip address add 10.0.1.1/24 dev veth0 ip netns 列出当前网络虚拟的namespace1234root@controller-VirtualBox:/home/controller# ip netnsnet2 (id: 2)net1 (id: 1)net0 (id: 0) ls /var/run/netns/ 列出当前网络虚拟的namespaceroot@controller-VirtualBox:/home/controller# ls /var/run/netns/net0 net1 net2 ip netns monitor监控虚拟网络namespace行为root@controller-VirtualBox:/home/controller# ip netns monitordelete ddd","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"arp相关命令","slug":"arp相关命令","permalink":"http://vcpu.me/tags/arp相关命令/"},{"name":"路由相关命令","slug":"路由相关命令","permalink":"http://vcpu.me/tags/路由相关命令/"},{"name":"地址配置相关命令","slug":"地址配置相关命令","permalink":"http://vcpu.me/tags/地址配置相关命令/"},{"name":"虚拟命名空间相关命令","slug":"虚拟命名空间相关命令","permalink":"http://vcpu.me/tags/虚拟命名空间相关命令/"}]},{"title":"二三层转发-基本原理","slug":"switchtech","date":"2017-07-07T10:00:00.000Z","updated":"2017-07-07T03:47:23.000Z","comments":true,"path":"switchtech/","link":"","permalink":"http://vcpu.me/switchtech/","excerpt":"交换机的工作原理1. 交换机根据收到数据帧中的源MAC地址建立该地址同交换机端口的映射,并将其写入MAC地址表中。 2. 交换机将数据帧中的目的MAC地址同已建立的MAC地址表进行比较,以决定由哪个端口进行转发。 3. 如数据帧中的目的MAC地址不在MAC地址表中,则向所有端口转发。这一过程称为泛洪(flood)。 4. 广播帧和组播帧向所有的端口转发。交换机的工作原理(二、三、四层交换原理) 交换机的三个主要功能 学习:以太网交换机了解每一端口相连设备的MAC地址,并将地址同相应的端口映射起来存放在交换机缓存中的MAC地址表中。 转发/过滤:当一个数据帧的目的地址在MAC地址表中有映射时,它被转发到连接目的节点的端口而不是所有端口(如该数据帧为广播/组播帧则转发至所有端口)。 消除回路:当交换机包括一个冗余回路时,以太网交换机通过生成树协议避免回路的产生,同时允许存在后备路径。 交换机的工作特性 1. 交换机的每一个端口所连接的网段都是一个独立的冲突域。 2. 交换机所连接的设备仍然在同一个广播域内,也就是说,交换机不隔绝广播(惟一的例外是在配有VLAN的环境中)。 3. 交换机依据帧头的信息进行转发,因此说交换机是工作在数据链路层的网络设备(此处所述交换机仅指传统的二层交换设备)。 交换机的分类 依照交换机处理帧时不同的操作模式,主要可分为两类: 存储转发:交换机在转发之前必须接收整个帧,并进行错误校检,如无错误再将这一帧发往目的地址。帧通过交换机的转发时延随帧长度的不同而变化。 直通式:交换机只要检查到帧头中所包含的目的地址就立即转发该帧,而无需等待帧全部的被接收,也不进行错误校验。由于以太网帧头的长度总是固定的,因此帧通过交换机的转发时延也保持不变。","text":"交换机的工作原理1. 交换机根据收到数据帧中的源MAC地址建立该地址同交换机端口的映射,并将其写入MAC地址表中。 2. 交换机将数据帧中的目的MAC地址同已建立的MAC地址表进行比较,以决定由哪个端口进行转发。 3. 
如数据帧中的目的MAC地址不在MAC地址表中,则向所有端口转发。这一过程称为泛洪(flood)。 4. 广播帧和组播帧向所有的端口转发。交换机的工作原理(二、三、四层交换原理) 交换机的三个主要功能 学习:以太网交换机了解每一端口相连设备的MAC地址,并将地址同相应的端口映射起来存放在交换机缓存中的MAC地址表中。 转发/过滤:当一个数据帧的目的地址在MAC地址表中有映射时,它被转发到连接目的节点的端口而不是所有端口(如该数据帧为广播/组播帧则转发至所有端口)。 消除回路:当交换机包括一个冗余回路时,以太网交换机通过生成树协议避免回路的产生,同时允许存在后备路径。 交换机的工作特性 1. 交换机的每一个端口所连接的网段都是一个独立的冲突域。 2. 交换机所连接的设备仍然在同一个广播域内,也就是说,交换机不隔绝广播(惟一的例外是在配有VLAN的环境中)。 3. 交换机依据帧头的信息进行转发,因此说交换机是工作在数据链路层的网络设备(此处所述交换机仅指传统的二层交换设备)。 交换机的分类 依照交换机处理帧时不同的操作模式,主要可分为两类: 存储转发:交换机在转发之前必须接收整个帧,并进行错误校检,如无错误再将这一帧发往目的地址。帧通过交换机的转发时延随帧长度的不同而变化。 直通式:交换机只要检查到帧头中所包含的目的地址就立即转发该帧,而无需等待帧全部的被接收,也不进行错误校验。由于以太网帧头的长度总是固定的,因此帧通过交换机的转发时延也保持不变。 二三四层交换机 多种理解的说法: 理解1 二层交换(也称为桥接)是基于硬件的桥接。基于每个末端站点的唯一MAC地址转发数据包。二层交换的高性能可以产生增加各子网主机数量的网络设计。其仍然有桥接所具有的特性和限制。 三层交换是基于硬件的路由选择。路由器和第三层交换机对数据包交换操作的主要区别在于物理上的实施。 四层交换的简单定义是:不仅基于MAC(第二层桥接)或源/目的地IP地址(第三层路由选择),同时也基于TCP/UDP应用端口来做出转发决定的能力。其使网络在决定路由时能够区分应用。能够基于具体应用对数据流进行优先级划分。它为基于策略的服务质量技术提供了更加细化的解决方案。提供了一种可以区分应用类型的方法。 理解2 二层交换机 基于MAC地址 三层交换机 具有VLAN功能 有交换和路由 ///基于IP,就是网络 四层交换机 基于端口,就是应用 理解3 二层交换技术从网桥发展到VLAN(虚拟局域网),在局域网建设和改造中得到了广泛的应用。第二层交换技术是工作在OSI七层网络模型中的第二层,即数据链路层。它按照所接收到数据包的目的MAC地址来进行转发,对于网络层或者高层协议来说是透明的。它不处理网络层的IP地址,不处理高层协议的诸如TCP、UDP的端口地址,它只需要数据包的物理地址即MAC地址,数据交换是靠硬件来实现的,其速度相当快,这是二层交换的一个显著的优点。但是,它不能处理不同IP子网之间的数据交换。传统的路由器可以处理大量的跨越IP子网的数据包,但是它的转发效率比二层低,因此要想利用二层转发效率高这一优点,又要处理三层IP数据包,三层交换技术就诞生了。 三层交换技术的工作原理 第三层交换工作在OSI七层网络模型中的第三层即网络层,是利用第三层协议中的IP包的包头信息来对后续数据业务流进行标记,具有同一标记的业务流的后续报文被交换到第二层数据链路层,从而打通源IP地址和目的IP地址之间的一条通路。这条通路经过第二层链路层。有了这条通路,三层交换机就没有必要每次将接收到的数据包进行拆包来判断路由,而是直接将数据包进行转发,将数据流进行交换 理解4 二层交换技术 二层交换技术是发展比较成熟,二层交换机属数据链路层设备,可以识别数据包中的MAC地址信息,根据MAC地址进行转发,并将这些MAC地址与对应的端口记录在自己内部的一个地址表中。具体的工作流程如下: (1)当交换机从某个端口收到一个数据包,它先读取包头中的源MAC地址,这样它就知道源MAC地址的机器是连在哪个端口上的; (2) 再去读取包头中的目的MAC地址,并在地址表中查找相应的端口; (3) 如表中有与这目的MAC地址对应的端口,把数据包直接复制到这端口上; (4)如表中找不到相应的端口则把数据包广播到所有端口上,当目的机器对源机器回应时,交换机又可以学习一目的MAC地址与哪个端口对应,在下次传送数据时就不再需要对所有端口进行广播了。 不断的循环这个过程,对于全网的MAC地址信息都可以学习到,二层交换机就是这样建立和维护它自己的地址表。 从二层交换机的工作原理可以推知以下三点: 
(1)由于交换机对多数端口的数据进行同时交换,这就要求具有很宽的交换总线带宽,如果二层交换机有N个端口,每个端口的带宽是M,交换机总线带宽超过N×M,那么这交换机就可以实现线速交换; (2) 学习端口连接的机器的MAC地址,写入地址表,地址表的大小(一般两种表示方式:一为BEFFER RAM,一为MAC表项数值),地址表大小影响交换机的接入容量; (3) 还有一个就是二层交换机一般都含有专门用于处理数据包转发的ASIC (Application specific Integrated Circuit)芯片,因此转发速度可以做到非常快。由于各个厂家采用ASIC不同,直接影响产品性能。 以上三点也是评判二三层交换机性能优劣的主要技术参数,这一点请大家在考虑设备选型时注意比较。 路由技术 路由器工作在OSI模型的第三层—网络层操作,其工作模式与二层交换相似,但路由器工作在第三层,这个区别决定了路由和交换在传递包时使用不同的控制信息,实现功能的方式就不同。工作原理是在路由器的内部也有一个表,这个表所标示的是如果要去某一个地方,下一步应该向那里走,如果能从路由表中找到数据包下一步往那里走,把链路层信息加上转发出去;如果不能知道下一步走向那里,则将此包丢弃,然后返回一个信息交给源地址。 路由技术实质上来说不过两种功能:决定最优路由和转发数据包。路由表中写入各种信息,由路由算法计算出到达目的地址的最佳路径,然后由相对简单直接的转发机制发送数据包。接受数据的下一台路由器依照相同的工作方式继续转发,依次类推,直到数据包到达目的路由器。 而路由表的维护,也有两种不同的方式。一种是路由信息的更新,将部分或者全部的路由信息公布出去,路由器通过互相学习路由信息,就掌握了全网的拓扑结构,这一类的路由协议称为距离矢量路由协议;另一种是路由器将自己的链路状态信息进行广播,通过互相学习掌握全网的路由信息,进而计算出最佳的转发路径,这类路由协议称为链路状态路由协议。 由于路由器需要做大量的路径计算工作,一般处理器的工作能力直接决定其性能的优劣。当然这一判断还是对中低端路由器而言,因为高端路由器往往采用分布式处理系统体系设计。 三层交换技术 近年来的对三层技术的宣传,耳朵都能起茧子,到处都在喊三层技术,有人说这是个非常新的技术,也有人说,三层交换嘛,不就是路由器和二层交换机的堆叠,也没有什么新的玩意,事实果真如此吗?下面先来通过一个简单的网络来看看三层交换机的工作过程。 组网比较简单 使用IP的设备A————三层交换机———–使用IP的设备B 比如A要给B发送数据,已知目的IP,那么A就用子网掩码取得网络地址,判断目的IP是否与自己在同一网段。 如果在同一网段,但不知道转发数据所需的MAC地址,A就发送一个ARP请求,B返回其MAC地址,A用此MAC封装数据包并发送给交换机,交换机起用二层交换模块,查找MAC地址表,将数据包转发到相应的端口。 如果目的IP地址显示不是同一网段的,那么A要实现和B的通讯,在流缓存条目中没有对应MAC地址条目,就将第一个正常数据包发送向一个缺省网关,这个缺省网关一般在操作系统中已经设好,对应第三层路由模块,所以可见对于不是同一子网的数据,最先在MAC表中放的是缺省网关的MAC地址;然后就由三层模块接收到此数据包,查询路由表以确定到达B的路由,将构造一个新的帧头,其中以缺省网关的MAC地址为源MAC地址,以主机B的MAC地址为目的MAC地址。通过一定的识别触发机制,确立主机A与B的MAC地址及转发端口的对应关系,并记录进流缓存条目表,以后的A到B的数据,就直接交由二层交换模块完成。这就通常所说的一次路由多次转发。 以上就是三层交换机工作过程的简单概括,可以看出三层交换的特点: 由硬件结合实现数据的高速转发。 这就不是简单的二层交换机和路由器的叠加,三层路由模块直接叠加在二层交换的高速背板总线上,突破了传统路由器的接口速率限制,速率可达几十Gbit/s。算上背板带宽,这些是三层交换机性能的两个重要参数。 简洁的路由软件使路由过程简化。 大部分的数据转发,除了必要的路由选择交由路由软件处理,都是又二层模块高速转发,路由软件大多都是经过处理的高效优化软件,并不是简单照搬路由器中的软件。 结论: 二层交换机用于小型的局域网络。这个就不用多言了,在小型局域网中,广播包影响不大,二层交换机的快速交换功能、多个接入端口和低谦价格为小型网络用户提供了很完善的解决方案。 路由器的优点在于接口类型丰富,支持的三层功能强大,路由能力强大,适合用于大型的网络间的路由,它的优势在于选择最佳路由,负荷分担,链路备份及和其他网络进行路由信息的交换等等路由器所具有功能。 
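上面"学习源 MAC、查表转发、查不到就泛洪"的二层转发流程,可以用几行 Python 示意(端口号与 MAC 均为假设值,仅表达查表逻辑,不含老化、VLAN 等细节):

```python
class L2Switch:
    """二层交换示意: 学习 源MAC->端口; 目的MAC查到则单播,查不到则泛洪。"""
    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}                      # mac -> port,相当于 CAM 表

    def forward(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port        # 学习(同一MAC换端口会覆盖旧记录)
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]     # 已知目的: 只发对应端口
        return sorted(self.ports - {in_port})    # 未知目的: 向除入端口外所有端口泛洪

sw = L2Switch([1, 2, 3, 4])
print(sw.forward(1, "MAC-A", "MAC-B"))   # 目的未知,泛洪到 [2, 3, 4]
print(sw.forward(2, "MAC-B", "MAC-A"))   # 已学到 MAC-A 在端口1 -> [1]
```

第二次转发不再泛洪,正是"下次传送数据时就不再需要对所有端口进行广播"这句话的含义。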
三层交换机的最重要的功能是加快大型局域网络内部的数据的快速转发,加入路由功能也是为这个目的服务的。如果把大型网络按照部门,地域等等因素划分成一个个小局域网,这将导致大量的网际互访,单纯的使用二层交换机不能实现网际互访;如单纯的使用路由器,由于接口数量有限和路由转发速度慢,将限制网络的速度和网络规模,采用具有路由功能的快速转发的三层交换机就成为首选。 一般来说,在内网数据流量大,要求快速转发响应的网络中,如全部由三层交换机来做这个工作,会造成三层交换机负担过重,响应速度受影响,将网间的路由交由路由器去完成,充分发挥不同设备的优点,不失为一种好的组网策略,当然,前提是客户的腰包很鼓,不然就退而求其次,让三层交换机也兼为网际互连。 第四层交换技术 第四层交换的一个简单定义是:它是一种功能,它决定传输不仅仅依据MAC地址(第二层网桥)或源/目标IP地址(第三层路由),而且依据TCP/UDP(第四层) 应用端口号。第四层交换功能就象是虚IP,指向物理服务器。它传输的业务服从的协议多种多样,有HTTP、FTP、NFS、Telnet或其他协议。这些业务在物理服务器基础上,需要复杂的载量平衡算法。在IP世界,业务类型由终端TCP或UDP端口地址来决定,在第四层交换中的应用区间则由源端和终端IP地址、TCP和UDP端口共同决定。 在第四层交换中为每个供搜寻使用的服务器组设立虚IP地址(VIP),每组服务器支持某种应用。在域名服务器(DNS)中存储的每个应用服务器地址是VIP,而不是真实的服务器地址。 当某用户申请应用时,一个带有目标服务器组的VIP连接请求(例如一个TCP SYN包)发给服务器交换机。服务器交换机在组中选取最好的服务器,将终端地址中的VIP用实际服务器的IP取代,并将连接请求传给服务器。这样,同一区间所有的包由服务器交换机进行映射,在用户和同一服务器间进行传输。 第四层交换的原理 OSI模型的第四层是传输层。传输层负责端对端通信,即在网络源和目标系统之间协调通信。在IP协议栈中这是TCP(一种传输协议)和UDP(用户数据包协议)所在的协议层。 在第四层中,TCP和UDP标题包含端口号(portnumber),它们可以唯一区分每个数据包包含哪些应用协议(例如HTTP、FTP等)。端点系统利用这种信息来区分包中的数据,尤其是端口号使一个接收端计算机系统能够确定它所收到的IP包类型,并把它交给合适的高层软件。端口号和设备IP地址的组合通常称作“插口(socket)”。 1和255之间的端口号被保留,他们称为“熟知”端口,也就是说,在所有主机TCP/IP协议栈实现中,这些端口号是相同的。除了“熟知”端口外,标准UNIX服务分配在256到1024端口范围,定制的应用一般在1024以上分配端口号. 
分配端口号的最近清单可以在RFc1700”Assigned Numbers”上找到。TCP/UDP端口号提供的附加信息可以为网络交换机所利用,这是第4层交换的基础。 熟知的端口号举例: 应用协议 端口号 FTP 20(数据),21(控制) TELNET 23 SMTP 25 HTTP 80 NNTP 119 NNMP 16,162(SNMP traps) TCP/UDP端口号提供的附加信息可以为网络交换机所利用,这是第四层交换的基础。 具有第四层功能的交换机能够起到与服务器相连接的“虚拟IP”(VIP)前端的作用。 每台服务器和支持单一或通用应用的服务器组都配置一个VIP地址。这个VIP地址被发送出去并在域名系统上注册。 在发出一个服务请求时,第四层交换机通过判定TCP开始,来识别一次会话的开始。然后它利用复杂的算法来确定处理这个请求的最佳服务器。一旦做出这种决定,交换机就将会话与一个具体的IP地址联系在一起,并用该服务器真正的IP地址来代替服务器上的VIP地址。 每台第四层交换机都保存一个与被选择的服务器相配的源IP地址以及源TCP 端口相关联的连接表。然后第四层交换机向这台服务器转发连接请求。所有后续包在客户机与服务器之间重新影射和转发,直到交换机发现会话为止。 在使用第四层交换的情况下,接入可以与真正的服务器连接在一起来满足用户制定的规则,诸如使每台服务器上有相等数量的接入或根据不同服务器的容量来分配传输流。 本文章来网络","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"二三层转发","slug":"二三层转发","permalink":"http://vcpu.me/tags/二三层转发/"}]},{"title":"集线器、网桥、交换机、中继器原理","slug":"hub","date":"2017-07-07T09:22:34.000Z","updated":"2017-08-02T02:17:55.000Z","comments":true,"path":"hub/","link":"","permalink":"http://vcpu.me/hub/","excerpt":"集线器集线器(HUB),它是工作在物理层的设备,由于它只是工作在物理层的设备,所以它并不关心也不可能关心OSI上面几层所涉及的,它的工作机制流程 是:从一个端口接收到数据包时,会在其他端口把这个包转发一次,因为它不知道也不可能知道这个包是发给谁的(物理层设备只关心电压这些物理概念),它也只能对所有人广播(这里和下文提到的广播该词的意思和ARP请求时的广播有些不同。 这里的广播意思是:使用物理层转发设备,如HUB,导致的广播,可以说这个广播是被逼的,因为设备的问题!是设备转发包引起的广播!而ARP请求的 广播是自己要求的,主动的,因为ARP请求包的目标地址IP是255.255.255.255,但ARP请求的广播涉及IP层的知识,不在这篇文章讨论的 范围,所以这里提到的广播,除非特别说明,否则都是第一个意思,也就说是”因设备转发数据包引起的广播” ),让他们自己处理了。 这样一来会有不少问题,你发的数据其他人都收到了,私隐这总东西是不存在的!别入可以随便监听你信息!所以会话劫持在那个年代相当容易(记得俺第一次接触会话劫持这个概念的时候还是高2,那是2001~2002,呵,那时候集线器还是比较普遍的)。","text":"集线器集线器(HUB),它是工作在物理层的设备,由于它只是工作在物理层的设备,所以它并不关心也不可能关心OSI上面几层所涉及的,它的工作机制流程 是:从一个端口接收到数据包时,会在其他端口把这个包转发一次,因为它不知道也不可能知道这个包是发给谁的(物理层设备只关心电压这些物理概念),它也只能对所有人广播(这里和下文提到的广播该词的意思和ARP请求时的广播有些不同。 这里的广播意思是:使用物理层转发设备,如HUB,导致的广播,可以说这个广播是被逼的,因为设备的问题!是设备转发包引起的广播!而ARP请求的 广播是自己要求的,主动的,因为ARP请求包的目标地址IP是255.255.255.255,但ARP请求的广播涉及IP层的知识,不在这篇文章讨论的 范围,所以这里提到的广播,除非特别说明,否则都是第一个意思,也就说是”因设备转发数据包引起的广播” ),让他们自己处理了。 
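第四层交换的核心可以概括为:按 VIP 接收连接,为新会话选一台真实服务器,并把该会话(源IP、源端口、VIP、VIP端口)固定映射到这台服务器,后续包直接按会话表转发。下面的 Python 片段用最简单的轮询调度示意这一过程(VIP、服务器地址均为假设值,真实设备的负载均衡算法要复杂得多):

```python
from itertools import cycle

class L4Switch:
    """四层交换示意: 同一会话(四元组)始终映射到同一台真实服务器。"""
    def __init__(self, vip, real_servers):
        self.vip = vip
        self.rr = cycle(real_servers)    # 最简单的轮询调度
        self.sessions = {}               # (cip, cport, vip, vport) -> 真实服务器

    def dispatch(self, cip, cport, vport):
        key = (cip, cport, self.vip, vport)
        if key not in self.sessions:     # 新会话(如TCP SYN): 选服务器并记录
            self.sessions[key] = next(self.rr)
        return self.sessions[key]        # 后续包直接查会话表

lb = L4Switch("10.0.0.100", ["192.168.1.1", "192.168.1.2"])
print(lb.dispatch("1.2.3.4", 40000, 80))   # 192.168.1.1
print(lb.dispatch("1.2.3.4", 40000, 80))   # 同一会话,仍是 192.168.1.1
print(lb.dispatch("5.6.7.8", 40001, 80))   # 新会话,轮到 192.168.1.2
```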
这样一来会有不少问题,你发的数据其他人都收到了,私隐这总东西是不存在的!别入可以随便监听你信息!所以会话劫持在那个年代相当容易(记得俺第一次接触会话劫持这个概念的时候还是高2,那是2001~2002,呵,那时候集线器还是比较普遍的)。另外一个比较严重的问题是,如果一个大型的局域网,比如有500台机器,全部用HUB连接的,后果会怎么样呢??相当慢,网络的效率极差!为什么? 如果500台机器都发一个包,那就是说每台机器,都需要接收差不多499个无用包…并且如果是需要回应的话……无用的数据包会充斥着整个的局 域网,这就是传说中的广播风暴! 为了减少广播风暴,网桥产生了(注意这里用的时候“减少”,不是“杜绝”,仅仅是减少!如果仅仅用网桥说能杜绝广播风暴,个人觉得还是不太准确,后来交换机的出现才可以说是完全杜绝了广播风暴的发生)! 在介绍网桥之前,还想简单介绍另一个物理层的设备:“中继器”,这种设备的作用是把物理层传输的信号放大,由于长距离的传输,信号会有一定的损耗的,这种设备主要解决的就是这个问题。它和HUB的区别是:HUB主要是为了在物理层上转发数据的,所以它不关心电压值的大小,也不会放大物理信号;而中继器它的作用就是为了放大信号用的,SO….. 网桥网桥又称桥接器,英文名Network Bridge,数据链路层设备。它也是转发数据包的设备,但和HUB不一样的是,它工作在数据链路层,HUB只能看懂物理层上的东西(比如一段物理信号),网桥却能看懂一些帧的信息(在链路层上,把上面传下来的数据封装后,封装好了的数据就是帧,但这里我用“数据包”这样的泛指去代替“帧”这个专业术语)。 在以太网构造的局域网上,最终的寻址是以数据链路层的MAC地址作为标识的(就是用MAC地址可以在局域网上找到一台唯一的机器),网桥能从发来的 数据包中提取MAC信息,并且根据MAC信息对数据包进行有目的的转发,而不采用广播的方式,这样就能减少广播风暴的出现,提升整个网络的效率,在详细说网桥这东西之前,我想先介绍一下交换机。 交换机: 这是大家最熟悉的设备之一 交换机Switch,数据链路层设备,作用是转发数据包。和网桥一样它也是通过识别帧中的MAC地址,然后对特定的端口进行转发的。 网络基础之网桥和交换机的工作原理及区别网桥和交换机区别在叙述前,我们先一起看两幅图: 网桥的连接模式:(红点处为HUB) 交换机连接模式:从图中可以看到,网桥只有2个(输入/出)端口,而交换机有8个。嗯,是的,一开始的时候(那时候只有HUB这种设备),由于硬件水平不是很发达,人们为了提高局域网效率,减少广播风暴的出现,他们生产了网桥(一个只有两个输入/出端口的链路层设备,这时的网桥已经是个比较先进的设备),然后他们把一个局域网一分为2,中间用网桥连接,这样A发给BCD的数据就不会再广播到EFGH了(网桥发现如果数据包不是转发给下面这个子网的,它会自动丢弃此包),只有从A 发到EFGH的数据包才能通过网桥,到达另外一个子网(网桥发现如果数据包是转发给下面这个子网的,它才会把包转发给这个子网)。 这样一来,非必要的传输减少了,整个网络的效率也随之提高可不少!人们发现网桥真是个好东西呀,随着硬件发展,出现了4个,8个端口的链路层设备,这就是交换机,由于交换机可以使得网络更安全(数据不容易被监听,因为数据不再广播了,注意:只是不容易而已,要搞你其实还是可以的),网络效率更高(还是因为数据不再广播!),交换机渐渐替代了HUB,成为组建局域网的重要设备。 所以说,网桥和交换机,基本上是一样的,嗯,只是基本上而已,细看还是有些不一样的,但在说明他们之间有什么不一样之前,我想先简单介绍一下网桥和交换机的工作原理。 网桥的工作原理: 上图是用一个网桥连接的两个网络,网桥的A端口连接A子网,B端口连接B子网,为什么网桥知道哪些数据包该转发,哪些包不该转发呢?那是因为它有两个表A和B,当有数据包进入端口A时,网桥从数据包中提取出源MAC地址和目的MAC地址。 一开始的时候,表A和表B都是空的,没有一条记录,这时,网桥会把数据包转发给B网络,并且在表A中增加一条MAC地址(把源MAC地址记录表中),说明这个MAC地址的机器是A子网的,同理,当B子网发送数据包到B端口时,网桥也会记录源MAC地址到B表。 当网桥工作一段时候后,表A基本上记录了A子网所有的机器的MAC地址,表B同理,当再有一个数据包从A子网发送给网桥时,网桥会先看看数据包的目的MAC 地址是属于A子网还是B子网的,如果从A表中找到对应则,抛弃该包,如果不是,则转发给B子网,然后检查源MAC地址,是否在表中已经存在,如果不存在,在表A中增加一条记录。 
噢,或许你现在会问了,为什么需要两张表呢,一张表不行么??嗯~刚才把表一分为二是为了便于理解,实际上,真正的网桥里面存的应该是一张表(当然有可能为了提速,或者其他原因,它也可能把信息存为多张表,这个得看它怎么实现了~),如果是一张信息表,表里记录的应该是:MAC-PortNum。 交换机有些许不同,如图: 交换机也有一张MAC-PORT对应表(这张表的学名为:CAM),和网桥不一样的是,网桥的表是一对多的(一个端口号对多个MAC地址),但交换机的CAM表却是一对一的,如果一个端口有新的MAC地址,它不会新增MAC-PORT记录,而是修改原有的记录 例如:现在交换机记录表里已经有一项:MAC1-Port1,如果此刻端口1又来了一个数据包,里面的源MAC地址是MAC2,此时,交换机会刷新交换机记录表:MAC1-Port1记录被修改为MAC2-Port1,因为交换机认为是端口1的计算机MAC地址变了,如果端口1连接的一台物理机器,MAC一般是不会变的,如果连接的是另外一个交换机,那这个端口的记录会变化得比较频繁(如上图的Port12,它是对外的接口,与一个局域网连接),另外,如果 CAM表中没有找到和数据包的目的MAC地址对应的记录,交换机会对此数据包进行广播,发给本交换机的每一个端口。 网桥和交换机的缓存对比网桥和交换机除了用CAM表指导数据包转发这点和HUB不一样外,网桥和交换机还有一个特点:缓存!对,网桥和交换机都有一定量的缓存,因为网桥和交换机转发数据包需要一些额外的操作,所以可能会占用一些时间,为了避免出现因来不及转发数据,导致大量数据丢失的情况,网桥和交换机就出现了缓存。 当然,缓存不是万能的,当网桥或交换机处理不及并且缓存用完了,以后再来的数据还是会丢失的。还一个网桥需要缓存的原因是:桥接两个传输速率不同的局域网,比如:802.3的传输速率说10mb/s,但实际上并不是真的10Mb/s,而 802.4(几乎)确实为10Mb/s,但两个局域网桥接时候必须是速率一样的,否则会有包丢失,而缓存也正好可以为不同速率网络连接时所使用。 网桥的缓存量是大于交换机的缓存量,因为网桥天生是为了连接两个网络的,两个网络桥接时所要处理的数据量会比一般的交换机需要处理的数据量要多,所以网桥的缓存也一般要比交换机大一点。","categories":[{"name":"网络","slug":"网络","permalink":"http://vcpu.me/categories/网络/"}],"tags":[{"name":"集线器","slug":"集线器","permalink":"http://vcpu.me/tags/集线器/"},{"name":"网桥","slug":"网桥","permalink":"http://vcpu.me/tags/网桥/"},{"name":"中继器","slug":"中继器","permalink":"http://vcpu.me/tags/中继器/"},{"name":"交换机","slug":"交换机","permalink":"http://vcpu.me/tags/交换机/"}]},{"title":"拥塞控制","slug":"TCP拥塞控制","date":"2017-07-04T10:00:00.000Z","updated":"2017-07-05T01:33:00.000Z","comments":true,"path":"TCP拥塞控制/","link":"","permalink":"http://vcpu.me/TCP拥塞控制/","excerpt":"TCP拥塞控制设计概念延迟确认定时器 TCP延时确认定时器是指在一个TCP连接中,当一方收到另一端的数据后,并不是立刻ACK确认,而是等待200ms(2.6内核 40ms),如果这段时间内有新的数据要发往对方,本地奖ACK和数据封装在一个数据包中捎带发送;如果这段时间内没有新的数据要发往对方,200ms后ack确认。 优点:减少传输消耗缺点:增加了延迟 超时重传定时器超时重传是TCP协议栈保障数据可靠性的一个重要机制原理: 发送一个数据后同时开启定时器,在一定时间内如果没有收到对方确认,定时器激活重新发送数据包,直到发送成功或者到达最大重传次数。","text":"TCP拥塞控制设计概念延迟确认定时器 TCP延时确认定时器是指在一个TCP连接中,当一方收到另一端的数据后,并不是立刻ACK确认,而是等待200ms(2.6内核 40ms),如果这段时间内有新的数据要发往对方,本地奖ACK和数据封装在一个数据包中捎带发送;如果这段时间内没有新的数据要发往对方,200ms后ack确认。 优点:减少传输消耗缺点:增加了延迟 
超时重传定时器超时重传是TCP协议栈保障数据可靠性的一个重要机制原理: 发送一个数据后同时开启定时器,在一定时间内如果没有收到对方确认,定时器激活重新发送数据包,直到发送成功或者到达最大重传次数。 RTO(重传超时时间) 如果RTO值被设置过大,将会使得发送端经过很长时间等待才能发现报文段丢失,会降低吞吐如果RTO值被设置很小,发送端尽管可以很快的检测出报文段的丢失,但也可能将一些延迟大的报文段误认为是丢失,造成不必要重传,浪费网络资源。 慢启动阈值慢启动阈值是慢启动算法和拥塞避免的分水岭,当拥塞窗口大于慢启动阈值时,就用拥塞避免慢启动阈值门限就用慢启动 SACK机制SACK TCP选项 SACK是TCP一个选项,握手过程中会协商判断是否支持SACK,如果支持会在TCP选项中SACK permitted。SACK 选项格式 TCP不能超过四组SACK边界,因为TCP选项最大支持40个字节 作用描述 SACK通常是由数据接收方产生,收到的SYN包中有SACK-Permitted选项为true,同时自己也支持SACK,可以在接收数据异常时候,产生SACK option。SACK中需要携带接收队列中没有被确认的数据信息,已经确认的数据信息不会通过SACK携带。发送端SACK含义: 第一个block指出是哪个segment触发了SACK 选项,发sack人认为是谁乱序了才导致SACK 尽可能填满SACK SACK需要报告最近接收的不连续的数据块接收端: 数据没有被确认前都会保持在滑动窗口内 每一个数据包都有一个sakced标记,发送数据会忽略被sacked的数据 如果SACK丢失,超时重传后重置所有数据包的SACKed标记SACK分析 适用于多包丢失情况,可以快速退出快速恢复阶段;如果丢包率很低,或者丢包时常常只丢一个包,那么SACK就是多余的。 D-SACKD-SACK主要是使用SACK来告诉发送方有哪些数据被重复接收了,如果是D-SACK;D-SACK option的第一个block代表呗重复发送的序号片段。注意:D_SACK仅仅是接收端的报告重复连续的片段每个重复连续片段只能在第一个block,其它block是接收端已经收到但是还没有ACK的数包汇报重复收到片段 此时数据发送端,可以知道,是对方的ACK丢失了导致此种情况 慢启动过程最初的TCP在建立完成后会向网络中发送大量的数据包,这样很容易导致网络中路由缓存空间耗尽,从而发生拥塞。慢启动就是为解决该问题,其思想是一开始不是发大量的数据包而是根据网络情况逐步增加每次的发送的数量,以避免上述现象的发生。其基本做法为:新建连接时候,CWND初始化1个最大报文段大小,每当一个报文段被确认,拥塞窗口就增价1个mss大小,在发出的报文段均被正常确认情况 下,拥塞窗会随着往返时间指数增长。慢启动并不慢。RTTlogW 时间可占满带宽。 拥塞避免拥塞窗口不能一直增长下去,其受慢启动阈值(一开始为65535)限制;如果超过该阈值,进入拥塞避免阶段。 拥塞避免思想 拥塞窗口的加法增大,拥塞窗口随着RTT开始线性增加,这样可以避免增长过快导致网络拥塞,慢慢调整到网络的最佳值。 如何确定拥塞状态超时重传此时发出去的报文在RTT时间内没有被确认,此时发生拥塞可能性较大,TCP协议栈强烈反应为: 慢启动阈值 降低为当前拥塞窗口的一半拥塞窗口设置为1,从新进入慢启动 快速重传收到3个相同的ACK,TCP在收到乱序包时就会立即发送ACK,TCP利用3个相同的ACK判定数据包失效,发送快速重传,TCP协议栈强烈反应为: 慢启动阈值降低设置为拥塞窗口的一半拥塞窗口设置为慢启动阈值,从新进入拥塞避免阶段 快速恢复当收到三个重复ACK时,随着TCP协议栈改进TCP进入的不是拥塞避免,而是快速恢复。快速恢复思想是: 数据包守恒原则,当老的数据包离开了网络后,才能向网络中发送一个新的数据包。 如果收到一个重复ACK,代表已经有一个数据包离开了网络,于是拥塞窗口加1,此时能向网络中发一个新的数据包。 具体步骤: 当收到3个重复ACK,慢启动阈值会降为拥塞窗口的一半,把拥塞窗口设置为慢启动阈值大小+3(3个数据包离开网络); 再收到重复ACK时,拥塞窗口+1 当收到新的数据包ACK时候,拥塞窗口设置为慢启动阈值,快速恢复结束,可以回到拥塞避免阶段了 上述算法是reno算法,新版本的reno算法差异为: reno算法发再收到一个新的数据ACK退出快速恢复状态,new reno 需要收等到改窗口所有的数据包确认才会推出快速恢复。 TCP协议栈拥塞窗口改变时机拥塞窗口改变只发生在收到ACK和重传定时器超时。 
Congestion-control state classification in the Linux stack. Open state: slow start and congestion avoidance are both states in which no congestion has occurred and the network is clear; the Linux stack uses the Open state to represent slow start and congestion avoidance. When a TCP session first establishes a connection, no congestion has happened yet and the slow-start threshold cannot be estimated, so the Linux stack sets it to the huge value 0xffffffff. In the slow-start phase both Reno and Cubic increase the congestion window by n whenever n segments are acknowledged, but the two differ in congestion avoidance and in how they compute the slow-start threshold. Disorder state: the Linux kernel TCP implementation has a Disorder state, a transitional state between Open and fast retransmit; fast retransmit only starts after 3 duplicate ACKs, and the interval from the first duplicate ACK to the third is spent in the Disorder state. The Disorder state is necessary because with only a few duplicate ACKs we cannot yet tell whether a packet was lost, since the peer also sends selective acknowledgments when it receives out-of-order segments. With enough duplicate ACKs, say 3, we enter the Recovery state, whereas if after 1-2 duplicate ACKs a new data ACK arrives, we return to Open. On returning from Disorder to Open, the congestion window and slow-start threshold are unchanged, and slow start or congestion avoidance simply continues. Recovery state: in the Linux kernel, fast retransmit (the action) and fast recovery (the phase) are represented by the Recovery state. The fast-recovery phase runs from the start of fast retransmit until no lost segments remain in the network and we can return to Open, with the congestion window and slow-start threshold adjusted back. Loss state: the state entered after the retransmission timer expires. A retransmission timeout means the network has become very bad: the congestion window is shrunk to a small value (typically 1), the slow-start threshold is halved, and slow start begins anew; all previously sent packets are assumed lost. The Loss state is exited only after every packet sent before the timeout has been acknowledged. Difference between Loss and Open: in Loss, slow start is used to retransmit the lost segments, and new data is sent only after the lost segments are acknowledged, whereas in Open no packets were lost and new data keeps flowing. Initialization of congestion control in the Linux stack: when TCP moves from SYN_SENT or SYN_RECV into ESTABLISHED, the congestion window and slow-start threshold are initialized. In kernel 2.6 the congestion window is initialized according to the MSS: the larger the MSS, the smaller the window. Because network conditions are hard to estimate when a connection is new, the slow-start threshold is generally set very large (0x7fffffff) and is only adjusted once congestion is detected. In the Linux TCP implementation, when a connection enters TIME_WAIT or LAST_ACK on its way to closing, the congestion window and slow-start threshold are selectively saved, to serve as the initial values for the next TCP session on the same path. Miscellaneous notes on Linux congestion control. Limits on sending packets: sending is actually bounded by three conditions. Condition 1: the sender's congestion window. Condition 2: the receiver's receive window. Condition 3: the amount of data actually on the send queue. Congestion-window policy in slow start: in the slow-start phase, the window grows by as many packets as are acknowledged. The usual description is that the congestion window doubles per transmission round; this means the same thing as growing by however many packets are acknowledged, and the latter maps better onto code. Congestion-window adjustment in congestion avoidance: the window increases by 1 for every fixed number of ACKs received; Reno's congestion avoidance thus looks like the window holding a constant value for a stretch of time and then incrementing by 1. Window adjustment in the Disorder phase: Disorder is a wait-and-see state in which the congestion window and slow-start threshold stay unchanged; as long as in_flight < cwnd, the peer can receive, and the send queue still has data, sending continues. If a new data ACK arrives in Disorder, the segment that triggered the duplicate ACKs has reached the receiver, and we return to Open with the window and threshold untouched. If enough duplicate ACKs arrive (typically 3), we enter Recovery. Window adjustment in fast recovery: the slow-start threshold is adjusted before entering fast recovery, and on exit the congestion window equals the slow-start threshold; different recovery algorithms use different window policies during the phase. Step-down adjustment: the congestion window is dropped immediately on entering fast recovery; this is what Windows does. Proportional adjustment: the congestion window transitions gradually down to the slow-start threshold. Handling further duplicate ACKs during fast retransmit, the Linux approach: if the congestion window has not yet come down to the slow-start threshold, the window is decremented by 1 for every two duplicate ACKs and the lost segment is sent; otherwise the window stays unchanged and one retransmission is sent per duplicate ACK. No new data segments are sent during this period. Handling partial ACKs during fast retransmit, the Linux 2.6 approach: when an ACK acknowledges new data there are two possibilities. Possibility one: the retransmitted data reached the peer. 
Possibility two: normal data reached the peer. To distinguish the two cases, the stack compares the timestamp of the ACK for new data with the timestamp of the retransmission. If the retransmission was sent later, the packet was not actually lost but merely arrived late; the slow-start threshold is restored to its previous value, and the data covered by this ACK is not retransmitted. If instead the new data ACK was produced by the retransmission reaching the peer, the ACKed segment is retransmitted and, if the congestion window allows, new data is sent. Receiving an ACK in the Loss state: if we can tell from the ACK that it was sent before our last retransmission, the retransmission was unnecessary, as the retransmitted segment had already reached the peer. The congestion window and threshold are restored to their pre-Loss values, we act as if nothing happened, and sending continues. Otherwise: keep retransmitting the remaining data on the queue, sending no new data during the retransmissions; once retransmission completes, send according to the congestion window. On entering Loss, the highest sequence number that must at least be retransmitted is recorded; once that data is confirmed delivered, the stack leaves Loss and enters Open. False positives in congestion control: false positives occur on entering the Recovery and Loss states. Three duplicate ACKs do not necessarily mean fast retransmit is needed, and a retransmission timeout does not necessarily mean the network is congested. The problem with falsely entering Recovery: the reductions of the congestion window and slow-start threshold were unnecessary and cause a needless drop in performance. Timeout false positives: because of network instability the RTT varies constantly, so spurious timeouts can happen, but they are less likely than spurious recovery, since the retransmission timer RTO is larger than the RTT. Detecting a false positive: for Recovery, if an ACK confirms the packet before the retransmitted copy could have arrived, the episode was spurious and we go straight back to the previous state. This kind of undo requires timestamp support in Linux; detection of false positives usually relies on timestamps and SACK. The reordering threshold: this threshold is the value of 3 duplicate ACKs that keeps coming up in kernel descriptions; in practice the value can change, for example it is raised when a false positive occurs, which is part of Linux's rather conservative implementation. Overview of Linux congestion control: the window grows quickly in slow start, probes gradually in congestion avoidance, waits and watches on the first duplicate ACKs, and starts fast retransmit once enough duplicates arrive; in fast recovery it tries to retransmit the segments that may have been lost while sending a suitable amount of new data to keep the stack running. On leaving fast retransmit, the congestion window and slow-start threshold shrink to their expected values and we return to congestion avoidance, with a timer armed to handle long periods without feedback; when the retransmission timer fires, retransmit in bulk and restart slow start. Doubling growth in slow start is basically taken for granted, and within the lifetime of a TCP connection, barring an outage, retransmission timeouts are rare, so congestion-control algorithms concentrate on the congestion-avoidance phase, with occasional stays in Recovery: congestion avoidance, Recovery, congestion avoidance, Recovery. Performance differences between congestion-control algorithms mainly come from: 1. the initial window of the congestion-avoidance phase, i.e. the slow-start threshold when fast recovery is exited; 2. the window-adjustment policy during congestion avoidance; 3. how quickly the Recovery state completes (when SACK is not supported). Congestion-control state walkthrough. Slow-start phase: 1. All sent data is ACKed: the congestion window keeps growing exponentially; put differently, it grows by however many packets are ACKed, so the window keeps increasing. 2. Duplicate ACKs received but fewer than 3: the congestion window and slow-start threshold do not change; sending behavior depends on the current congestion window, the receiver's window, and the amount of data on the send queue. If a new data ACK then arrives, slow start continues as if nothing had happened; if no new data ACK arrives, fast retransmit is eventually triggered. 3. Enough duplicate ACKs: they trigger fast retransmit; the slow-start threshold drops to half the current congestion window, and the congestion window is set to the threshold plus 3 (3 duplicate ACKs mean 3 packets have left the network and reached the peer). The lost packet is retransmitted immediately, and whether new data is also sent depends entirely on the current congestion window. Once all the lost packets are confirmed delivered, fast recovery is exited back into Open; since the congestion window then exceeds the slow-start threshold, congestion avoidance follows. 4. A sent packet times out: the slow-start threshold becomes half the current window, the congestion window drops to 1, and the segments presumed lost are resent under slow-start pacing; once they are acknowledged, the timeout phase is exited back to Open, and slow start or congestion avoidance proceeds according to the current window and threshold. Congestion-avoidance phase: 1. All sent data is ACKed: the congestion window keeps growing linearly and the slow-start threshold is unchanged. 2. Duplicate ACKs received but fewer than 3: this is the stack's Disorder state; the threshold and window stay fixed and sending is bounded by the congestion window. If a new data ACK follows, we return from Disorder to congestion avoidance and resume linear growth. 3. Enough duplicate ACKs: fast retransmit is triggered; the threshold drops to half the current window and the window is set to the threshold plus 3; the lost packet is retransmitted, and new data depends on the current window. Once all lost packets are confirmed delivered, fast recovery is exited back into congestion avoidance. 4. A sent packet times out: 
慢启动阈值变成当前拥塞窗口一半,拥塞窗口减少到1,会按照慢启动发包行为完成已经发送数据包的重传。待这些报文被确认后退出loss阶段,从新回到open状态,根据慢启动阈值和拥塞窗口大小进行慢启动或者拥塞避免发包。","categories":[{"name":"TCP","slug":"TCP","permalink":"http://vcpu.me/categories/TCP/"}],"tags":[{"name":"拥塞控制","slug":"拥塞控制","permalink":"http://vcpu.me/tags/拥塞控制/"},{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"Linux拥塞控制基本概念","slug":"Linux拥塞控制基本概念","permalink":"http://vcpu.me/tags/Linux拥塞控制基本概念/"}]},{"title":"软中断和硬中断","slug":"interrupt","date":"2017-07-04T04:00:00.000Z","updated":"2017-07-04T03:28:38.000Z","comments":true,"path":"interrupt/","link":"","permalink":"http://vcpu.me/interrupt/","excerpt":"中断概述Linux内核需要对连接到计算机上所有硬件设备进行管理,毫无疑问这是它分内的事情。其通过中断机制让管理的硬件设备主动通知,而不是其主动轮询。中断是一种电信号,由硬件设备产生送入中断控制器的输入引脚,然后中断控制器会想处理器发出信号;处理器收到该信号后,停下当前正在处理的事情,跳到中断处理程序的入口点,进行中断处理。当然处理器会通知操作系统已经产生中断;操作系统也可能会进行适当的处理。 处理器通过中断向量识别产生的中断,linux系统下Intel X86支持256中断向量,中断编号0-255 0-31 异常 非屏蔽 固定不变32-47 屏蔽中断(IO设备)48-25 软中断 硬件中断硬中断是外部设备对CPU的中断,硬中断可抢占软中断,优先级高执行较快。硬中断的本质是接收到中断信号后,跳转到公共段代码执行do_IRQ,并切换到硬中断请求栈,执行中断回调函数。","text":"中断概述Linux内核需要对连接到计算机上所有硬件设备进行管理,毫无疑问这是它分内的事情。其通过中断机制让管理的硬件设备主动通知,而不是其主动轮询。中断是一种电信号,由硬件设备产生送入中断控制器的输入引脚,然后中断控制器会想处理器发出信号;处理器收到该信号后,停下当前正在处理的事情,跳到中断处理程序的入口点,进行中断处理。当然处理器会通知操作系统已经产生中断;操作系统也可能会进行适当的处理。 处理器通过中断向量识别产生的中断,linux系统下Intel X86支持256中断向量,中断编号0-255 0-31 异常 非屏蔽 固定不变32-47 屏蔽中断(IO设备)48-25 软中断 硬件中断硬中断是外部设备对CPU的中断,硬中断可抢占软中断,优先级高执行较快。硬中断的本质是接收到中断信号后,跳转到公共段代码执行do_IRQ,并切换到硬中断请求栈,执行中断回调函数。 硬件中断流程硬中断的汇编处理->do_IRQ->handle_irq->handle_edge_irq(handle_level_irq)->handle_irq_event->具体设备的硬中断处理 嵌套linux下硬件中断可以嵌套,且无优先级别;除同种中断外,一个中断可打断另一个中断。此种机制短时间内可以接受更多的中断,可以有大的设备控制吞吐量;无优先级可以简化内核。同种中断处理机制可以描述为,中断数据结构会设置IRQD_IRQ_INPROGRESS中断不处理标识,本地CPU或者其它CPU如果检查到此种中断的该标记,会直接退出,置上IRQS_PENDING后续处理标记。 软中断软中断是硬中断服务程序对内核的中断,软中断时一种推后执行的机制,软中断是bottom half,上半部在屏蔽中断的上下文中运行,软中断相对来讲不是非常紧急,通常还比较耗时,不会在中断上下文中执行系统会自行安排运行时机。软中断不会抢占另一个软中断。 原理概述1.软中断通过open_softirq注册一个软中断处理函数,在软中断向量表softirq_vec数组中添加新的action函数1234567//定时器init_timers调用初始化软中断调用函数open_softirq(TIMER_SOFTIRQ, 
run_timer_softirq);...void open_softirq(int nr, void (*action)(struct softirq_action *)){ softirq_vec[nr].action = action;} 2.调用raise_softirq软中断触发函数,即软中断标记为挂起状态12345678910111213141516171819202122232425262728293031/* * This function must run with irqs disabled! */inline void raise_softirq_irqoff(unsigned int nr){ //设置 __raise_softirq_irqoff(nr); /* * If we're in an interrupt or softirq, we're done * (this also catches softirq-disabled code). We will * actually run the softirq once we return from * the irq or softirq. * * Otherwise we wake up ksoftirqd to make sure we * schedule the softirq soon. */ //不能在硬中断,必须要硬中断处理完 //不能在软中断里,软中断不能嵌套 if (!in_interrupt()) wakeup_softirqd();}void raise_softirq(unsigned int nr){ unsigned long flags; //关闭本地CPU中断 local_irq_save(flags); raise_softirq_irqoff(nr); local_irq_restore(flags);} 内核会在一些位置检查是否有挂起状态的软中断,如果有的话调用do_softirq执行软中断处理action函数3.do_softirq完成两件事情(1)切换到软件请求栈,让其处于软中断上下文(2)执行do_softirq4.do_softirq(1)执行软中断处理函数(2)如果软中处理函数超过10个,唤醒内核线程让其处理本地CPU软中断。 软中断本质就是内核在某些位置检查是否有挂起的软中断(local_software_pending()不为0指有挂起软中断),若有则调用do_softirq切换到软中断请求栈,调用__do_softirq。 进程角度看软中断执行过程步骤1:将返回四值和CPU状态寄存器压栈步骤2:修改特权级别(系统程序需要核心态特权才能运行,用户态函数只能通过软中断调用系统API),设置中断事务标记步骤3:唤醒守护线程,检测中断状态寄存器,发现软中断事务步骤4:根据中断号通过查找中断向量表,找到ISR中断服务历程地址,跳转执行步骤5:中断服务程序执行完成后,返回压栈的函数执行点 嵌套软中断不打断软中断,相同软中断可在所有CPU上同时执行 软中断触发时机(1)调用do_IRQ完成I/O中断时调用irq_exit irq_exit->invoke_softirq->do_softirq(2)如果系统使用I/O APIC,在处理完本地时钟中断时(3)local_bh_enable->do_softirq1234void local_bh_enable(void){ _local_bh_enable_ip(_RET_IP_);} (4)在SMP中,当CPU处理完被CALL_FUNCTION_VECTOR处理器间中断所触发的函数时:12345678void smp_trace_call_function_interrupt(struct pt_regs *regs){ smp_entering_irq(); trace_call_function_entry(CALL_FUNCTION_VECTOR); __smp_call_function_interrupt(); trace_call_function_exit(CALL_FUNCTION_VECTOR); exiting_irq();} exiting_irq->irq_exit __do_softirq 
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485asmlinkage void __do_softirq(void){ struct softirq_action *h; __u32 pending; //软中断结束时间 unsigned long end = jiffies + MAX_SOFTIRQ_TIME; int cpu; unsigned long old_flags = current->flags; //软中断执行次数10次 int max_restart = MAX_SOFTIRQ_RESTART; /* * Mask out PF_MEMALLOC s current task context is borrowed for the * softirq. A softirq handled such as network RX might set PF_MEMALLOC * again if the socket is related to swap */ current->flags &= ~PF_MEMALLOC; //获得CPU的软中断掩码,这时候仍然是关中断,可安全获得掩码 pending = local_softirq_pending(); //统计信息:进程被中断使用时间 account_irq_enter_time(current); //执行完该函数后,关闭软中断,后续即使硬件再次触发新的软中断,也不会重新进入__do_softirq __local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET); lockdep_softirq_enter();//just for debugging cpu = smp_processor_id(); restart: /* Reset the pending bitmask before enabling irqs */ //中断掩码清0,当然局部变量pending已经存储下来了,开启硬件中断后,可设置上新的软中断了 set_softirq_pending(0); //开硬件中断,由于软中断执行时间一般较长,这里将中断打开避免长时间关中断,这段处理时间硬件中断就不会丢失了 local_irq_enable(); h = softirq_vec; do { if (pending & 1) {//中断挂起 unsigned int vec_nr = h - softirq_vec;//获取中断号 //保存抢占计数,后续无法破坏该计数了 int prev_count = preempt_count(); //软中断在每个核上执行计数 kstat_incr_softirqs_this_cpu(vec_nr); trace_softirq_entry(vec_nr); //执行回调函数 h->action(h); trace_softirq_exit(vec_nr); //软中断回调函数破坏了抢占计数,打印高级别警告信息,并恢复抢占计数 if (unlikely(prev_count != preempt_count())) { printk(KERN_ERR \"huh, entered softirq %u %s %p\" \"with preempt_count %08x,\" \" exited with %08x?\\n\", vec_nr, softirq_to_name[vec_nr], h->action, prev_count, preempt_count()); preempt_count() = prev_count; } rcu_bh_qs(cpu); } //处理下一个软中断 h++; pending >>= 1; } while (pending);//无软中断循环结束//处理完一轮软中断后,因为处理时候中断是开启的,可能发生了硬件中断重新触发了软中断//我们就关中断保障中断掩码再被修改 local_irq_disable(); //如果没有超过10次,且处理时间也在合法范围内,继续处理,否则唤醒ksoftirqd守护线程处理软中断 pending = local_softirq_pending(); if (pending) { if (time_before(jiffies, end) && !need_resched() && 
--max_restart) goto restart; //调用线程处理剩下的中断 wakeup_softirqd(); } lockdep_softirq_exit(); account_irq_exit_time(current); __local_bh_enable(SOFTIRQ_OFFSET); tsk_restore_flags(current, old_flags, PF_MEMALLOC);} 防止软中断嵌套的流程:关软中断中肯定有一句原子地加1的关键语句,如果当前内核路径A在该原子操作之前被另一个内核路径B打断,则B执行完硬中断和软中断后,返回到A的此处,A接着执行该原子操作,之后的软中断处理应该是空转,因为肯定已经被B处理完了。如果在该原子操作之后被B打断,则B执行完硬中断,不会执行自己的软中断而是会直接退出(因为软中断嵌套了),返回到A的此处,A接着执行,这次A除了处理自己软中断,还会额外地处理B的软中断。对于preempt_count中的软中断位,由上述可以知道,它的作用有两个:防止软中断在单cpu上嵌套;保证了在执行软中断期间不被抢占。 ksoftirqd进程run_ksoftirqd是ksoftirqd线程的核心处理函数123456789101112131415static void run_ksoftirqd(unsigned int cpu){ //1.把当前CPU中断中断关掉 local_irq_disable(); //2.当前CPU是否有软中断 if (local_softirq_pending()) { //3.处理软中断 __do_softirq(); rcu_note_context_switch(cpu); local_irq_enable(); cond_resched(); return; } local_irq_enable();} 该内核线程的优先级较低,且采用关闭中断保护方式,而不是关闭抢占保护方式,让更多的软中断被其它人调用执行。达到ksoftirqd进程的辅助作用。一旦开始执行中断就不允许抢占了,软中断和硬中断都是这个做法,在执行期间不允许调度。","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"软中断","slug":"软中断","permalink":"http://vcpu.me/tags/软中断/"},{"name":"硬中断","slug":"硬中断","permalink":"http://vcpu.me/tags/硬中断/"}]},{"title":"惊群探究","slug":"惊群","date":"2017-06-29T04:00:00.000Z","updated":"2017-06-29T03:06:20.000Z","comments":true,"path":"惊群/","link":"","permalink":"http://vcpu.me/惊群/","excerpt":"惊群发生在多进程或者多线程,等待同一个socket事件,当该事件发生,这些进程或者线程都被唤醒 发生位置2.6版本内核accept已经解决该问题了,但是select/poll或者epool_wait仍然存在该问题 产生影响一个连接来临时,多个子进程同时被唤醒,却只有一个子进程accept成功,其余都失败,重新休眠;产生了没有必要的唤醒和上下文切换,造成性能浪费。 惊群实例分析总体结论:accept linux内核已经解决惊群问题3.10.0-514.16.1.el7.x86_64,具体唤醒方式实现参见socket信号处理博文select/pool和epool内核并没有支持解决惊群问题。之前一直觉得epool和select使用非常平凡,为什么操作系统不直接自己解决惊群问题?应该是epool和select不仅仅使用在socket监听中,会存在需要唤醒多个进程的使用场景。","text":"惊群发生在多进程或者多线程,等待同一个socket事件,当该事件发生,这些进程或者线程都被唤醒 发生位置2.6版本内核accept已经解决该问题了,但是select/poll或者epool_wait仍然存在该问题 产生影响一个连接来临时,多个子进程同时被唤醒,却只有一个子进程accept成功,其余都失败,重新休眠;产生了没有必要的唤醒和上下文切换,造成性能浪费。 惊群实例分析总体结论:accept 
linux内核已经解决惊群问题3.10.0-514.16.1.el7.x86_64,具体唤醒方式实现参见socket信号处理博文select/pool和epool内核并没有支持解决惊群问题。之前一直觉得epool和select使用非常平凡,为什么操作系统不直接自己解决惊群问题?应该是epool和select不仅仅使用在socket监听中,会存在需要唤醒多个进程的使用场景。 accept实例背景:3.10.0-514.16.1.el7.x86_64内核下,在创建、绑定和监听后,创建多进程accept 建立好的fd,当一个客户端访问该服务端时候,观察进程的连接获取情况123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051#include <sys/types.h>#include <sys/socket.h>#include <unistd.h>#include <arpa/inet.h>#include <stdio.h>#include <stdlib.h>#include <errno.h>#include <strings.h>#define SERV_PORT 9999int main(int argc,char **argv){ int listenfd,connfd; pid_t childpid,childpid2; socklen_t clilen; struct sockaddr_in cliaddr,servaddr; listenfd = socket(AF_INET,SOCK_STREAM,0); bzero(&servaddr,sizeof(servaddr)); servaddr.sin_family = AF_INET; servaddr.sin_addr.s_addr = htonl (INADDR_ANY); servaddr.sin_port = htons (SERV_PORT); bind(listenfd, (struct sockaddr *) &servaddr, sizeof(servaddr)); listen(listenfd,1000); clilen = sizeof(cliaddr); if( (childpid = fork()) == 0) { while(1) { connfd = accept(listenfd,(struct sockaddr *) &cliaddr,&clilen); printf(\"fork 1 is [%d],error is %m\\n\",connfd); } } if( (childpid2 = fork()) == 0) { while(1){ connfd = accept(listenfd,(struct sockaddr *) &cliaddr,&clilen); printf(\"fork 2 is [%d],error is %m\\n\",connfd); } } sleep(100); return 1;} 窗口1:编译运行例子 123[root@localhost demo]# gcc jq.c -o jq[root@localhost demo]# ./jqfork 1 is [4],error is Success 窗口2:访问9999端口1[root@localhost ~]# curl http://127.0.0.1:9999 结果:在创建、绑定和监听后,创建多进程监听后,只有一个进程被唤醒接收处理fd,其它进程均在休眠阶段,在linux内核3.10.0-514.16.1.el7.x86_64版本下,多进程accept连接时候不存在惊群现象。 select实例背景:3.10.0-514.16.1.el7.x86_64内核下,在创建、绑定和监听后,创建多进程select监听 建立好的fd,当一个客户端访问该服务端时候,多进程活动情况 jingqunselect.c1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192#include <sys/types.h>#include <sys/socket.h>#include <unistd.h>#include 
<arpa/inet.h>#include <stdio.h>#include <stdlib.h>#include <errno.h>#include <strings.h>#define SERV_PORT 8888int main(int argc,char **argv){ int listenfd,connfd; pid_t childpid,childpid2; socklen_t clilen; struct sockaddr_in cliaddr,servaddr; listenfd = socket(AF_INET,SOCK_STREAM,0); bzero(&servaddr,sizeof(servaddr)); servaddr.sin_family = AF_INET; servaddr.sin_addr.s_addr = htonl (INADDR_ANY); servaddr.sin_port = htons (SERV_PORT); bind(listenfd, (struct sockaddr *) &servaddr, sizeof(servaddr)); listen(listenfd,1000); clilen = sizeof(cliaddr); if( (childpid = fork()) == 0) { int maxsock = listenfd+1; fd_set fdsr; FD_ZERO(&fdsr); FD_SET(listenfd, &fdsr); struct timeval tv; tv.tv_sec = 30; tv.tv_usec = 0; int ret = select(maxsock, &fdsr,NULL,NULL,&tv); if(ret < 0) { printf(\"[%d]child err 1 \\n\",getpid()); } else if(ret == 0) { printf(\"time out\\n\"); } else { printf(\"[%d] rcv singal \\n\",getpid()); while(1) { connfd = accept(listenfd,(struct sockaddr *) &cliaddr,&clilen); printf(\"[%d] [%d],error is %m\\n\",getpid(),connfd); } } } if( (childpid2 = fork()) == 0) { int maxsock = listenfd+1; fd_set fdsr; FD_ZERO(&fdsr); FD_SET(listenfd, &fdsr); struct timeval tv; tv.tv_sec = 30; tv.tv_usec = 0; int ret = select(maxsock, &fdsr,NULL,NULL,&tv); if(ret < 0) { printf(\"[%d]child 2 err\\n\",getpid()); } else if(ret == 0) { printf(\"time out\\n\"); } else { printf(\"[%d] rcv singal \\n\",getpid()); while(1) { connfd = accept(listenfd,(struct sockaddr *) &cliaddr,&clilen); printf(\"[%d] [%d],error is %m\\n\",getpid(),connfd); } } } sleep(100); return 1;} 窗口1:编译运行上述代码123456[root@localhost demo]# gcc jingqunselect.c -o jqselect[root@localhost demo]#[root@localhost demo]# ./jqselect[23954] rcv singal[23955]rcv singal[23954] [4],error is Success 窗口2:[root@localhost ~]# curl http://127.0.0.1:8888结论:有连接建立信号时候两个监听进程均被唤醒,也就是存在惊群问题。 epool实例背景:3.10.0-514.16.1.el7.x86_64内核下,在创建、绑定和监听后,创建10个子进程epool监听 
建立好的fd,当一个客户端访问该服务端时候,多进程活动情况123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149[root@localhost demo]# cat epjq.c#include <sys/types.h>#include <sys/socket.h>#include <sys/epoll.h>#include <netdb.h>#include <string.h>#include <stdio.h>#include <unistd.h>#include <fcntl.h>#include <stdlib.h>#include <errno.h>#include <sys/wait.h>#define PROCESS_NUM 10static intcreate_and_bind (char *port){ int fd = socket(PF_INET, SOCK_STREAM, 0); struct sockaddr_in serveraddr; serveraddr.sin_family = AF_INET; serveraddr.sin_addr.s_addr = htonl(INADDR_ANY); serveraddr.sin_port = htons(atoi(port)); bind(fd, (struct sockaddr*)&serveraddr, sizeof(serveraddr)); return fd;} static intmake_socket_non_blocking (int sfd){ int flags, s; flags = fcntl (sfd, F_GETFL, 0); if (flags == -1) { perror (\"fcntl\"); return -1; } flags |= O_NONBLOCK; s = fcntl (sfd, F_SETFL, flags); if (s == -1) { perror (\"fcntl\"); return -1; } return 0;}#define MAXEVENTS 64intmain (int argc, char *argv[]){ int sfd, s; int efd; struct epoll_event event; struct epoll_event *events; sfd = create_and_bind(\"8888\"); if (sfd == -1) abort (); s = make_socket_non_blocking (sfd); if (s == -1) abort (); s = listen(sfd, SOMAXCONN); if (s == -1) { perror (\"listen\"); abort (); } efd = epoll_create(MAXEVENTS); if (efd == -1) { perror(\"epoll_create\"); abort(); } event.data.fd = sfd; //event.events = EPOLLIN | EPOLLET; event.events = EPOLLIN; s = epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &event); if (s == -1) { perror(\"epoll_ctl\"); abort(); } /* Buffer where events are returned */ events = calloc(MAXEVENTS, sizeof event); int k; for(k = 0; k < PROCESS_NUM; k++) { int pid = fork(); if(pid == 0) { /* The event loop */ while (1) { int n, i; n = 
epoll_wait(efd, events, MAXEVENTS, -1); printf(\"process %d return from epoll_wait!\\n\", getpid()); /* sleep here is very important!*/ sleep(2); for (i = 0; i < n; i++) { if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) || (!(events[i].events & EPOLLIN))) { /* An error has occured on this fd, or the socket is not ready for reading (why were we notified then?) */ fprintf (stderr, \"epoll error\\n\"); close (events[i].data.fd); continue; } else if (sfd == events[i].data.fd) { /* We have a notification on the listening socket, which means one or more incoming connections. */ struct sockaddr in_addr; socklen_t in_len; int infd; char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV]; in_len = sizeof in_addr; infd = accept(sfd, &in_addr, &in_len); if (infd == -1) { printf(\"process %d accept failed!\\n\", getpid()); break; } printf(\"process %d accept successed!\\n\", getpid()); /* Make the incoming socket non-blocking and add it to the list of fds to monitor. */ close(infd); } } } } } int status; wait(&status); free (events); close (sfd); return EXIT_SUCCESS;} 窗口1:编译运行epooldemo[root@localhost demo]# ./epjqprocess 24197 return from epoll_wait!process 24198 return from epoll_wait!process 24196 return from epoll_wait!process 24195 return from epoll_wait!process 24194 return from epoll_wait!process 24193 return from epoll_wait!process 24192 return from epoll_wait!process 24191 return from epoll_wait!process 24190 return from epoll_wait!process 24189 return from epoll_wait!process 24193 accept successed!process 24194 accept failed!process 24197 accept failed!process 24195 accept failed!process 24192 accept failed!process 24191 accept failed!process 24196 accept failed!process 24198 accept failed!process 24189 accept failed!process 24190 accept failed! 
窗口2:[root@localhost ~]# curl http://127.0.0.1:8888 结论:epoll_wait监听事件时候没有解决惊群问题;所有监听进程均会被打扰惊醒,进行上下文切换后然后进入睡眠。 how to slove ?多进程需要从某一个端口获取连接,为了高性能我们摒弃直接accept而根据具体使用场景选用epool/poll/select等多socket管控机制,这种管控机制会带来惊群问题(具体内容上述实验已经表达)。目前标准解决方案有两种: 1.锁机制(见后文详细说明)2.复用端口 kernel 3.9增加了SO_REUSEPORT socket option,该选项允许服务端socket复用端口,通过hash机制将连接分配客户端到具体的进程;而这一切都是内核在处理。 实例背景:3.10.0-514.16.1.el7.x86_64内核,fork多个进程,通过设置SO_REUSEPORT标记,多进程一起监听端口8888,具体实验demo程序如下:123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160#include <sys/types.h>#include <sys/socket.h>#include <sys/epoll.h>#include <netdb.h>#include <string.h>#include <stdio.h>#include <unistd.h>#include <fcntl.h>#include <stdlib.h>#include <errno.h>#include <sys/wait.h>#define PROCESS_NUM 10static intcreate_and_bind (char *port){ int fd = socket(PF_INET, SOCK_STREAM, 0); int optval = 1; setsockopt(fd,SOL_SOCKET,SO_REUSEPORT,&optval,sizeof(optval)); struct sockaddr_in serveraddr; serveraddr.sin_family = AF_INET; serveraddr.sin_addr.s_addr = htonl(INADDR_ANY); serveraddr.sin_port = htons(atoi(port)); bind(fd, (struct sockaddr*)&serveraddr, sizeof(serveraddr)); return fd;} static intmake_socket_non_blocking (int sfd){ int flags, s; flags = fcntl (sfd, F_GETFL, 0); if (flags == -1) { perror (\"fcntl\"); return -1; } flags |= O_NONBLOCK; s = fcntl (sfd, F_SETFL, flags); if (s == -1) { perror (\"fcntl\"); return -1; } return 0;}#define MAXEVENTS 64static void socket_proc(){ int sfd, s; int efd; struct epoll_event event; struct epoll_event *events; sfd = create_and_bind(\"8888\"); if (sfd == -1) abort (); s = make_socket_non_blocking (sfd); if (s == -1) abort (); s = listen(sfd, SOMAXCONN); if (s == -1) { perror 
(\"listen\"); abort (); } efd = epoll_create(MAXEVENTS); if (efd == -1) { perror(\"epoll_create\"); abort(); } event.data.fd = sfd; //event.events = EPOLLIN | EPOLLET; event.events = EPOLLIN; s = epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &event); if (s == -1) { perror(\"epoll_ctl\"); abort(); } /* Buffer where events are returned */ events = calloc(MAXEVENTS, sizeof event); /* The event loop */ while (1) { int n, i; n = epoll_wait(efd, events, MAXEVENTS, -1); printf(\"process %d return from epoll_wait!\\n\", getpid()); /* sleep here is very important!*/ // sleep(2); for (i = 0; i < n; i++) { if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) || (!(events[i].events & EPOLLIN))) { /* An error has occured on this fd, or the socket is not ready for reading (why were we notified then?) */ fprintf (stderr, \"epoll error\\n\"); close (events[i].data.fd); continue; } else if (sfd == events[i].data.fd) { /* We have a notification on the listening socket, which means one or more incoming connections. */ struct sockaddr in_addr; socklen_t in_len; int infd; char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV]; in_len = sizeof in_addr; infd = accept(sfd, &in_addr, &in_len); if (infd == -1) { printf(\"process %d accept failed!\\n\", getpid()); break; } printf(\"process %d accept successed!\\n\", getpid()); /* Make the incoming socket non-blocking and add it to the list of fds to monitor. */ close(infd); } } } free (events); close (sfd);}intmain (int argc, char *argv[]){ int k; for(k = 0; k < PROCESS_NUM; k++) { int pid = fork(); if(pid == 0) { socket_proc(); } } int status; wait(&status); return EXIT_SUCCESS;} 编译运行:1234567[root@localhost demo]#[root@localhost demo]# gcc epjqreuseport.c -o e.out[root@localhost demo]# ./e.outprocess 31071 return from epoll_wait!process 31071 accept successed!process 31075 return from epoll_wait!process 31075 accept successed! 总结:SO_REUSEPORT允许多进程共同bind同一个端口,内核会按照一定机制分配访问连接到不同的进程 nginx采用epool模型,怎么解决惊群? 如果进程并没有处于过载状态,那么就会去争用锁,当然,实际上是争用监听套接口的监控权. 
争锁成功就会把所有监听套接口加入到自身的事件监控机制里(如果原本不在) 争锁失败就会把监听套接口从自身的事件监控机制里删除(如果原本在) 争抢成功的进程,可以调用epoll_ctl把所有要监听的端口加入该进程的epool事件中,然后epool_wait阻塞及时获取客户端的新建tcp事件,如果获取到相应事件,该进程调用accept正式建立建立连接;然后释放锁。当锁被释放后所有进程可以共同争抢锁了。 也就是说,因为锁的原因,同一时间只能有一个进程拥有监听端口的监控权利(将监听端口放入自己epool中中控制并且调用epoolwait监控新建事件)。这种机制保障了不会有多进程共同拥有套接口的监控权,从而避免了惊群问题。","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"nginx","slug":"nginx","permalink":"http://vcpu.me/tags/nginx/"},{"name":"惊群","slug":"惊群","permalink":"http://vcpu.me/tags/惊群/"}]},{"title":"epool基本用法","slug":"epool","date":"2017-06-28T04:00:00.000Z","updated":"2017-07-04T03:23:55.000Z","comments":true,"path":"epool/","link":"","permalink":"http://vcpu.me/epool/","excerpt":"epool基本概念是什么?改进的pool,一种I/O多路复用技术,可管理大批量文件描述符。 工作原理? 内核中,一切皆文件,epoll向内核注册一个文件系统,用于存储要管理的文件描述符号。调用epoll_create时,会在虚拟文件系统中创建一个file节点服务epool同时也会创建就绪事件list链表。操作系统启动后,会开辟出自己的高速cache,socket问价描述符会以红黑树存入cache,方便查找、插入、删除。 epool_ctl,把socket放到epool文件系统里file对应的红黑树,也会注册一个回调函数,文件描述符有信号后,会调用该组册函数,内核把网卡数据copy到内核中把socket插入就绪列表中。 epoll_wait调用时候,看一眼就绪列表,所以效率很高。监控百万描述符,但是准备就绪fd却很少。","text":"epool基本概念是什么?改进的pool,一种I/O多路复用技术,可管理大批量文件描述符。 工作原理? 
In the kernel, everything is a file. epoll registers a file system with the kernel for storing the file descriptors it manages: when epoll_create is called, a file node serving epoll is created in the virtual file system, and a ready-event list is created at the same time. The monitored socket file descriptors are kept in a red-black tree for fast lookup, insertion, and deletion. epoll_ctl puts a socket into the red-black tree of the file inside the epoll file system and also registers a callback; when the descriptor has activity, the kernel invokes that registered callback, copies the NIC data into the kernel, and inserts the socket into the ready list. An epoll_wait call then only has to glance at the ready list, which is why it is so efficient: it can monitor a million descriptors of which only a handful are ready. Suitable scenarios: epoll is ideal when a huge number of concurrent connections contains only a few active ones, and CPU usage stays very low in that case. Possible drawback: if essentially all sockets are active, for example in a fast LAN environment, epoll may actually be less efficient than select/poll. LT and ET: the difference between LT and ET shows up in epoll_wait. In LT mode, a descriptor that has not been fully handled will be reported again on the next epoll_wait call. The implementation is that the kernel inserts socket events into the ready list, and an epoll_wait call copies the ready descriptors to user space and clears the ready list; in LT mode it additionally checks whether a descriptor still has unhandled data and, if so, puts it back on the ready list (ET does not re-add it). epoll example. The epoll_create function, purpose: create an epoll event manager and return its descriptor. #include <sys/epoll.h> int epoll_create(int size); Parameter: size, the maximum number of fds. Return value: the file descriptor used by epoll; -1 on failure, >= 0 on success. The epoll_ctl function, purpose: control epoll events, adding, modifying, or deleting them. #include <sys/epoll.h> int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); Parameters: epfd, the return value of epoll_create; op, the operation to perform, e.g. registering an event, with possible values EPOLL_CTL_ADD (register), EPOLL_CTL_MOD (modify), EPOLL_CTL_DEL (delete); fd, the file descriptor to add, delete, or modify in the epoll event manager; event, where event.data.fd is the descriptor to handle and event.events = EPOLLIN|EPOLLET. EPOLLIN: the descriptor is readable; EPOLLOUT: the descriptor is writable; EPOLLPRI: the descriptor has urgent data to read; EPOLLERR: an error occurred on the descriptor; EPOLLHUP: the descriptor was hung up; EPOLLET: edge-triggered. Return value: 0 on success, -1 on failure. When successful, epoll_ctl() returns zero. 
When an error occurs, epoll_ctl() returns -1 and errno is set appropriately epool_wait函数用途:返回IO事件就绪的fd12int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); 参数:epfd:epoll_create的返回值events:取出内核结果的事件数组maxevents:要处理的事件数timeout:等待IO发生超时值 -1 阻塞直到有事件 0 非阻塞 >0: 阻塞时间,单位毫秒 epoool函数实例123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161#include <unistd.h>#include <sys/types.h> /* basic system data types */#include <sys/socket.h> /* basic socket definitions */#include <netinet/in.h> /* sockaddr_in{} and other Internet defns */#include <arpa/inet.h> /* inet(3) functions */#include <sys/epoll.h> /* epoll function */#include <fcntl.h> /* nonblocking */#include <sys/resource.h> /*setrlimit */#include <stdlib.h>#include <errno.h>#include <stdio.h>#include <string.h>#define MAXEPOLLSIZE 10000#define MAXLINE 10240int handle(int connfd);int setnonblocking(int sockfd){ if (fcntl(sockfd, F_SETFL, fcntl(sockfd, F_GETFD, 0)|O_NONBLOCK) == -1) { return -1; } return 0;}int main(int argc, char **argv){ int servPort = 8080; int listenq = 1024; int listenfd, connfd, kdpfd, nfds, n, nread, curfds,acceptCount = 0; struct sockaddr_in servaddr, cliaddr; socklen_t socklen = sizeof(struct sockaddr_in); struct epoll_event ev; struct epoll_event events[MAXEPOLLSIZE]; struct rlimit rt; char buf[MAXLINE]; /* 设置每个进程允许打开的最大文件数 */ rt.rlim_max = rt.rlim_cur = MAXEPOLLSIZE; if (setrlimit(RLIMIT_NOFILE, &rt) == -1) { perror(\"setrlimit error\"); return -1; } bzero(&servaddr, sizeof(servaddr)); servaddr.sin_family = AF_INET; servaddr.sin_addr.s_addr = htonl (INADDR_ANY); servaddr.sin_port = htons (servPort); listenfd = socket(AF_INET, SOCK_STREAM, 0); if (listenfd 
== -1) { perror(\"can't create socket file\"); return -1; } int opt = 1; setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)); if (setnonblocking(listenfd) < 0) { perror(\"setnonblock error\"); } if (bind(listenfd, (struct sockaddr *) &servaddr, sizeof(struct sockaddr)) == -1) { perror(\"bind error\"); return -1; } if (listen(listenfd, listenq) == -1) { perror(\"listen error\"); return -1; } /* 创建 epoll 句柄,把监听 socket 加入到 epoll 集合里 */ kdpfd = epoll_create(MAXEPOLLSIZE); ev.events = EPOLLIN | EPOLLET; ev.data.fd = listenfd; if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, listenfd, &ev) < 0) { fprintf(stderr, \"epoll set insertion error: fd=%d\\n\", listenfd); return -1; } curfds = 1; printf(\"epollserver startup,port %d, max connection is %d, backlog is %d\\n\", servPort, MAXEPOLLSIZE, listenq); for (;;) { /* 等待有事件发生 */ nfds = epoll_wait(kdpfd, events, curfds, -1); if (nfds == -1) { perror(\"epoll_wait\"); continue; } /* 处理所有事件 */ for (n = 0; n < nfds; ++n) { if (events[n].data.fd == listenfd) { connfd = accept(listenfd, (struct sockaddr *)&cliaddr,&socklen); if (connfd < 0) { perror(\"accept error\"); continue; } sprintf(buf, \"accept form %s:%d\\n\", inet_ntoa(cliaddr.sin_addr), cliaddr.sin_port); printf(\"%d:%s\", ++acceptCount, buf); if (curfds >= MAXEPOLLSIZE) { fprintf(stderr, \"too many connection, more than %d\\n\", MAXEPOLLSIZE); close(connfd); continue; } if (setnonblocking(connfd) < 0) { perror(\"setnonblocking error\"); } ev.events = EPOLLIN | EPOLLET; ev.data.fd = connfd; if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, connfd, &ev) < 0) { fprintf(stderr, \"add socket '%d' to epoll failed: %s\\n\", connfd, strerror(errno)); return -1; } curfds++; continue; } // 处理客户端请求 if (handle(events[n].data.fd) < 0) { epoll_ctl(kdpfd, EPOLL_CTL_DEL, events[n].data.fd,&ev); curfds--; } } } close(listenfd); return 0;}int handle(int connfd) { int nread; char buf[MAXLINE]; nread = read(connfd, buf, MAXLINE);//读取客户端socket流 if (nread == 0) { printf(\"client close the connection\\n\"); 
close(connfd); return -1; } if (nread < 0) { perror(\"read error\"); close(connfd); return -1; } printf(\"recv:%s\\n\",buf); write(connfd, buf, nread);//响应客户端 printf(\"send:%s\\n\",buf); return 0;}","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"epoll","slug":"epoll","permalink":"http://vcpu.me/tags/epoll/"}]},{"title":"socket读写条件","slug":"socket_rw_proc","date":"2017-06-26T11:10:34.000Z","updated":"2017-06-26T08:55:50.000Z","comments":true,"path":"socket_rw_proc/","link":"","permalink":"http://vcpu.me/socket_rw_proc/","excerpt":"","text":"概念说明1.接收缓存区低水位标记(用于读)和发送缓存区低水位标记(用于写):每个套接字有一个接收低水位和一个发送低水位。他们由select函数使用。接收低水位标记是让select返回”可读”时套接字接收缓冲区中所需的数据量。对于TCP,其默认值为1。发送低水位标记是让select返回”可写”时套接字发送缓冲区中所需的可用空间。对于TCP,其默认值常为2048 通俗的解释一下,缓存区我们当成一个大小为 n bytes的空间,那么:接收区缓存的作用就是,接收对面的数据放在缓存区,供应用程序读。当然了,只有当缓存区可读的数据量(接收低水位标记)到达一定程度(eg:1)的时候,我们才能读到数据,不然不就读不到数据了吗。发送区缓存的作用就是,发送应用程序的数据到缓存区,然后一起发给对面。当然了,只有当缓存区剩余一定空间(发送低水位标记)(eg:2048),你才能写数据进去,不然可能导致空间不够。 2.FIN: (结束标志,Finish)用来结束一个TCP回话.但对应端口仍处于开放状态,准备接收后续数据. 
特别纠正说明: 经过测试在3.10.0-514.16.1.el7.x86_64 内核情况下SO_SNDLOWAT/SO_RCVLOWAT默认发送和接收最低水位均为1,也就是说默认情况下低水位值可以不考虑,系统缓冲区有数据就读有空闲就写 1234567int value =0;int v_len = sizeof(value);getsockopt(sockfd, SOL_SOCKET, SO_SNDLOWAT, (void*)&value, &v_len);printf(\"snd low at value: %d\\n\",value);value = 0;getsockopt(sockfd, SOL_SOCKET, SO_RCVLOWAT, (void*)&value, &v_len);printf(\"rcv low at value: %d\\n\",value); socket可读的条件下列四个条件中的任何一个满足时,socket准备好读:1.socket的接收缓冲区中的数据字节大于等于该socket的接收缓冲区低水位标记的当前大小。对这样的socket的读操作将不阻塞并返回一个大于0的值 {也就是返回准备好读入的数据}。我们可以用SO_RCVLOWATsocket选项来设置该socket的低水位标记。对于TCP和UDP socket而言,其缺省值为1 2.该连接的读这一半关闭{也就是接收了FIN的TCP连接}。对这样的socket的读操作将不阻塞并返回0 3.socket是一个用于监听的socket,并且已经完成的连接数为非0.这样的socket处于可读状态,是因为socket收到了对方的connect请求并完成了三次握手,使监听socket处于可读状态;正常情况下,这样的socket上的accept操作不会阻塞; 4.有一个socket有异常错误条件待处理。对于这样的socket的读操作将不会阻塞,并且返回一个错误-1,errno则设置成明确的错误条件。这些待处理的错误也可通过指定socket选项SO_ERROR调用getsockopt来取得并清除 socket可写的条件1.socket发送缓冲区中可用空间字节数大于等于该socket发送缓冲区低水位大小。写操作不会被阻塞,会直接返回写入的数据大小,SO_SNDLOWAT socket选项设置socket可写低水位大小。经过测试默认大小为1。2.如果你已经关闭连接,或者主动fin半关闭。这种情况下socket再写将要产生SIGPIPE信号,你没有对这个信号处理的话,你的进程将会被关闭。3.有socket异常错误待处理,这种情况下写操作并不会被阻塞会直接返回一个错误-1,想知道错误原因,errno可以帮助你。 参考地址http://blog.csdn.net/szcarewell/article/details/51227540","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"}]},{"title":"socket信号处理","slug":"socket信号处理","date":"2017-06-26T11:10:34.000Z","updated":"2017-06-26T09:01:27.000Z","comments":true,"path":"socket信号处理/","link":"","permalink":"http://vcpu.me/socket信号处理/","excerpt":"socket I/O事件处理以TCP socket为例kernel: 3.10.0-514.16.1.el7.x86_64 socket IO处理函数1234567891011struct sock { ... struct socket_wq __rcu *sk_wq; /*等待队列和异步队列*/ ... 
void (*sk_state_change)(struct sock *sk); void (*sk_data_ready)(struct sock *sk, int bytes); void (*sk_write_space)(struct sock *sk); void (*sk_error_report)(struct sock *sk); int (*sk_backlog_rcv)(struct sock *sk); ...}; sk_wq 含有等待队列用来睡眠唤醒程序使用,异步队列异步socket使用 sk_state_change 从SYN_SEND或者SYN_RECV到ES状态,从ES到CLOSE_WAIT状态,当协议栈遇到这些事件时候会调用 sk_data_ready sk_write_space sock有数据可读和可写时候调用 sk_error_report sock上存在错误时调用,比如收到RST包","text":"socket I/O事件处理以TCP socket为例kernel: 3.10.0-514.16.1.el7.x86_64 socket IO处理函数1234567891011struct sock { ... struct socket_wq __rcu *sk_wq; /*等待队列和异步队列*/ ... void (*sk_state_change)(struct sock *sk); void (*sk_data_ready)(struct sock *sk, int bytes); void (*sk_write_space)(struct sock *sk); void (*sk_error_report)(struct sock *sk); int (*sk_backlog_rcv)(struct sock *sk); ...}; sk_wq 含有等待队列用来睡眠唤醒程序使用,异步队列异步socket使用 sk_state_change 从SYN_SEND或者SYN_RECV到ES状态,从ES到CLOSE_WAIT状态,当协议栈遇到这些事件时候会调用 sk_data_ready sk_write_space sock有数据可读和可写时候调用 sk_error_report sock上存在错误时调用,比如收到RST包 处理函数初始化1234567----------------START------------------------- 0xffffffff81557ed0 : sock_init_data+0x0/0x220 [kernel] 0xffffffff815ec9f4 : inet_create+0x154/0x360 [kernel] 0xffffffff81555200 : __sock_create+0x110/0x260 [kernel] 0xffffffff81556521 : SyS_socket+0x61/0xf0 [kernel] 0xffffffff81697189 : system_call_fastpath+0x16/0x1b [kernel]----------------END------------------------- 步骤1:通用初始化socket->SyS_socket->__sock_create->inet_create->sock_init_data 1234567sock_init_data:sk->sk_state_change = sock_def_wakeup;sk->sk_data_ready = sock_def_readable;sk->sk_write_space = sock_def_write_space;sk->sk_error_report = sock_def_error_report;sk->sk_destruct = sock_def_destruct; 步骤2:对于TCP socket,特有更新1234560xffffffff815be170 : tcp_init_sock+0x0/0x200 [kernel]0xffffffff815d4212 : tcp_v4_init_sock+0x12/0x30 [kernel]0xffffffff815eca71 : inet_create+0x1d1/0x360 [kernel]0xffffffff81555200 : __sock_create+0x110/0x260 [kernel]0xffffffff81556521 : SyS_socket+0x61/0xf0 [kernel]0xffffffff81697189 : 
system_call_fastpath+0x16/0x1b [kernel] inet_create->tcp_v4_init_sock->tcp_init_sock12tcp_init_sock:sk->sk_write_space = sk_stream_write_space; sock_def_wakeup信号触发时机 无论是作为客户端还是服务端socket TCP协议栈进入到ES或者CLOSE_WAIT时候,会触发sock_def_wakeup通知用户态进程TCP状态变更具体来讲:sock_def_wakeup可以唤醒connect或者accept,或者因收到结束喜欢fin而正常结束send/recv返回值为0 作为客户端主动连接对方获取资源访问方式:curl -v http://180.97.33.107 123456789101112131415161718192021[root@localhost socketdemo]# curl -v http://180.97.33.107* About to connect() to 180.97.33.107 port 80 (#0)* Trying 180.97.33.107...* Connected to 180.97.33.107 (180.97.33.107) port 80 (#0)> GET / HTTP/1.1> User-Agent: curl/7.29.0> Host: 180.97.33.107> Accept: */*>< HTTP/1.1 200 OK< Server: bfe/1.0.8.18< Date: Fri, 23 Jun 2017 10:02:40 GMT< Content-Type: text/html< Content-Length: 2381< Last-Modified: Mon, 23 Jan 2017 13:28:20 GMT< Connection: Keep-Alive< ETag: \"588604f4-94d\"< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform< Pragma: no-cache< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/< Accept-Ranges: bytes systemtap探测sock_def_wakeup被调用情况,并打出调用栈,如下: 1234567891011121314151617181920212223242526272829303132333435调用情况:收到对端回复synack后,发出ack时候客户端connect结束,从SYN_SEND跳转到ES状态唤醒用户态进程,此时连接已经成功,可以发送数据了sock_def_wakeup:[2017/6/23,18:00:59]local=10.0.2.15:60162,remote=180.97.33.107:80 state:ESTABLISHED 0xffffffff81558150 : sock_def_wakeup+0x0/0x40 [kernel] 0xffffffff815cbc09 : tcp_finish_connect+0xc9/0x120 [kernel] 0xffffffff815cc297 : tcp_rcv_state_process+0x637/0xf20 [kernel] 0xffffffff815d5ffb : tcp_v4_do_rcv+0x17b/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 
0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00472fc [e1000]WARNING: Missing unwind data for a module, rerun with 'stap -d e1000'情况2:访问的服务端主动关闭连接,则客户端从ES进入CLOSE_WAIT,通知用户态进程sock_def_wakeup:[2017/6/23,18:00:59]local=10.0.2.15:60162,remote=180.97.33.107:80 state:CLOSE_WAIT 0xffffffff81558150 : sock_def_wakeup+0x0/0x40 [kernel] 0xffffffff815c5ca9 : tcp_fin+0x169/0x1e0 [kernel] 0xffffffff815c84f8 : tcp_data_queue+0x7f8/0xdd0 [kernel] 0xffffffff815cb4a7 : tcp_rcv_established+0x217/0x760 [kernel] 0xffffffff815d5f8a : tcp_v4_do_rcv+0x10a/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00472fc [e1000] 作为服务端结论作为服务端堵塞在accept时,收到客户端请求,三次握手建立完成后,服务端状态进入ES状态,会调用sock_def_wakeup通知用户态进程123456789101112131415sock_def_wakeup:[2017/6/26,10:47:00]local=192.168.55.178:8080,remote=192.168.55.165:50536 state:ESTABLISHED 0xffffffff81558150 : sock_def_wakeup+0x0/0x40 [kernel] 0xffffffff815cc3bf : tcp_rcv_state_process+0x75f/0xf20 [kernel] 0xffffffff815d7dde : tcp_child_process+0x3e/0x130 [kernel] 0xffffffff815d60d5 : tcp_v4_do_rcv+0x255/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 
 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00a72fc [e1000] 服务端测试程序如下1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768 //#include <sys/types.h>#include <sys/socket.h>//#include <sys/wait.h>#include <stdio.h>#include <errno.h>#include <string.h>#include <netinet/in.h>#include <fcntl.h>#define SERVPORT 8080#define BACKLOG 10#define MAX_CONNECTED_NO 10#define MAXDATASIZE 100int main(){ struct sockaddr_in server_sockaddr,client_sockaddr; int sin_size,recvbytes,flags; int sockfd,client_fd; char buf[MAXDATASIZE]; if((sockfd = socket(AF_INET,SOCK_STREAM,0))==-1){ perror(\"socket\"); return 0; } printf(\"socket success!,sockfd=%d\\n\",sockfd); server_sockaddr.sin_family=AF_INET; server_sockaddr.sin_port=htons(SERVPORT); server_sockaddr.sin_addr.s_addr=INADDR_ANY; bzero(&(server_sockaddr.sin_zero),8); if(bind(sockfd,(struct sockaddr *)&server_sockaddr,sizeof(struct sockaddr))==-1){ perror(\"bind\"); return 0; } printf(\"bind success!\\n\"); if(listen(sockfd,BACKLOG)==-1){ perror(\"listen\"); return 0; } printf(\"listening....\\n\"); if((flags=fcntl( sockfd, F_GETFL, 0))<0)/*先用F_GETFL取得现有flags,再追加O_ASYNC*/ perror(\"fcntl F_GETFL\"); flags |= O_ASYNC; if(fcntl( sockfd, F_SETFL,flags)<0) perror(\"fcntl\"); while(1){ sin_size=sizeof(struct sockaddr_in); if((client_fd=accept(sockfd,(struct sockaddr*)&client_sockaddr,&sin_size))==-1){ perror(\"accept\"); return 0; } printf(\"%d\\n\",client_sockaddr.sin_port); if((recvbytes=recv(client_fd,buf,MAXDATASIZE,0))==-1){ perror(\"recv\"); return 0; } printf(\"recvbytes: %d %s \\n \",recvbytes,buf); close(client_fd); return 0; }} 运行过程:123456[root@localhost socketdemo]# gcc server.c -o server[root@localhost socketdemo]#[root@localhost socketdemo]# ./serversocket success!,sockfd=3bind success!listening.... 
服务端stap探测脚本123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123[root@localhost stp]# cat socketsingtal.stp%{ #include <linux/tcp.h> #include<linux/rtc.h> #include <net/tcp.h> static const char tcp_state_array[][16] = { \"NULL\", \"ESTABLISHED\", \"SYN_SENT\", \"SYN_RECV\", \"FIN_WAIT1\", \"FIN_WAIT2\", \"TIME_WAIT\", \"CLOSE\", \"CLOSE_WAIT\", \"LAST_ACK\", \"LISTEN\", \"CLOSING\" };%}function get_short_time:string()%{ struct timeval tv; struct rtc_time tm; unsigned long time; do_gettimeofday(&tv); time = tv.tv_sec + 8 * 3600; rtc_time_to_tm(time, &tm); sprintf(STAP_RETVALUE, \"%02d:%02d:%02d\", tm.tm_hour, tm.tm_min, tm.tm_sec);%}function get_full_time:string()%{ struct timeval tv; struct rtc_time tm; unsigned long time; do_gettimeofday(&tv); time = tv.tv_sec + 8 * 3600; rtc_time_to_tm(time, &tm); sprintf(STAP_RETVALUE, \"%d/%d/%d,%02d:%02d:%02d\", tm.tm_year+1900, tm.tm_mon+1, tm.tm_mday, tm.tm_hour, tm.tm_min, tm.tm_sec);%}function get_conn_lifetime:long (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; struct stap_info *info = sk->sk_protinfo; STAP_RETVALUE = jiffies_to_msecs(tcp_time_stamp - info->estab_t);%}function get_conn_data:long (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; struct tcp_sock *tp = tcp_sk(sk); struct stap_info *info = sk->sk_protinfo; u32 len = tp->snd_nxt - info->isn; STAP_RETVALUE = len ? 
len - 1 : len;%}function filter_http_transtime:long (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; struct stap_info *info = sk->sk_protinfo; STAP_RETVALUE = info->http_filter;%}function get_socket_addr:string (sk:long){ laddr = tcpmib_local_addr(sk) lport = tcpmib_local_port(sk) raddr = tcpmib_remote_addr(sk) rport = tcpmib_remote_port(sk) local_addr = sprintf(\"%s:%d\", ip_ntop(htonl(laddr)), lport) remote_addr = sprintf(\"%s:%d\", ip_ntop(htonl(raddr)), rport) return sprintf(\"local=%s,remote=%s\", local_addr, remote_addr)}function get_socket_state:string (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; sprintf(STAP_RETVALUE, \"%s\", tcp_state_array[sk->sk_state]);%}function get_socket_sk_sndbuf:string(sk:long)%{ struct sock *sk=(struct sock*)STAP_ARG_sk; sprintf(STAP_RETVALUE,\"%d:%d\", sk->sk_wmem_queued, sk->sk_sndbuf);%}function socket_state_num2str:string (state:long)%{ sprintf(STAP_RETVALUE, \"%s\", tcp_state_array[STAP_ARG_state]);%}function sshfilter:long(sk:long){ lport = tcpmib_local_port(sk) if(lport == 22) return 1 return 0}probe kernel.function(\"sock_def_wakeup\").call{ if(sshfilter($sk)) next printf(\"sock_def_wakeup:[%s]%s state:%s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk)) print_backtrace()} 运行过程:1[root@localhost stp]# stap -g socketsingtal.stp sock_def_wakeup状态改变事件实现分析12345678910111213141516171819//判断等待队列是否存在进程static inline bool wq_has_sleeper(struct socket_wq *wq){ //同步使用,具体实现未分析 smp_mb(); return wq && waitqueue_active(&wq->wait);}static void sock_def_wakeup(struct sock *sk){ struct socket_wq *wq; rcu_read_lock(); wq = rcu_dereference(sk->sk_wq); //如果等待队列有进程,全部唤醒 if (wq_has_sleeper(wq)) wake_up_interruptible_all(&wq->wait); rcu_read_unlock();} 唤醒进程实现如下wake_up_interruptible_all ->wake_up->wake_up_common比较特殊点是,__wake_up的nr_exclusive为0时候唤醒所有进程。其它说明是:nr_exclusive为1时候,是为了避免惊群设置的,只唤醒一个进程123456789101112#define wake_up_interruptible_all(x) __wake_up(x, TASK_INTERRUPTIBLE, 0, NULL)void __wake_up(wait_queue_head_t 
*q, unsigned int mode, int nr_exclusive, void *key){ unsigned long flags; spin_lock_irqsave(&q->lock, flags); __wake_up_common(q, mode, nr_exclusive, 0, key); spin_unlock_irqrestore(&q->lock, flags);}EXPORT_SYMBOL(__wake_up); __wake_up_common 参数nr_exclusive为0时候,break不可能被执行12345678910111213static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int wake_flags, void *key){ wait_queue_t *curr, *next; list_for_each_entry_safe(curr, next, &q->task_list, task_list) { unsigned flags = curr->flags; if (curr->func(curr, mode, wake_flags, key) && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) break; }} __wake_up_common 中curr->func是什么呢? 是autoremove_wake_function,当socket睡眠时候,会通过DEFINE_WAIT将其设置为autoremove_wake_function12345678#define DEFINE_WAIT_FUNC(name, function)\\ wait_queue_t name = { \\ .private = current, \\ .func = function, \\ .task_list = LIST_HEAD_INIT((name).task_list),\\ }#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function) autoremove_wake_function 干了什么?1:default_wake_function ->try_to_wake_up 把进程状态设置为TASK_RUNNING,并把其插入CPU运行队列,从而唤醒睡眠进程2:待进程状态唤醒后,把等待事件从等待队列中删除 1234567891011121314151617int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key){ int ret = default_wake_function(wait, mode, sync, key); if (ret) list_del_init(&wait->task_list);//等待队列删除 return ret;}EXPORT_SYMBOL(autoremove_wake_function);int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags, void *key){ //把进程状态设置为TASK_RUNNING,并把其插入CPU运行队列,从而唤醒睡眠进程 return try_to_wake_up(curr->private, mode, wake_flags);}EXPORT_SYMBOL(default_wake_function); sock_def_readablesock_def_readable调用时机,sock数据可读会调用此函数唤醒进程 作为服务端123456789101112131415161718192021222324252627282930313233//收到syn包后sock_def_wakeup:[2017/6/26,11:52:07]local=0.0.0.0:8080,remote=0.0.0.0:0 state:LISTEN 0xffffffff81558220 : sock_def_readable+0x0/0x70 [kernel] 0xffffffff815d7eb8 : tcp_child_process+0x118/0x130 [kernel] 0xffffffff815d60d5 : 
tcp_v4_do_rcv+0x255/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00a72fc [e1000]//收到数据包后sock_def_wakeup:[2017/6/26,11:52:07]local=192.168.55.178:8080,remote=192.168.55.165:50843 state:ESTABLISHED 0xffffffff81558220 : sock_def_readable+0x0/0x70 [kernel] 0xffffffff815c8197 : tcp_data_queue+0x497/0xdd0 [kernel] 0xffffffff815cb4a7 : tcp_rcv_established+0x217/0x760 [kernel] 0xffffffff815d5f8a : tcp_v4_do_rcv+0x10a/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00a72fc [e1000][root@localhost stp]# 作为客户端 收到ES状态服务端回复的数据在tcp_data_queue/tcp_rcv_established中调用 具体实现后续分析。123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960sock_def_wakeup:[2017/6/26,13:44:35]local=10.0.2.15:43188,remote=180.97.33.107:80 state:ESTABLISHED 0xffffffff81558220 : sock_def_readable+0x0/0x70 [kernel] 0xffffffff815cb6c3 : tcp_rcv_established+0x433/0x760 [kernel] 0xffffffff815d5f8a : tcp_v4_do_rcv+0x10a/0x340 [kernel] 0xffffffff815d76d9 : 
tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00a72fc [e1000]sock_def_wakeup:[2017/6/26,13:44:35]local=10.0.2.15:43188,remote=180.97.33.107:80 state:ESTABLISHED 0xffffffff81558220 : sock_def_readable+0x0/0x70 [kernel] 0xffffffff815c8197 : tcp_data_queue+0x497/0xdd0 [kernel] 0xffffffff815cb4a7 : tcp_rcv_established+0x217/0x760 [kernel] 0xffffffff815d5f8a : tcp_v4_do_rcv+0x10a/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00a72fc [e1000]sock_def_wakeup:[2017/6/26,13:44:35]local=10.0.2.15:43188,remote=180.97.33.107:80 state:ESTABLISHED 0xffffffff81558220 : sock_def_readable+0x0/0x70 [kernel] 0xffffffff815cb6c3 : tcp_rcv_established+0x433/0x760 [kernel] 0xffffffff815d5f8a : tcp_v4_do_rcv+0x10a/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : 
__netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00a72fc [e1000] 服务端主动关闭连接,作为客户端TCP状态机处于CLOSE_WAITsock_def_wakeup:[2017/6/26,13:44:35]local=10.0.2.15:43188,remote=180.97.33.107:80 state:CLOSE_WAIT 0xffffffff81558220 : sock_def_readable+0x0/0x70 [kernel] 0xffffffff815c8197 : tcp_data_queue+0x497/0xdd0 [kernel] 0xffffffff815cb4a7 : tcp_rcv_established+0x217/0x760 [kernel] 0xffffffff815d5f8a : tcp_v4_do_rcv+0x10a/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00a72fc [e1000] sock_def_readable123456789101112131415161718192021222324252627282930313233343536static void sock_def_readable(struct sock *sk, int len){ struct socket_wq *wq; rcu_read_lock(); wq = rcu_dereference(sk->sk_wq); if (wq_has_sleeper(wq)) //阻塞队列通知 wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI | POLLRDNORM | POLLRDBAND); //异步队列通知 sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN); rcu_read_unlock();}#define wake_up_interruptible_sync_poll(x, m)\\ __wake_up_sync_key((x), TASK_INTERRUPTIBLE, 1, (void *) (m)) void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, void *key){ unsigned long flags; int wake_flags = WF_SYNC; if (unlikely(!q)) return; if (unlikely(!nr_exclusive)) wake_flags = 0; spin_lock_irqsave(&q->lock, flags); __wake_up_common(q, mode, nr_exclusive, wake_flags, key); 
spin_unlock_irqrestore(&q->lock, flags);}EXPORT_SYMBOL_GPL(__wake_up_sync_key); wake_up_interruptible_sync_poll传递的nr_exclusive为1,表示只允许唤醒一个等待进程。 sk_stream_write_spacesk->sk_write_space的实例为sock_def_write_space()。如果socket是SOCK_STREAM类型的,那么函数指针的值会更新为sk_stream_write_space()。sk_stream_write_space()在TCP中的调用路径为:tcp_rcv_established / tcp_rcv_state_process tcp_data_snd_check tcp_check_space tcp_new_space 12345678910111213141516171819202122232425void sk_stream_write_space(struct sock *sk){ struct socket *sock = sk->sk_socket; struct socket_wq *wq; //发送缓存大小,当要发送数据没有到达发送缓存的2/3 下限 //尚未发送数据缓冲区大小数据,没有大于用户设定值sysctl_tcp_notsent_lowat时候才能触发写数据 //针对于下限,要发送数据还不多 //针对于上限,要发送数据已经很多的话,不变要再发了,否则会使用过多内存 if (sk_stream_is_writeable(sk) && sock) { //经过sk_stream_is_writeable判别说明缓冲是足够的SOCK_NOSPACE标记清除 clear_bit(SOCK_NOSPACE, &sock->flags); rcu_read_lock(); wq = rcu_dereference(sk->sk_wq); if (wq_has_sleeper(wq)) //唤醒等待队列一个进程 wake_up_interruptible_poll(&wq->wait, POLLOUT | POLLWRNORM | POLLWRBAND); //异步队列允许发送数据,通知异步队列 if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, SOCK_WAKE_SPACE, POLL_OUT); rcu_read_unlock(); }}","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"socket信号处理","slug":"socket信号处理","permalink":"http://vcpu.me/tags/socket信号处理/"}]},{"title":"select","slug":"select","date":"2017-06-26T09:00:57.000Z","updated":"2017-06-26T09:00:57.000Z","comments":true,"path":"select/","link":"","permalink":"http://vcpu.me/select/","excerpt":"1234567891011#include <sys/time.h>#include <sys/types.h>#include <unistd.h>int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);void FD_CLR(int fd, fd_set *set);int FD_ISSET(int fd, fd_set *set);void FD_SET(int fd, fd_set 
*set);void FD_ZERO(fd_set *set); nfds 是最大文件描述符号 +1 怎么可能这么简单,它限制的是最大值而不是个数 readfds 用来记录可读fd集合 writefds 用来记录可写fd集合 exceptfds 用来检查带外数据 timeout 决定select等待I/O时间 1.timeout该值为NULL,会阻塞一直等到监控的文件描述符集合中产生状态变化(可读,可写等)2.timeout值为0分0毫秒,非阻塞,不关注文件描述符是否变化立刻返回3.timeout正常值,timeout这段时间内阻塞,如果监控集合中有信号来临,select将返回,否则超时返回","text":"1234567891011#include <sys/time.h>#include <sys/types.h>#include <unistd.h>int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);void FD_CLR(int fd, fd_set *set);int FD_ISSET(int fd, fd_set *set);void FD_SET(int fd, fd_set *set);void FD_ZERO(fd_set *set); nfds 是最大文件描述符号 +1 怎么可能这么简单,它限制的是最大值而不是个数 readfds 用来记录可读fd集合 writefds 用来记录可写fd集合 exceptfds 用来检查带外数据 timeout 决定select等待I/O时间 1.timeout该值为NULL,会阻塞一直等到监控的文件描述符集合中产生状态变化(可读,可写等)2.timeout值为0分0毫秒,非阻塞,不关注文件描述符是否变化立刻返回3.timeout正常值,timeout这段时间内阻塞,如果监控集合中有信号来临,select将返回,否则超时返回 返回值: >0 触发信号的fd数目 =0 超时 -1 出错 作用: 用来管理fd集合,实现多fd集合监听操作 select用户态用法123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151#include <stdio.h>#include <stdlib.h>#include <unistd.h>#include <errno.h>#include <string.h>#include <sys/types.h>#include <sys/socket.h>#include <netinet/in.h>#include <arpa/inet.h>#define MYPORT 1234 // the port users will be connecting to#define BACKLOG 5 // how many pending connections queue will hold#define BUF_SIZE 200int fd_A[BACKLOG]; // accepted connection fdint conn_amount; // current connection amountvoid showclient(){ int i; printf(\"client amount: %d\\n\", conn_amount); for (i = 0; i < BACKLOG; i++) { printf(\"[%d]:%d \", i, fd_A[i]); } printf(\"\\n\\n\");}int main(void){ int sock_fd, new_fd; // listen on sock_fd, new connection on new_fd struct sockaddr_in server_addr; // server address 
information struct sockaddr_in client_addr; // connector's address information socklen_t sin_size; int yes = 1; char buf[BUF_SIZE]; int ret; int i; if ((sock_fd = socket(AF_INET, SOCK_STREAM, 0)) == -1) { perror(\"socket\"); exit(1); } if (setsockopt(sock_fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(int)) == -1) { perror(\"setsockopt\"); exit(1); } server_addr.sin_family = AF_INET; server_addr.sin_port = htons(MYPORT); server_addr.sin_addr.s_addr = INADDR_ANY; memset(server_addr.sin_zero, '\\0', sizeof(server_addr.sin_zero)); if (bind(sock_fd, (struct sockaddr *)&server_addr, sizeof(server_addr)) == -1) { perror(\"bind\"); exit(1); } if (listen(sock_fd, BACKLOG) == -1) { perror(\"listen\"); exit(1); } printf(\"listen port %d\\n\", MYPORT); fd_set fdsr; int maxsock; struct timeval tv; conn_amount = 0; sin_size = sizeof(client_addr); maxsock = sock_fd; while (1) { // initialize file descriptor set FD_ZERO(&fdsr); FD_SET(sock_fd, &fdsr); // timeout setting tv.tv_sec = 30; tv.tv_usec = 0; // add active connection to fd set for (i = 0; i < BACKLOG; i++) { if (fd_A[i] != 0) { FD_SET(fd_A[i], &fdsr); } } ret = select(maxsock + 1, &fdsr, NULL, NULL, &tv); if (ret < 0) { perror(\"select\"); break; } else if (ret == 0) { printf(\"timeout\\n\"); continue; } // check every fd in the set for (i = 0; i < conn_amount; i++) { if (FD_ISSET(fd_A[i], &fdsr)) { ret = recv(fd_A[i], buf, sizeof(buf), 0); if (ret <= 0) { // client close printf(\"client[%d] close\\n\", i); close(fd_A[i]); FD_CLR(fd_A[i], &fdsr); fd_A[i] = 0; } else { // receive data if (ret < BUF_SIZE) memset(&buf[ret], '\\0', 1); printf(\"client[%d] send:%s\\n\", i, buf); } } } // check whether a new connection comes if (FD_ISSET(sock_fd, &fdsr)) { new_fd = accept(sock_fd, (struct sockaddr *)&client_addr, &sin_size); if (new_fd <= 0) { perror(\"accept\"); continue; } // add to fd queue if (conn_amount < BACKLOG) { fd_A[conn_amount++] = new_fd; printf(\"new connection client[%d] %s:%d\\n\", conn_amount, 
inet_ntoa(client_addr.sin_addr), ntohs(client_addr.sin_port)); if (new_fd > maxsock) maxsock = new_fd; } else { printf(\"max connections arrive, exit\\n\"); send(new_fd, \"bye\", 4, 0); close(new_fd); break; } } showclient(); } // close other connections for (i = 0; i < BACKLOG; i++) { if (fd_A[i] != 0) { close(fd_A[i]); } } exit(0);} 代码实现采用select用法描述: 1.select作为服务端使用2.select监听服务的fd,如果有客户端连接此服务端时候,服务端fd会被触发,然后调用accept完成连接3.select监听服务端和客户端建立好连接的fd,如果客户端发送数据过来,select可监听到读信号,然后recv读出数据。 select实现分析用户态select 系统调用 sys_select 调用栈如下: 0xffffffff81213f80 : sys_select+0x0/0x110 [kernel] 0xffffffff81697189 : system_call_fastpath+0x16/0x1b [kernel] 实现代码位于:fs/select.c SYSCALL_DEFINE5(select,… select功能概述sys_select实现分析分析结论: sys_select1234567891011121314151617181920212223242526SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp, fd_set __user *, exp, struct timeval __user *, tvp){ struct timespec end_time, *to = NULL; struct timeval tv; int ret; //用户态时间处理,将用户态时间拷入内核态并将参数规整为struct timespec以供调用 if (tvp) { if (copy_from_user(&tv, tvp, sizeof(tv))) return -EFAULT; to = &end_time; if (poll_select_set_timeout(to, tv.tv_sec + (tv.tv_usec / USEC_PER_SEC), (tv.tv_usec % USEC_PER_SEC) * NSEC_PER_USEC)) return -EINVAL; } //select的核心实现 ret = core_sys_select(n, inp, outp, exp, to); //该函数会将剩余的时间拷入到用户态的tvp 中 ret = poll_select_copy_remaining(&end_time, tvp, 1, ret); return ret;} 1.将用户态select时间参数拷入内核2.调用core_sys_select3.将select退出后剩余时间结果拷入用户态时间参数中 core_sys_select1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user *exp, struct timespec *end_time){ fd_set_bits fds; void *bits; int ret, max_fds; unsigned int size; struct fdtable *fdt; /* Allocate small arguments on the stack to save memory and be faster */ long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; //用户态给予参数nfds < 0 ,直接返并报告参数非法 
-EINVAL ret = -EINVAL; if (n < 0) goto out_nofds; /* max_fds can increase, so grab it once to avoid race */ rcu_read_lock(); fdt = files_fdtable(current->files); max_fds = fdt->max_fds; rcu_read_unlock(); if (n > max_fds) n = max_fds; /* * We need 6 bitmaps (in/out/ex for both incoming and outgoing), * since we used fdset we need to allocate memory in units of * long-words. */ //以一个文件描述符占1bit,传递进来的这么多fd共占多数字 size = FDS_BYTES(n); bits = stack_fds; //检查默认静态数据资源是否够用 if (size > sizeof(stack_fds) / 6) { /* Not enough space in on-stack array; must use kmalloc */ ret = -ENOMEM; bits = kmalloc(6 * size, GFP_KERNEL); if (!bits) goto out_nofds; } //fds用来指向具体的存储空间 fds.in = bits; fds.out = bits + size; fds.ex = bits + 2*size; fds.res_in = bits + 3*size; fds.res_out = bits + 4*size; fds.res_ex = bits + 5*size;//将用户空间的inp outp exp 拷入内核空间 if ((ret = get_fd_set(n, inp, fds.in)) || (ret = get_fd_set(n, outp, fds.out)) || (ret = get_fd_set(n, exp, fds.ex))) goto out; //存放返回状态的字段清零,后续可用作返回结果使用 zero_fd_set(n, fds.res_in); zero_fd_set(n, fds.res_out); zero_fd_set(n, fds.res_ex); //select核心逻辑处理函数 ret = do_select(n, &fds, end_time); //存在错误 if (ret < 0) goto out; //超时情况 if (!ret) { ret = -ERESTARTNOHAND; if (signal_pending(current)) goto out; ret = 0; } //把结果集拷入用户空间 if (set_fd_set(n, inp, fds.res_in) || set_fd_set(n, outp, fds.res_out) || set_fd_set(n, exp, fds.res_ex)) ret = -EFAULT;out: //释放辅助内存 if (bits != stack_fds) kfree(bits);out_nofds: return ret;} 1.检验nfds,如果其小于0,参数异常返回;并规整nfds(最大不能超过当前进程的max_fds)2.将用户态fd集合拷入内核态3.运行do_select4.将do_select检测结果拷入用户空间5.释放select运算中辅助内存 do_select123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161int do_select(int n, fd_set_bits *fds, 
struct timespec *end_time){ ktime_t expire, *to = NULL; struct poll_wqueues table; poll_table *wait; int retval, i, timed_out = 0; unsigned long slack = 0; unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0; unsigned long busy_end = 0; rcu_read_lock(); //检查fd对应file状态,且找出最大fd retval = max_select_fd(n, fds); rcu_read_unlock(); if (retval < 0) return retval; n = retval; poll_initwait(&table); wait = &table.pt; //传入的时间为0s 0ms time_out标记为1 这种情况不阻塞直接返回 if (end_time && !end_time->tv_sec && !end_time->tv_nsec) { wait->_qproc = NULL; timed_out = 1; } //正常情况处理。 超时时间转换 if (end_time && !timed_out) slack = select_estimate_accuracy(end_time); retval = 0; for (;;) { unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp; bool can_busy_loop = false; inp = fds->in; outp = fds->out; exp = fds->ex; rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex; //所有监听的fd大循环 for (i = 0; i < n; ++rinp, ++routp, ++rexp) { unsigned long in, out, ex, all_bits, bit = 1, mask, j; unsigned long res_in = 0, res_out = 0, res_ex = 0; //32个文件描述符号,没有任何状态被检测,进入下一轮32个 in = *inp++; out = *outp++; ex = *exp++; all_bits = in | out | ex; if (all_bits == 0) { i += BITS_PER_LONG; continue; } // 这一轮32个fd存在需要检测的状态 for (j = 0; j < BITS_PER_LONG; ++j, ++i, bit <<= 1) { struct fd f; //超过最大待检测fd n直接退出循环 if (i >= n) break; //跳过没有状态检测的fd if (!(bit & all_bits)) continue; f = fdget(i); if (f.file) { const struct file_operations *f_op; f_op = f.file->f_op; //设置fd检测事件掩码,poll相关情况处理 mask = DEFAULT_POLLMASK; if (f_op && f_op->poll) { //设置用户需要探查的标记 wait_key_set(wait, in, out, bit, busy_flag); //获取fd当前对应的信号掩码 mask = (*f_op->poll)(f.file, wait); } fdput(f); //可读 if ((mask & POLLIN_SET) && (in & bit)) { res_in |= bit; retval++; wait->_qproc = NULL; } //可写 if ((mask & POLLOUT_SET) && (out & bit)) { res_out |= bit; retval++; wait->_qproc = NULL; } if ((mask & POLLEX_SET) && (ex & bit)) { res_ex |= bit; retval++; wait->_qproc = NULL; } /* got something, stop busy polling */ if (retval) { can_busy_loop = false; 
busy_flag = 0; /* * only remember a returned * POLL_BUSY_LOOP if we asked for it */ } else if (busy_flag & mask) can_busy_loop = true; } } //将检测结果存下来 if (res_in) *rinp = res_in; if (res_out) *routp = res_out; if (res_ex) *rexp = res_ex; //增加抢占点 该抢占点可达到效果是:判断是否有进程需要抢占当前进程,如果是将立即发生调度 //已经检查过的fd如果此时被唤醒,则会在此产生调度 cond_resched(); } wait->_qproc = NULL; if (retval || timed_out || signal_pending(current)) break; //设备就绪、异常、超时终止等信号触发,直接break,可跳出大循环结束程序 if (table.error) { retval = table.error; break; } /* only if found POLL_BUSY_LOOP sockets && not out of time */ if (can_busy_loop && !need_resched()) { if (!busy_end) { busy_end = busy_loop_end_time(); continue; } if (!busy_loop_timeout(busy_end)) continue; } busy_flag = 0; /* * If this is the first loop and we have a timeout * given, then we convert to ktime_t and set the to * pointer to the expiry value. */ if (end_time && !to) { expire = timespec_to_ktime(*end_time); to = &expire; } //当前用户进程从这里进入睡眠,超时后timed_out 置1 直接退出 if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE, to, slack)) timed_out = 1; } poll_freewait(&table); return retval; } do_select为select的核心实现,其处理过程如下: 1.调用poll_initwait初始化poll_wqueues对象table,包括其成员poll_table; 2.如果用户传入的timeout不为NULL,但是设定的时间为0,那么设置poll_table(即 &table.pt)的_qproc为NULL;当_qproc为NULL时,进程并不会被加到fd的等待队列中。 3.将in,out和exception进行或运算,得到all_bits,然后遍历all_bits中bit为1的fd,根据进程的fd_table查找到file指针filp,然后设置wait的key值(POLLIN_SET,POLLOUT_SET,POLLEX_SET三者的或运算,取决于用户输入),并调用filp->poll(filp, wait),获得返回值mask。 再根据mask值检查该文件是否立即满足条件,如果满足,设置res_in/res_out/res_exception的值,执行retval++, 并设置wait为NULL。 4.在每遍历32(取决于long型整数的位数)个文件后,调用1次cond_resched(),主动寻求调度,可以等待已经遍历过的文件是否有唤醒的; 5.在遍历完所有文件之后,设置wait为NULL,并检查是否有满足条件的文件(retval值是否为0),或者是否超时,或者是否有未决信号,如果有那么直接跳出循环,进入步骤7; 6.否则调用poll_schedule_timeout,使进程进入睡眠,直到超时(如果未设置超时,那么是直接调用的schedule())。如果是超时后进程继续执行,那么设置pwq->triggered为0;如果是被文件对应的驱动程序唤醒的,那么pwq->triggered被设置为1. 
7.最终,函数调用poll_freewait,将本进程从所有文件的等待队列中删掉,并删除分配的poll_table_page对象,回收内存,并返回retval值。 8.拷贝res_in, res_out和res_exception到传入的in, out, exception,并返回ret。 select睡眠过程do_select … 步骤1 poll_initwait(&table); wait = &table.pt;… 步骤2 if (f_op && f_op->poll) { wait_key_set(wait, in, out, bit, busy_flag); //如果是socket此处调用的是sock_poll mask = (*f_op->poll)(f.file, wait);} 步骤3… if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE, to, slack)) 步骤1:初始化table struct poll_wqueues table; void poll_initwait(struct poll_wqueues *pwq){ init_poll_funcptr(&pwq->pt, __pollwait); pwq->polling_task = current; pwq->triggered = 0; pwq->error = 0; pwq->table = NULL; pwq->inline_index = 0;}EXPORT_SYMBOL(poll_initwait); 将当前进程标志current给table让其记录下来 将__pollwait给table->pt->_qproc让其记录下来 步骤2:调用sock_poll最终调用tcp_poll /* No kernel lock held - perfect */static unsigned int sock_poll(struct file *file, poll_table *wait){ unsigned int busy_flag = 0; struct socket *sock; /* * We can't return errors to poll, so it's either yes or no. */ sock = file->private_data; if (sk_can_busy_loop(sock->sk)) { /* this socket can poll_ll so tell the system call */ busy_flag = POLL_BUSY_LOOP; /* once, only if requested by syscall */ if (wait && (wait->_key & POLL_BUSY_LOOP)) sk_busy_loop(sock->sk, 1); } //针对于tcp socket来讲此处调用tcp_poll return busy_flag | sock->ops->poll(file, sock, wait);}/* * Wait for a TCP event. * * Note that we don't need to lock the socket, as the upper poll layers * take care of normal races (between the test and the event) and we don't * go look at any of the socket buffers directly. 
*/unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait){ unsigned int mask; struct sock *sk = sock->sk; const struct tcp_sock *tp = tcp_sk(sk); sock_rps_record_flow(sk); sock_poll_wait(file, sk_sleep(sk), wait); if (sk->sk_state == TCP_LISTEN) return inet_csk_listen_poll(sk); /* Socket is not locked. We are protected from async events * by poll logic and correct handling of state changes * made by other threads is impossible in any case. */ mask = 0; if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) mask |= POLLHUP; if (sk->sk_shutdown & RCV_SHUTDOWN) mask |= POLLIN | POLLRDNORM | POLLRDHUP; /* Connected or passive Fast Open socket? */ if (sk->sk_state != TCP_SYN_SENT && (sk->sk_state != TCP_SYN_RECV || tp->fastopen_rsk != NULL)) { int target = sock_rcvlowat(sk, 0, INT_MAX); if (tp->urg_seq == tp->copied_seq && !sock_flag(sk, SOCK_URGINLINE) && tp->urg_data) target++; /* Potential race condition. If read of tp below will * escape above sk->sk_state, we can be illegally awaken * in SYN_* states. */ if (tp->rcv_nxt - tp->copied_seq >= target) mask |= POLLIN | POLLRDNORM; if (!(sk->sk_shutdown & SEND_SHUTDOWN)) { if (sk_stream_is_writeable(sk)) { mask |= POLLOUT | POLLWRNORM; } else { /* send SIGIO later */ set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); /* Race breaker. If space is freed after * wspace test but before the flags are set, * IO signal will be lost. 
*/ if (sk_stream_is_writeable(sk)) mask |= POLLOUT | POLLWRNORM; } } else mask |= POLLOUT | POLLWRNORM; if (tp->urg_data & TCP_URG_VALID) mask |= POLLPRI; } /* This barrier is coupled with smp_wmb() in tcp_reset() */ smp_rmb(); if (sk->sk_err) mask |= POLLERR; return mask;} 收集信号状态以mask方式返回 调用sock_poll_wait然后poll_wait最终调用_qproc也就是__pollwait __pollwait 123456789101112131415/* Add a new entry */static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p){ struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt); struct poll_table_entry *entry = poll_get_entry(pwq); if (!entry) return; entry->filp = get_file(filp); entry->wait_address = wait_address; entry->key = p->_key; init_waitqueue_func_entry(&entry->wait, pollwake); entry->wait.private = pwq; add_wait_queue(wait_address, &entry->wait);} 为每个fd对应文件分配 poll_table_entry 将fd对应poll_table_entry加入到等待队列中 步骤3: poll_schedule_timeout,作用是使进程进入睡眠,直到超时或者被唤醒 如果超时后进程继续执行设置pwq->triggered为0 如果是被文件对应的驱动程序唤醒pwq->triggered为1 1234567891011121314151617181920212223242526int poll_schedule_timeout(struct poll_wqueues *pwq, int state, ktime_t *expires, unsigned long slack){ int rc = -EINTR; set_current_state(state); if (!pwq->triggered) rc = freezable_schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS); __set_current_state(TASK_RUNNING); /* * Prepare for the next iteration. * * The following set_mb() serves two purposes. First, it's * the counterpart rmb of the wmb in pollwake() such that data * written before wake up is always visible after wake up. * Second, the full barrier guarantees that triggered clearing * doesn't pass event check of the next iteration. Note that * this problem doesn't exist for the first iteration as * add_wait_queue() has full barrier semantics. 
*/ set_mb(pwq->triggered, 0); return rc;} select唤醒过程 0xffffffff81213130 : pollwake+0x0/0x90 [kernel] 0xffffffff810ba628 : __wake_up_common+0x58/0x90 [kernel] 0xffffffff810bc4a4 : __wake_up_sync_key+0x44/0x60 [kernel] 0xffffffff8155825a : sock_def_readable+0x3a/0x70 [kernel] 0xffffffff815c8197 : tcp_data_queue+0x497/0xdd0 [kernel] 0xffffffff815cb4a7 : tcp_rcv_established+0x217/0x760 [kernel] 0xffffffff815d5f8a : tcp_v4_do_rcv+0x10a/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00472fc [e1000] pollwake->__pollwake->default_wake_function->try_to_wake_up try_to_wake_up会把进程的状态设置为TASK_RUNNING,并把进程插入CPU运行队列,来唤醒睡眠的进程 linux select 1024限制魔咒__FD_SETSIZE 默认最大为1024,一个int占用4个byte,也就是32个bit,所以使用了一个大小为32的int数组(共1024个bit)来表示我们要操作的fd,每个bit代表一个handle数值 需要注意的问题是,这里的最大为1024,如果handle数值大于等于1024是不能处理的(而且很容易导致破坏堆栈),不是说可以容纳1024个网络客户端句柄,而是最大的handle数值为1024,再算上系统本身使用的stdout,stdin, stderr默认的3个,因此最多也就是1021个,再算上程序打开的文件句柄等等,实际上使用可能要比1024少上好多。 另外,ulimit对每个进程打开的句柄也有限制。 why 1024 ?内核传参使用的结构体是fd_set SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp, fd_set __user *, exp, struct timeval __user *, tvp){ fd_set是 __kernel_fd_set typedef __kernel_fd_set fd_set; __kernel_fd_set 中fds_bits 最大只能1024 #define __FD_SETSIZE 1024typedef struct { unsigned long fds_bits[__FD_SETSIZE / (8 * sizeof(long))];} __kernel_fd_set; 我该怎么办才能突破1024限制?修改掉此宏重新编译吧,当然还有其他办法,但是没必要这么复杂,直接用poll或者epoll解决吧当然你也可以多进程或者多线程,每个进程/线程 分别select select缺点总结 
select效率低下,用户空间和内核空间来回拷贝,select内部存在进程上下文切换,大型项目不适用可同时监听的文件数量有限,linux平台1024个每次调用select都要遍历完成所有的fd,每隔32个fd需要调度一次多个fd情况下,如果小的fd一直可读,会导致大的fd信号不会被收集到需要在用户态和内核态来回拷贝fd_set,睡眠唤醒机制需要为fd分配poll_table_entry","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"select","slug":"select","permalink":"http://vcpu.me/tags/select/"}]},{"title":"TIME_WAIT状态分析","slug":"TIME_WAIT状态分析","date":"2017-06-23T03:16:59.000Z","updated":"2017-06-23T03:16:59.000Z","comments":true,"path":"TIME_WAIT状态分析/","link":"","permalink":"http://vcpu.me/TIME_WAIT状态分析/","excerpt":"TIME_WAIT状态分析之所以起这样一个题目是因为很久以前我曾经写过一篇介绍TIME_WAIT的文章,不过当时基本属于浅尝辄止,并没深入说明问题的来龙去脉,碰巧这段时间反复被别人问到相关的问题,让我觉得有必要全面总结一下,以备不时之需。 讨论前大家可以拿手头的服务器摸摸底,记住「ss」比「netstat」快:ss -ant | awk 'NR>1 {++s[$1]} END {for(k in s) print k,s[k]}' 更简单方法: cat /proc/net/sockstat","text":"TIME_WAIT状态分析之所以起这样一个题目是因为很久以前我曾经写过一篇介绍TIME_WAIT的文章,不过当时基本属于浅尝辄止,并没深入说明问题的来龙去脉,碰巧这段时间反复被别人问到相关的问题,让我觉得有必要全面总结一下,以备不时之需。 讨论前大家可以拿手头的服务器摸摸底,记住「ss」比「netstat」快:ss -ant | awk 'NR>1 {++s[$1]} END {for(k in s) print k,s[k]}' 更简单方法: cat /proc/net/sockstat 我猜你一定被巨大无比的TIME_WAIT网络连接总数吓到了!以我个人的经验,对于一台繁忙的Web服务器来说,如果主要以短连接为主,那么其TIME_WAIT网络连接总数很可能会达到几万,甚至十几万。虽然一个TIME_WAIT网络连接耗费的资源无非就是一个端口、一点内存,但是架不住基数大,所以这始终是一个需要面对的问题。 TIME_WAIT是什么因为TCP连接是双向的,所以在关闭连接的时候,两个方向各自都需要关闭。先发FIN包的一方执行的是主动关闭;后发FIN包的一方执行的是被动关闭。主动关闭的一方会进入TIME_WAIT状态,并且在此状态停留两倍的MSL时长。穿插一点MSL的知识:MSL指的是报文段的最大生存时间,如果报文段在网络活动了MSL时间,还没有被接收,那么会被丢弃。关于MSL的大小,RFC 793协议中给出的建议是两分钟,不过实际上不同的操作系统可能有不同的设置,以Linux为例,通常是半分钟,两倍的MSL就是一分钟,也就是60秒,并且这个数值是硬编码在内核中的,也就是说除非你重新编译内核,否则没法修改它: #define TCP_TIMEWAIT_LEN (60*HZ) 
如果每秒的连接数是一千的话,那么一分钟就可能会产生六万个TIME_WAIT。为什么主动关闭的一方不直接进入CLOSED状态,而是进入TIME_WAIT状态,并且停留两倍的MSL时长呢?这是因为TCP是建立在不可靠网络上的可靠的协议。例子:主动关闭的一方收到被动关闭的一方发出的FIN包后,回应ACK包,同时进入TIME_WAIT状态,但是因为网络原因,主动关闭的一方发送的这个ACK包很可能延迟,从而触发被动连接一方重传FIN包。极端情况下,这一去一回,就是两倍的MSL时长。如果主动关闭的一方跳过TIME_WAIT直接进入CLOSED,或者在TIME_WAIT停留的时长不足两倍的MSL,那么当被动关闭的一方早先发出的延迟包到达后,就可能出现类似下面的问题: ▪ 旧的TCP连接已经不存在了,系统此时只能返回RST包 ▪ 新的TCP连接被建立起来了,延迟包可能干扰新的连接不管是哪种情况都会让TCP不再可靠,所以TIME_WAIT状态有存在的必要性。 如何控制TIME_WAIT的数量?从前面的描述我们可以得出这样的结论:TIME_WAIT这东西没有的话不行,不过太多可能也是个麻烦事。下面让我们看看有哪些方法可以控制TIME_WAIT数量,这里只说一些常规方法,另外一些诸如SO_LINGER之类的方法太过偏门,略过不谈。ip_conntrack:顾名思义就是跟踪连接。一旦激活了此模块,就能在系统参数里发现很多用来控制网络连接状态超时的设置,其中自然也包括TIME_WAIT:shell> modprobe ip_conntrackshell> sysctl net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait我们可以尝试缩小它的设置,比如十秒,甚至一秒,具体设置成多少合适取决于网络情况而定,当然也可以参考相关的案例。不过就我的个人意见来说,ip_conntrack引入的问题比解决的还多,比如性能会大幅下降,所以不建议使用。 tcp_tw_recycle:顾名思义就是回收TIME_WAIT连接。可以说这个内核参数已经变成了大众处理TIME_WAIT的万金油,如果你在网络上搜索TIME_WAIT的解决方案,十有八九会推荐设置它,不过这里隐藏着一个不易察觉的陷阱:当多个客户端通过NAT方式联网并与服务端交互时,服务端看到的是同一个IP,也就是说对服务端而言这些客户端实际上等同于一个,可惜由于这些客户端的时间戳可能存在差异,于是乎从服务端的视角看,便可能出现时间戳错乱的现象,进而直接导致时间戳小的数据包被丢弃。(tcp_tw_recycle和tcp_timestamps导致connect失败问题。同时开启情况下,60s内同一源ip主机socket 请求中timestamp必须是递增的) tcp_tw_reuse:顾名思义就是复用TIME_WAIT连接。当创建新连接的时候,如果可能的话会考虑复用相应的TIME_WAIT连接。通常认为「tcp_tw_reuse」比「tcp_tw_recycle」安全一些,这是因为一来TIME_WAIT创建时间必须超过一秒才可能会被复用;二来只有连接的时间戳是递增的时候才会被复用。官方文档里是这样说的:如果从协议视角看它是安全的,那么就可以使用。这简直就是外交辞令啊!按我的看法,如果网络比较稳定,比如都是内网连接,那么就可以尝试使用。不过需要注意的是在哪里使用,既然我们要复用连接,那么当然应该在连接的发起方使用,而不能在被连接方使用。举例来说:客户端向服务端发起HTTP请求,服务端响应后主动关闭连接,于是TIME_WAIT便留在了服务端,此类情况使用「tcp_tw_reuse」是无效的,因为服务端是被连接方,所以不存在复用连接一说。让我们延伸一点来看,比如说服务端是PHP,它查询另一个MySQL服务端,然后主动断开连接,于是TIME_WAIT就落在了PHP一侧,此类情况下使用「tcp_tw_reuse」是有效的,因为此时PHP相对于MySQL而言是客户端,它是连接的发起方,所以可以复用连接。说明:如果使用tcp_tw_reuse,请激活tcp_timestamps,否则无效。 tcp_max_tw_buckets:顾名思义就是控制TIME_WAIT总数。官网文档说这个选项只是为了阻止一些简单的DoS攻击,平常不要人为的降低它。如果缩小了它,那么系统会将多余的TIME_WAIT删除掉,日志里会显示:「TCP: time wait bucket table 
overflow」。需要提醒大家的是物极必反,曾经看到有人把「tcp_max_tw_buckets」设置成0,也就是说完全抛弃TIME_WAIT,这就有些冒险了,用一句围棋谚语来说:入界宜缓。…有时候,如果我们换个角度去看问题,往往能得到四两拨千斤的效果。前面提到的例子:客户端向服务端发起HTTP请求,服务端响应后主动关闭连接,于是TIME_WAIT便留在了服务端。这里的关键在于主动关闭连接的是服务端!在关闭TCP连接的时候,先出手的一方注定逃不开TIME_WAIT的宿命,套用一句歌词:把我的悲伤留给自己,你的美丽让你带走。如果客户端可控的话,那么在服务端打开KeepAlive,尽可能不让服务端主动关闭连接,而让客户端主动关闭连接,如此一来问题便迎刃而解了。 原文链接:https://huoding.com/2013/12/31/316","categories":[{"name":"TCP","slug":"TCP","permalink":"http://vcpu.me/categories/TCP/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"}]},{"title":"Linux系统调用","slug":"Linux系统调用","date":"2017-06-22T10:10:14.000Z","updated":"2017-06-22T10:10:14.000Z","comments":true,"path":"Linux系统调用/","link":"","permalink":"http://vcpu.me/Linux系统调用/","excerpt":"","text":"什么是系统调用 linux虚拟地址空间分为用户空间和内核空间 用户空间不可直接访问内核空间,帝王般的内核空间可直接访问用户空间 用户空间只能通过系统调用访问内核空间 系统调用是内核提供的一组函数接口,使得用户空间上进程可以和内核空间交互 系统调用过程 执行用户程序 根据glibc中实现,取得系统调用号,将其存入EAX并执行int $0x80(128号中断) 用户态可以传递变量、参数值给内核,内核态运行时候会保存用户进程的一些寄存器值等(上下文环境) 触发中断后内核根据系统调用号执行对应的中断处理函数 系统调用结束将返回值存入EAX,返回中断处理函数 中断处理函数根据存储的用户态进程上下文环境恢复用户态,同时用户态就获取了内核态函数执行的返回值 系统调用汇编 // pid = fork();asm volatile( \"mov $0x2, %%eax\\n\\t\" // 将fork的系统调用号2存到eax寄存器 \"int $0x80\\n\\t\" // 产生int 0x80中断 \"mov %%eax,%0\\n\\t\" // 将结果存入pid中 : \"=m\" (pid)); 系统调用实现分析待续 添加系统调用待续","categories":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/categories/linux/"}],"tags":[{"name":"系统调用","slug":"系统调用","permalink":"http://vcpu.me/tags/系统调用/"}]},{"title":"tcp socket发送缓冲区","slug":"tcp_sndbuf","date":"2017-06-20T10:31:48.000Z","updated":"2017-06-20T10:31:48.000Z","comments":true,"path":"tcp_sndbuf/","link":"","permalink":"http://vcpu.me/tcp_sndbuf/","excerpt":"tcp socket发送缓冲区探究结论 1: 未设置SO_SNDBUF时,sk->sk_sndbuf值由tcp_finish_connect->tcp_init_buffer_space->tcp_sndbuf_expand决定,TCP协议栈会自己计算一个值出来46080,sk_sndbuf是46080和net.ipv4.tcp_wmem[2](4194304)的较小值 2: 设置SO_SNDBUF后,tcp_sndbuf_expand将不会再被调用,其值情况完全由sock_setsockopt决定 2-1: 设置值较小 value < 2304 { SOCK_MIN_SNDBUF(4608)/2 } sk_sndbuf = 4608 
2-2: 设置值适中 { SOCK_MIN_SNDBUF(4608)/2 } < value < net.core.wmem_max sk_sndbuf = value*2 2-3: 设置值较大 value > net.core.wmem_max sk_sndbuf = net.core.wmem_max* 2","text":"tcp socket发送缓冲区探究结论 1: 未设置SO_SNDBUF时,sk->sk_sndbuf值由tcp_finish_connect->tcp_init_buffer_space->tcp_sndbuf_expand决定,TCP协议栈会自己计算一个值出来46080,sk_sndbuf是46080和net.ipv4.tcp_wmem[2](4194304)的较小值 2: 设置SO_SNDBUF后,tcp_sndbuf_expand将不会再被调用,其值情况完全由sock_setsockopt决定 2-1: 设置值较小 value < 2304 { SOCK_MIN_SNDBUF(4608)/2 } sk_sndbuf = 4608 2-2: 设置值适中 { SOCK_MIN_SNDBUF(4608)/2 } < value < net.core.wmem_max sk_sndbuf = value*2 2-3: 设置值较大 value > net.core.wmem_max sk_sndbuf = net.core.wmem_max* 2 默认情况下(未设置SO_SNDBUF)net.core.wmem_default = 212992net.core.wmem_max = 212992net.ipv4.tcp_wmem = 4096 16384 4194304 TCP socket未connect之前 sendbuf:16384 sk->sk_sndbuf是sysctl_tcp_wmem[1]的值 connect之后,sendbuf:46080 通过调试机制可知,sendbuf默认大小为sysctl_tcp_wmem[1],即16384;connect连接到服务端后,sendbuf变为46080,该值不是上述配置中任何一个值 原因探究阶段1:tcp_init_sock初始化,sk->sk_sndbuf = sysctl_tcp_wmem[1] 阶段2:主动连接进入ES状态时候,状态切换时候调用tcp_sndbuf_expand调整sk_sndbuf stp脚本探测结果如下:tcp_v4_connect[2017/6/20,10:57:56]local=0.0.0.0:3000,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:16384tcp_v4_connect return [2017/6/20,10:57:56]local=192.168.55.178:3000,remote=180.97.33.108:80 state:SYN_SENT,sndbubf 1280:16384tcp_input:302 return [2017/6/20,10:57:56]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHED,sndbubf 0:16384 sndmem : 46080 permss 2304 0xffffffff815c3527 : tcp_sndbuf_expand+0x67/0x90 [kernel] 0xffffffff815c7ba8 : tcp_init_buffer_space+0x178/0x190 [kernel] 0xffffffff815cbbae : tcp_finish_connect+0x6e/0x120 [kernel] 0xffffffff815cc297 : tcp_rcv_state_process+0x637/0xf20 [kernel] 0xffffffff815d5ffb : tcp_v4_do_rcv+0x17b/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 
[kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : __netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : __netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00472fc [e1000] static void tcp_sndbuf_expand(struct sock *sk){ const struct tcp_sock *tp = tcp_sk(sk); int sndmem, per_mss; u32 nr_segs; /* Worst case is non GSO/TSO : each frame consumes one skb * and skb->head is kmalloced using power of two area of memory */ per_mss = max_t(u32, tp->rx_opt.mss_clamp, tp->mss_cache) + MAX_TCP_HEADER + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); per_mss = roundup_pow_of_two(per_mss) + SKB_DATA_ALIGN(sizeof(struct sk_buff)); nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd); nr_segs = max_t(u32, nr_segs, tp->reordering + 1); /* Fast Recovery (RFC 5681 3.2) : * Cubic needs 1.7 factor, rounded to 2 to include * extra cushion (application might react slowly to POLLOUT) */ sndmem = 2 * nr_segs * per_mss; if (sk->sk_sndbuf < sndmem) sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);} 设置发送缓冲区大小为较小值 socklen_t sendbuflen = 0;socklen_t len = sizeof(sendbuflen);getsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);printf(\"default,sendbuf:%d\\n\", sendbuflen);sendbuflen = 100;setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, len);getsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);printf(\"now,sendbuf:%d\\n\", sendbuflen); 输出信息如下:default,sendbuf:16384now,sendbuf:4608 输出信息总结:设置sendbuf为100时,没有生效,反而设置出来一个较大的值4608 原因探究 /* * This is meant for all protocols to use and covers goings on * at the socket level. Everything here is generic. 
*/int sock_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen){ ··· case SO_SNDBUF: /* Don't error on this BSD doesn't and if you think * about it this is right. Otherwise apps have to * play 'guess the biggest size' games. RCVBUF/SNDBUF * are treated in BSD as hints */ val = min_t(u32, val, sysctl_wmem_max);set_sndbuf: sk->sk_userlocks |= SOCK_SNDBUF_LOCK; sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF); /* Wake up sending tasks if we upped the value. */ sk->sk_write_space(sk); break; ··· default: ret = -ENOPROTOOPT; break; } release_sock(sk); return ret;}#define TCP_SKB_MIN_TRUESIZE (2048 + SKB_DATA_ALIGN(sizeof(struct sk_buff)))#define SOCK_MIN_SNDBUF (TCP_SKB_MIN_TRUESIZE * 2) 设置socket选项SO_SNDBUF会触发系统调用最终调用sock_setsockopt函数,其处理设置选项过程如上:其会将用户设置的缓冲区大小乘以2,然后和SOCK_MIN_SNDBUF(4608)比较,取较大值因此最终较小的设置值100(乘2后为200)没有生效,生效的是4608 设置发送缓冲区大小为中间值缓冲区系统设置值大小:net.core.wmem_max = 212992net.ipv4.tcp_wmem = 4096 16384 4194304 实验动作将缓冲区大小设置为3000 socklen_t sendbuflen = 0;socklen_t len = sizeof(sendbuflen);getsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);printf(\"default,sendbuf:%d\\n\", sendbuflen);sendbuflen = 3000;setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, len);getsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);printf(\"now,sendbuf:%d\\n\", sendbuflen); 实验程序输出:default,sendbuf:16384now,sendbuf:6000输出信息总结:设置大小3000生效,sndbuf大小会被设置成为3000*2 设置发送缓冲区大小为较大值缓冲区系统设置值大小:net.core.wmem_max = 212992net.ipv4.tcp_wmem = 4096 16384 4194304 实验动作将缓冲区大小设置为230000 socklen_t sendbuflen = 0;socklen_t len = sizeof(sendbuflen);getsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);printf(\"default,sendbuf:%d\\n\", sendbuflen);sendbuflen = 230000;setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, len);getsockopt(fd, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);printf(\"now,sendbuf:%d\\n\", sendbuflen); 
实验程序输出:default,sendbuf:16384now,sendbuf:425984实验结果分析:设置大小230000(大于系统212992),sendbuf最终结果为212992*2 原因探究 case SO_SNDBUF: val = min_t(u32, val, sysctl_wmem_max);set_sndbuf: sk->sk_userlocks |= SOCK_SNDBUF_LOCK; sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF); /* Wake up sending tasks if we upped the value. */ sk->sk_write_space(sk); break; val为用户set的值,其在选择时候会同sysctl_wmem_max比较,选取一个较小的值,如果设置值大于sysctl_wmem_max值的话,val就取系统wmem的最大值。 如上可知:230000 > net.core.wmem_max ,所以用户设置SO_SNDBUF选项最大只能取net.core.wmem_max,所以最终sk_sndbuf值为net.core.wmem_max*2 即425984 其它说明tcp socket记录当前发送队列占用缓冲区大小的变量为sk_wmem_queued,发送缓冲区判断函数如下: static inline bool sk_stream_memory_free(const struct sock *sk){ if (sk->sk_wmem_queued >= sk->sk_sndbuf) return false; return sk->sk_prot->stream_memory_free ? sk->sk_prot->stream_memory_free(sk) : true;} 从上述判别中我们可以知道,发送缓冲区记录和比对单位均是字节","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"socket","slug":"socket","permalink":"http://vcpu.me/tags/socket/"}]},{"title":"PF_INET和AF_INET区别","slug":"pf_inet","date":"2017-06-20T10:12:51.000Z","updated":"2017-06-20T10:12:51.000Z","comments":true,"path":"pf_inet/","link":"","permalink":"http://vcpu.me/pf_inet/","excerpt":"","text":"PF_INET和AF_INET区别 在初始化socket时候socket(PF_INET,SOCK_STREAM,0) 用PF_INET,表示ip协议 指定地址协议族时候用AF_INET,表示地址为IP协议 Linux AF_INET和PF_INET值相同均为2 socket.h#define AF_INET 2 /* Internet IP Protocol *...#define PF_INET AF_INET socket通信协议类型 SOCK_STREAM: 面向连接TCP SOCK_DGRAM: 
无保障UDP","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"PF_INET","slug":"PF-INET","permalink":"http://vcpu.me/tags/PF-INET/"},{"name":"AF_INET","slug":"AF-INET","permalink":"http://vcpu.me/tags/AF-INET/"}]},{"title":"sockaddr_in和sockaddr的区别","slug":"sockaddr_in","date":"2017-06-20T07:18:11.000Z","updated":"2017-06-20T07:18:11.000Z","comments":true,"path":"sockaddr_in/","link":"","permalink":"http://vcpu.me/sockaddr_in/","excerpt":"sockaddr_in在头文件/usr/include/netinet/in.h123456789101112131415161718192021222324252627282930/* Structure describing an Internet socket address. */struct sockaddr_in { __SOCKADDR_COMMON (sin_); in_port_t sin_port; /* Port number. */ struct in_addr sin_addr; /* Internet address. */ /* Pad to size of `struct sockaddr'. */ unsigned char sin_zero[sizeof (struct sockaddr) - __SOCKADDR_COMMON_SIZE - sizeof (in_port_t) - sizeof (struct in_addr)]; }; or struct sockaddr_in { short int sin_family; /* Address family */ unsigned short int sin_port; /* Port number */ struct in_addr sin_addr; /* Internet address */ unsigned char sin_zero[8]; /* Same size as struct sockaddr */}; struct in_addr { union { struct { u_char s_b1,s_b2,s_b3,s_b4; } S_un_b; struct { u_short s_w1,s_w2; } S_un_w; u_long S_addr; } S_un; #define s_addr S_un.S_addr }; 组成包含协议家族、端口、地址、填充 端口和地址,需要是网络字节序号 inet_addr(“127.0.0.1”)把字符串点分十进制地址按照网络字节序转换为4字节的地址","text":"sockaddr_in在头文件/usr/include/netinet/in.h123456789101112131415161718192021222324252627282930/* Structure describing an Internet socket address. */struct sockaddr_in { __SOCKADDR_COMMON (sin_); in_port_t sin_port; /* Port number. */ struct in_addr sin_addr; /* Internet address. */ /* Pad to size of `struct sockaddr'. 
*/ unsigned char sin_zero[sizeof (struct sockaddr) - __SOCKADDR_COMMON_SIZE - sizeof (in_port_t) - sizeof (struct in_addr)]; }; or struct sockaddr_in { short int sin_family; /* Address family */ unsigned short int sin_port; /* Port number */ struct in_addr sin_addr; /* Internet address */ unsigned char sin_zero[8]; /* Same size as struct sockaddr */}; struct in_addr { union { struct { u_char s_b1,s_b2,s_b3,s_b4; } S_un_b; struct { u_short s_w1,s_w2; } S_un_w; u_long S_addr; } S_un; #define s_addr S_un.S_addr }; 组成包含协议家族、端口、地址、填充 端口和地址,需要是网络字节序 inet_addr(\"127.0.0.1\")把字符串点分十进制地址按照网络字节序转换为4字节的地址 sockaddr为通用的socket地址 struct sockaddr { unsigned short sa_family; // address family, AF_INET char sa_data[14]; // 14 bytes of protocol address }; bind、connect、recv、send等socket函数的地址参数使用的就是这个结构体","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"sockaddr_in","slug":"sockaddr-in","permalink":"http://vcpu.me/tags/sockaddr-in/"},{"name":"sockaddr","slug":"sockaddr","permalink":"http://vcpu.me/tags/sockaddr/"}]},{"title":"systemtap使用调试记录(二)","slug":"socket_stp","date":"2017-06-20T06:52:23.000Z","updated":"2017-06-20T06:52:23.000Z","comments":true,"path":"socket_stp/","link":"","permalink":"http://vcpu.me/socket_stp/","excerpt":"socket sendbuf探究stp脚本 centos7 3.10.0-514.16.1.el7.x86_64 该systemtap脚本是在调用协议栈sk->sk_sndbuf可能改变的位置增加探测点,用于探究snd_buf变更规律","text":"socket sendbuf探究stp脚本 centos7 3.10.0-514.16.1.el7.x86_64 该systemtap脚本是在调用协议栈sk->sk_sndbuf可能改变的位置增加探测点,用于探究snd_buf变更规律 
脚本socket.stp123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217%{ #include <linux/tcp.h> #include<linux/rtc.h> #include <net/tcp.h> static const char tcp_state_array[][16] = { \"NULL\", \"ESTABLISHED\", \"SYN_SENT\", \"SYN_RECV\", \"FIN_WAIT1\", \"FIN_WAIT2\", \"TIME_WAIT\", \"CLOSE\", \"CLOSE_WAIT\", \"LAST_ACK\", \"LISTEN\", \"CLOSING\" };%}function get_short_time:string()%{ struct timeval tv; struct rtc_time tm; unsigned long time; do_gettimeofday(&tv); time = tv.tv_sec + 8 * 3600; rtc_time_to_tm(time, &tm); sprintf(STAP_RETVALUE, \"%02d:%02d:%02d\", tm.tm_hour, tm.tm_min, tm.tm_sec);%}function get_full_time:string()%{ struct timeval tv; struct rtc_time tm; unsigned long time; do_gettimeofday(&tv); time = tv.tv_sec + 8 * 3600; rtc_time_to_tm(time, &tm); sprintf(STAP_RETVALUE, \"%d/%d/%d,%02d:%02d:%02d\", tm.tm_year+1900, tm.tm_mon+1, tm.tm_mday, tm.tm_hour, tm.tm_min, tm.tm_sec);%}function get_conn_lifetime:long (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; struct stap_info *info = sk->sk_protinfo; STAP_RETVALUE = jiffies_to_msecs(tcp_time_stamp - info->estab_t);%}function get_conn_data:long (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; struct tcp_sock *tp = tcp_sk(sk); struct stap_info *info = sk->sk_protinfo; u32 len = tp->snd_nxt - info->isn; STAP_RETVALUE = len ? 
len - 1 : len;%}function filter_http_transtime:long (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; struct stap_info *info = sk->sk_protinfo; STAP_RETVALUE = info->http_filter;%}function get_socket_addr:string (sk:long){ laddr = tcpmib_local_addr(sk) lport = tcpmib_local_port(sk) raddr = tcpmib_remote_addr(sk) rport = tcpmib_remote_port(sk) local_addr = sprintf(\"%s:%d\", ip_ntop(htonl(laddr)), lport) remote_addr = sprintf(\"%s:%d\", ip_ntop(htonl(raddr)), rport) return sprintf(\"local=%s,remote=%s\", local_addr, remote_addr)}function get_socket_state:string (sk:long)%{ struct sock *sk = (struct sock *)STAP_ARG_sk; sprintf(STAP_RETVALUE, \"%s\", tcp_state_array[sk->sk_state]);%}function get_socket_sk_sndbuf:string(sk:long)%{ struct sock *sk=(struct sock*)STAP_ARG_sk; sprintf(STAP_RETVALUE,\"%d:%d\", sk->sk_wmem_queued, sk->sk_sndbuf);%}function socket_state_num2str:string (state:long)%{ sprintf(STAP_RETVALUE, \"%s\", tcp_state_array[STAP_ARG_state]);%}function sshfilter:long(sk:long){ lport = tcpmib_local_port(sk) if(lport == 22) return 1 return 0}probe kernel.function(\"tcp_send_ack\").call{ if(sshfilter($sk)) next printf(\"tcp_send_ack[%s]%s state:%s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk));}probe kernel.function(\"tcp_sendmsg\").call{ if(sshfilter($sk)) next printf(\"tcp_sendmsg[%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.function(\"tcp_sendmsg\").return{ if(sshfilter($sk)) next printf(\"tcp_sendmsg return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.statement(\"*@net/core/sock.c:711\"){ if(sshfilter($sk)) next printf(\"sock:711 return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.statement(\"*@net/core/sock.c:715\"){ if(sshfilter($sk)) next printf(\"sock:715 return [%s]%s 
state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.statement(\"*@net/ipv4/ip_output.c:1581\"){ if(sshfilter($sk)) next printf(\"ip_output:1581 return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.statement(\"*@net/ipv4/ip_output.c:1583\"){ if(sshfilter($sk)) next printf(\"ip_output:1583 return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.statement(\"*@net/ipv4/tcp_input.c:304\"){ if(sshfilter($sk)) next printf(\"tcp_input:304 return [%s]%s state:%s,sndbubf %s sndmem : %d permss %d\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk),$sndmem,$per_mss); print_backtrace()}probe kernel.function(\"__sk_mem_schedule\").call{ if(sshfilter($sk)) next printf(\"__sk_mem_schedule[%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.function(\"__sk_mem_schedule\").return{ if(sshfilter($sk)) next printf(\"__sk_mem_schedule return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.function(\"sk_page_frag_refill\").call{ if(sshfilter($sk)) next printf(\"sk_page_frag_refill[%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.function(\"sk_page_frag_refill\").return{ if(sshfilter($sk)) next printf(\"sk_page_frag_refill return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.function(\"sk_stream_alloc_skb\").call{ if(sshfilter($sk)) next printf(\"sk_stream_alloc_skb[%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe 
kernel.function(\"sk_stream_alloc_skb\").return{ if(sshfilter($sk)) next printf(\"sk_stream_alloc_skb return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.function(\"tcp_v4_connect\").call{ if(sshfilter($sk)) next printf(\"tcp_v4_connect[%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));}probe kernel.function(\"tcp_v4_connect\").return{ if(sshfilter($sk)) next printf(\"tcp_v4_connect return [%s]%s state:%s,sndbubf %s\\n\",get_full_time(),get_socket_addr($sk),get_socket_state($sk),get_socket_sk_sndbuf($sk));} 执行步骤stap -g socket.stp 执行结果123456789101112131415161718192021222324252627282930313233[root@localhost stp]# stap -g socket.stpWARNING: Eliding unused function 'filter_http_transtime': identifier 'filter_http_transtime' at socket.stp:68:10 source: function filter_http_transtime:long (sk:long) ^WARNING: Eliding unused function 'get_conn_data': identifier 'get_conn_data' at :58:10 source: function get_conn_data:long (sk:long) ^WARNING: Eliding unused function 'get_conn_lifetime': identifier 'get_conn_lifetime' at :51:10 source: function get_conn_lifetime:long (sk:long) ^WARNING: Eliding unused function 'get_short_time': identifier 'get_short_time' at :22:10 source: function get_short_time:string() ^WARNING: Eliding unused function 'socket_state_num2str': identifier 'socket_state_num2str' at :104:10 source: function socket_state_num2str:string (state:long) ^sock:711 return [2017/6/20,14:42:35]local=0.0.0.0:0,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:16384sock:715 return [2017/6/20,14:42:35]local=0.0.0.0:0,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:32768tcp_v4_connect[2017/6/20,14:42:35]local=0.0.0.0:3000,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:32768tcp_v4_connect return [2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:SYN_SENT,sndbubf 
1280:32768tcp_send_ack[2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHEDtcp_sendmsg[2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHED,sndbubf 0:32768sk_stream_alloc_skb[2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHED,sndbubf 0:32768sk_stream_alloc_skb return [2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHED,sndbubf 0:32768tcp_sendmsg return [2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHED,sndbubf 2304:32768tcp_send_ack[2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHEDtcp_send_ack[2017/6/20,14:42:35]local=192.168.55.178:3000,remote=180.97.33.108:80 state:ESTABLISHEDip_output:1583 return [2017/6/20,14:42:35]local=0.0.0.0:6,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:212992ip_output:1581 return [2017/6/20,14:42:35]local=0.0.0.0:6,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:212992ip_output:1583 return [2017/6/20,14:42:35]local=0.0.0.0:6,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:212992ip_output:1581 return [2017/6/20,14:42:35]local=0.0.0.0:6,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:212992ip_output:1583 return [2017/6/20,14:42:35]local=0.0.0.0:6,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:212992ip_output:1581 return [2017/6/20,14:42:35]local=0.0.0.0:6,remote=0.0.0.0:0 state:CLOSE,sndbubf 0:212992","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"socket","slug":"socket","permalink":"http://vcpu.me/tags/socket/"},{"name":"systemtap","slug":"systemtap","permalink":"http://vcpu.me/tags/systemtap/"}]},{"title":"socket 
send","slug":"socketsend1","date":"2017-06-19T04:43:31.000Z","updated":"2017-06-19T04:43:31.000Z","comments":true,"path":"socketsend1/","link":"","permalink":"http://vcpu.me/socketsend1/","excerpt":"用户态发送函数列表ssize_t send(int sockfd, const void *buf, size_t len, int flags);ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen);ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags); ssize_t write(int fd, const void *buf, size_t count);","text":"用户态发送函数列表ssize_t send(int sockfd, const void *buf, size_t len, int flags);ssize_t sendto(int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen);ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags); ssize_t write(int fd, const void *buf, size_t count); 发送函数之间差别 send 有连接协议发送数据使用,send第四个参数flags为0时候,等价于write send(sockfd, buf, len, 0) 等价 write(sockfd, buf, len) send是sendto一部分,send可被sendto替换 send(sockfd, buf, len, flags) 等价于 sendto(sockfd, buf, len, flags, NULL, 0) sendto 无连接和有连接发包都可以使用 sendmsg 可替换上述所有的发包函数 struct msghdr { void *msg_name; /* optional address */ socklen_t msg_namelen; /* size of address */ struct iovec *msg_iov; /* scatter/gather array */ size_t msg_iovlen; /* # elements in msg_iov */ void *msg_control; /* ancillary data, see below */ size_t msg_controllen; /* ancillary data buffer len */ int msg_flags; /* flags (unused) */ }; /proc/sys/net/core/optmem_max可控制每个socket的msg_control大小 sendmsg不使用msg_flags参数 send发包过程概述 阻塞模式下 调用send函数时候,比较要发送数据和套接字发送缓冲区长度(net.ipv4.tcp_wmem);如果发送缓冲区较小,函数直接返回SOCKET_ERR; if send_len <= tcp_wmem{ if is sending{ wait if network err return SOCKET_ERR } else{ if len > tcp_wmem left{ wait if network err return SOCKET_ERR } 
else{ copy data to tcp buf if copy err return SOCKET_ERR return copy data size } }} 剩余缓冲区能容纳发送数据,则直接将数据拷贝到缓冲区中,send直接返回。如果剩余缓冲区不足,发送端阻塞等待,对端在协议栈层接收到数据后会发送ack确认,发送方接收到ack后释放缓冲区空间;如果此时剩余缓冲区大小可放置要发送数据,则直接将数据拷入缓冲区,返回。 Tips:阻塞模式下,数据发送正常,其返回的数据长度一定是发送数据的长度。 非阻塞模式下 send函数将数据拷入协议栈缓冲区,如果缓冲区不足,则send尽力拷贝,并返回拷贝大小;如果缓冲区满则返回-1,同时errno为EAGAIN,让发送端再次尝试发送数据。 发送缓冲区设置socklen_t sendbuflen = 0; socklen_t len = sizeof(sendbuflen); getsockopt(clientSocket, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len); printf("default,sendbuf:%d\\n", sendbuflen); sendbuflen = 10240; setsockopt(clientSocket, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, len); getsockopt(clientSocket, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len); printf("now,sendbuf:%d\\n", sendbuflen); send发包实例解析实际socket使用过程中,常用的是非阻塞模式,我们就以非阻塞模式为例进行分析,预设多种场景如下: 场景1:发送端10k数据已经安全放入缓冲区,已实际发出2k(收到对端ack),接收端正在处理数据,此时发送端因为10k数据发送完毕,关闭了socket。 场景分析: 发送端关闭socket,主动fin告诉对端发送端数据发送完毕想关闭TCP连接,发送完fin后发送端处于fin wait1状态等待接收端ack确认;发送端协议栈剩余8k数据依然在独立发送,待数据发送完成后,协议栈才会把fin发给接收端;接收端在接收ack完10k数据后,且收到fin信号后,接收端回复ack确认fin信号,两者协商关闭socket。 场景2:发送端预期发送10k数据,已将2k数据拷入缓冲区并实际发出拷入的2k数据(收到对端ack),接收端正在处理数据,此时发送端又发送了8k新数据;(缓冲区充足(8k新数据会被拷入缓冲区)情况我们不讨论)缓冲区不足时候会发生什么? 场景分析 新发送的8k数据会尽力拷入缓冲区,send返回拷入缓冲区数据长度2k,如果此时缓冲区剩余空间为0时候,客户端强制send数据,会收到EAGAIN信号;其实这种情况客户端正确处理方式是读出缓冲区可写信号再发送数据,而不是自己进行发送尝试。 场景3:发送端10k数据已经安全放入缓冲区,已实际发出2k(收到对端ack),接收端正在处理接收到1k数据,处理完成后数据接收端关闭了socket,会发生什么?
场景分析 数据发送端有监听机制,数据发送端用户态会得到接收端端关闭信号(socket可读信号),这时候用户正确打开方式是调用close关闭socket 如果数据发送端未处理该关闭信号,且数据接收端没有rst强制关闭连接,数据发送端仍然可正常发送数据 如果数据发送端未处理该关闭信号,但是数据接收端已经rst强制关闭连接,数据发送端仍然在send发送数据,send将返回-1 如果是阻塞情况,但是因缓冲区满正在阻塞,如果接收端发送rst,阻塞发送端会退出阻塞返回,发送成功字节数,如果在此调用send,将返回-1 场景4:发送端10k数据已经安全放入缓冲区,已实际发出2k(收到对端ack),接收端正在处理接收到1k数据,此时网络出现异常 场景分析 接收应用程序在处理完已收到的1k数据后,会继续从缓存区读取余下的1k数据,然后就表现为无数据可读的现象,这种情况需要应用程序来处理超时.一般做法是设定一个select等待的最大时间,如果超出这个时间依然没有数据可读,则认为socket已不可用.发送应用程序会不断的将余下的数据发送到网络上,但始终得不到确认,所以缓存区的可用空间持续为0,这种情况也需要应用程序来处理.如果不由应用程序来处理这种情况超时的情况,也可以通过tcp协议本身来处理,具体可以查看sysctl项中的:net.ipv4.tcp_keepalive_intvlnet.ipv4.tcp_keepalive_probesnet.ipv4.tcp_keepalive_time send特点 send只是将数据放入缓冲区中,并不是真正已经发给对方 非阻塞发送字节可以是1-n,其发送多少完全依赖于剩余的发送缓冲区 socket发送函数解析发送流程图 sendsendtosendmmsgsendmsg 上述流程调用过程如下:->socketcall ->sock_sendmsg -> __sock_sendmsg -> sock->ops->sendmsg(inet_sendmsg)->[tcp_prot]tcp_sendmsg 内核系统调用send 、sendto、sendmsg、sendmmsg发送函数由glibc提供,声明于/usr/include/sys/socket.h用户态在调用后会进入到sys_socketcall系统调用中,下面代码部分就是其入口1234567891011121314151617181920212223242526SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args){... switch (call) { ... case SYS_SEND: err = sys_send(a0, (void __user *)a1, a[2], a[3]); break; case SYS_SENDTO: err = sys_sendto(a0, (void __user *)a1, a[2], a[3], (struct sockaddr __user *)a[4], a[5]); break; ... case SYS_SENDMSG: err = sys_sendmsg(a0, (struct msghdr __user *)a1, a[2]); break; case SYS_SENDMMSG: err = sys_sendmmsg(a0, (struct mmsghdr __user *)a1, a[2], a[3]); break; ... default: err = -EINVAL; break; } return err;} send 是sendto的一种特殊情况,(sendto发送地址为NULL发送地址长度为0) 12345SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len, unsigned int, flags){ return sys_sendto(fd, buff, len, flags, NULL, 0);} sendto -> sock_sendmsg -> __sock_sendmsg -> sock->ops->sendmsg(inet_sendmsg) 123456789101112SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len, unsigned int, flags, struct sockaddr __user *, addr, int, addr_len){ ... 
err = sock_sendmsg(sock, &msg, len);out_put: fput_light(sock->file, fput_needed);out: return err;} sendmsg 和sendmmsg 完成用户态数据拷贝到内核态后,最终也是调用inet_sendmsg处理,在此就拿sendto情况详细分析 sendto源码实现分析sendto -> sock_sendmsg -> “sock_sendmsg” ->”sock_sendmsg_nosec” -> sock->ops->sendmsg(inet_sendmsg) 首先分析sock_sendmsg实现 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253int sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t size){ struct kiocb iocb; struct sock_iocb siocb; int ret; /*异步IO控制块初始化*/ init_sync_kiocb(&iocb, NULL); iocb.private = &siocb; /*异步控制块调用完毕后,可调用__sock_sendmsg发送数据*/ ret = __sock_sendmsg(&iocb, sock, msg, size); if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&iocb); return ret;}static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size){ int err = security_socket_sendmsg(sock, msg, size); /*调用__sock_sendmsg_nosec*/ return err ?: __sock_sendmsg_nosec(iocb, sock, msg, size);}static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size){ struct sock_iocb *si = kiocb_to_siocb(iocb); si->sock = sock; si->scm = NULL; si->msg = msg; si->size = size; /*调用inet_sendnsg*/ return sock->ops->sendmsg(iocb, sock, msg, size);}int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size){ struct sock *sk = sock->sk; sock_rps_record_flow(sk); /*如果连接没有分配本地端口且允许分配本地端口,我们就给连接绑定一个本地端口 */ /* We may need to bind the socket. */ if (!inet_sk(sk)->inet_num && !sk->sk_prot->no_autobind && inet_autobind(sk)) return -EAGAIN; /*传输层是TCP情况下,调用tcp_sendmsg()*/ return sk->sk_prot->sendmsg(iocb, sk, msg, size);} 其次分析inet_autobind ,获取可用端口并给,获取后的端口会赋值给inet->inet_sport/inet_num 1234567891011121314151617181920static int inet_autobind(struct sock *sk){ struct inet_sock *inet; /* We may need to bind the socket. 
*/ lock_sock(sk); inet = inet_sk(sk); if (!inet->inet_num) { /*针对于TCP情况sk->sk_prot->get_port调用的是inet_csk_get_port * inet_csk_get_port工作获取端口,并将其赋值给inet->inet_num */ if (sk->sk_prot->get_port(sk, 0)) { release_sock(sk); return -EAGAIN; } /*获取inet->inet_num赋值给inet->inet_sport*/ inet->inet_sport = htons(inet->inet_num); } release_sock(sk); return 0;} 最后分析tcp_sendmsg int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t size){ struct iovec *iov; struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; int iovlen, flags, err, copied = 0; int mss_now = 0, size_goal, copied_syn = 0, offset = 0; bool sg; long timeo; lock_sock(sk); flags = msg->msg_flags; if (flags & MSG_FASTOPEN) { err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size); if (err == -EINPROGRESS && copied_syn > 0) goto out; else if (err) goto out_err; offset = copied_syn; } /* * 获取数据发送超时时间 */ timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); /* Wait for a connection to finish. One exception is TCP Fast Open * (passive side) where data is allowed to be sent before a connection * is fully established. 
*/ /* * TCP状态检查,ES和CLOSE_WAIT状态才能发送数据,其它状态都要等待连接建立起来 * 否则直接返回错误 * * 随着协议栈进步,增加一种情况tcp_passive_fastopen即tcp被动快速打开时候,不区分当前TCP处于状态 */ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) && !tcp_passive_fastopen(sk)) { /*等待连接建立,连接建立成功则返回0*/ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) goto do_error; } /*开启repair功能处理*/ if (unlikely(tp->repair)) { if (tp->repair_queue == TCP_RECV_QUEUE) { copied = tcp_send_rcvq(sk, msg, size); goto out_nopush; } err = -EINVAL; if (tp->repair_queue == TCP_NO_QUEUE) goto out_err; /* 'common' sending to sendq */ } /**/ /* This should be in poll */ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); /*获取发送mss*/ mss_now = tcp_send_mss(sk, &size_goal, flags); /* Ok commence sending. */ iovlen = msg->msg_iovlen;//应用层要发送数据块个数 iov = msg->msg_iov;//要发送数据地址 copied = 0;//已经放到缓冲区的数据长度 err = -EPIPE; /*在发送数据前,如果sk已经关闭或者出现err,直接返回 -EPIPE*/ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) goto out_err; /*网卡分散聚合*/ sg = !!(sk->sk_route_caps & NETIF_F_SG); while (--iovlen >= 0) { /*获取用户态数据长度和数据指针并指向下一个用户态要发送数据块*/ size_t seglen = iov->iov_len; unsigned char __user *from = iov->iov_base; iov++; /*TCP fast open涉及*/ if (unlikely(offset > 0)) { /* Skip bytes copied in SYN */ if (offset >= seglen) { offset -= seglen; continue; } seglen -= offset; from += offset; offset = 0; } while (seglen > 0) { int copy = 0; int max = size_goal; /*从发送队列尾部取skb,尝试将用户态数据放入skb->data剩余空间*/ skb = tcp_write_queue_tail(sk); if (tcp_send_head(sk)) { /*另一种mss情况,GSO*/ if (skb->ip_summed == CHECKSUM_NONE) max = mss_now; copy = max - skb->len; } if (copy <= 0) {/*skb已经装满数据,后续会申请新的skb来发送数据*/new_segment: /* Allocate new segment. If the interface is SG, * allocate skb fitting to single page. */ if (!sk_stream_memory_free(sk)) goto wait_for_sndbuf; /*申请内存大小为select_size(线性数据区+协议头),申请失败或者不合法,睡眠等待*/ skb = sk_stream_alloc_skb(sk, select_size(sk, sg), sk->sk_allocation); if (!skb) goto wait_for_memory; /* * Check whether we can use HW checksum. 
* 检查网卡硬件是否可以计算校验和 */ if (sk->sk_route_caps & NETIF_F_CSUM_MASK) skb->ip_summed = CHECKSUM_PARTIAL; /*将新分配的skb入sk_write_queue数据发送队列*/ skb_entail(sk, skb); copy = size_goal; max = size_goal; /* All packets are restored as if they have * already been sent. skb_mstamp isn't set to * avoid wrong rtt estimation. * TCP repair */ if (tp->repair) TCP_SKB_CB(skb)->sacked |= TCPCB_REPAIRED; } /* Try to append data to the end of skb. */ if (copy > seglen) copy = seglen; /* Where to copy to? */ /*如果数据还有线性区间,直接将数据拷入并计算校验和*/ if (skb_availroom(skb) > 0) { /* We have some space in skb head. Superb! */ copy = min_t(int, copy, skb_availroom(skb)); err = skb_add_data_nocache(sk, skb, from, copy); if (err) goto do_fault; } else {/*如果没有了线性空间*/ /* * 数据会被复制到分页中 * */ bool merge = true; /*取得当前SKB的分片段数*/ int i = skb_shinfo(skb)->nr_frags; struct page_frag *pfrag = sk_page_frag(sk); /*检查分页可用空间,如果没有就申请新的页,如果系统内存不足就睡眠等待*/ if (!sk_page_frag_refill(sk, pfrag)) goto wait_for_memory; /*如果不能将数据追加到最后一个分片*/ if (!skb_can_coalesce(skb, i, pfrag->page, pfrag->offset)) { /*分页已经达到最大规格,将当前数据发出去,跳到new_segment重新申请skb*/ if (i == MAX_SKB_FRAGS || !sg) { tcp_mark_push(tp, skb); goto new_segment; } merge = false; } copy = min_t(int, copy, pfrag->size - pfrag->offset); /*系统对发送缓冲区申请合法性判断*/ if (!sk_wmem_schedule(sk, copy)) goto wait_for_memory; /*拷贝用户空间数据,同时计算校验和,更新数据skb长度和缓存*/ err = skb_copy_to_page_nocache(sk, from, skb, pfrag->page, pfrag->offset, copy); if (err) goto do_error; /* Update the skb. 
*/ /*数据被追加到最后一个分页了,就更新该分页大小记录*/ if (merge) { skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy); } else { /*如果不能分页就新增页,并初始化*/ skb_fill_page_desc(skb, i, pfrag->page, pfrag->offset, copy); get_page(pfrag->page); } pfrag->offset += copy; } /*如果复制数据长度为0,不用加PSH标记*/ if (!copied) TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH; /*更新发送队列中最后一个序号,数据包的最后一个序号*/ tp->write_seq += copy; TCP_SKB_CB(skb)->end_seq += copy; skb_shinfo(skb)->gso_segs = 0; /*已经拷入了copy大小数据,用户态指针后移且更新已经拷贝数据增加*/ from += copy; copied += copy; /*所有数据处理完毕,直接退出*/ if ((seglen -= copy) == 0 && iovlen == 0) goto out; /*如果skb还可以继续填充数据或者是带外数据或者是有REPAIR选项,继续使用skb*/ if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair)) continue; /*检查是否必须立即发送,即检查自上次发送后产生的数据是否已经超过对方通告过的最大接收窗口的一半。如果必须发送则设置紧急数据标示,然后将数据发出去*/ if (forced_push(tp)) { tcp_mark_push(tp, skb); __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH); } else if (skb == tcp_send_head(sk)) /*数据不必立即发送,且发送队列上只有这段数据,则将这段数据发出*/ tcp_push_one(sk, mss_now); continue;wait_for_sndbuf:/*套接口缓冲区大小超过限制,此时无法再申请skb放数据,我们设置socket满标志*/ set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); /*系统内存不足处理*/wait_for_memory: /*skb分配失败了,已经拷入发送队列数据,直接调用tcp_push发出去 ~MSG_MORE表示无更多数据 TCP_NAGLE_PUSH 选项调用NAGLE,尽量减少小字节发送数据 */ if (copied) tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH, size_goal); /*等待内存空闲,超过timeo时间后返回错误*/ if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) goto do_error; /*啊,内存来了,重新获取MSS和TSO,继续将用户态数据拷入缓冲区*/ mss_now = tcp_send_mss(sk, &size_goal, flags); } }out: /*如果数据已经拷入发送队列,则立即发送*/ if (copied) tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);out_nopush: release_sock(sk); return copied + copied_syn;do_fault: /*复制数据异常时才进入这里 * skb无负载数据,从发送队列上去除,并更新发送队列等参数*/ if (!skb->len) { tcp_unlink_write_queue(skb, sk); /* It is the one place in all of TCP, except connection * reset, where we can be unlinking the send_head. 
*/ tcp_check_send_head(sk, skb); sk_wmem_free_skb(sk, skb); }do_error: /*如果已经复制了部分数据,即使发生了错误也可以发送,跳到out就是去发送数据去了*/ if (copied + copied_syn) goto out;out_err: err = sk_stream_error(sk, flags, err); release_sock(sk); return err;} tcp_sendmsg()做了以下事情: 如果使用了TCP Fast Open,则会在发送SYN包的同时携带上数据。 如果连接尚未建立好,不处于ESTABLISHED或者CLOSE_WAIT状态, 那么进程进行睡眠,等待三次握手的完成。 获取当前的MSS、网络设备支持的最大数据长度size_goal。 如果支持GSO,size_goal会是MSS的整数倍。 遍历用户层的数据块数组: 4.1 获取发送队列的最后一个skb,如果是尚未发送的,且长度尚未达到size_goal,那么可以往此skb继续追加数据。 4.2 否则需要申请一个新的skb来装载数据。 4.2.1 如果发送队列的总大小sk_wmem_queued大于等于发送缓存的上限sk_sndbuf,或者发送缓存中尚未发送的数据量超过了用户的设置值: 设置同步发送时发送缓存不够的标志。 如果此时已有数据复制到发送队列了,就尝试立即发送。 等待发送缓存,直到sock有发送缓存可写事件唤醒进程,或者等待超时。 4.2.2 申请一个skb,其线性数据区的大小为:通过select_size()得到的线性数据区中TCP负荷的大小 + 最大的协议头长度。 如果申请skb失败了,或者虽然申请skb成功,但是从系统层面判断此次申请不合法, 等待可用内存,等待时间为2~202ms之间的一个随机数。 4.2.3 如果以上两步成功了,就更新skb的TCP控制块字段,把skb加入到sock发送队列的尾部,增加发送队列的大小,减小预分配缓存的大小。 4.3 接下来就是拷贝消息头中的数据到skb中了。如果skb的线性数据区还有剩余空间,就复制数据到线性数据区中,同时计算校验和。 4.4 如果skb的线性数据区已经用完了,那么就使用分页区: 4.4.1 检查分页是否有可用空间,如果没有就申请新的page。如果申请失败,说明系统内存不足。之后会设置TCP内存压力标志,减小发送缓冲区的上限,睡眠等待内存。 4.4.2 判断能否往最后一个分页追加数据。不能追加时,检查分页数是否达到了上限、或网卡不支持分散聚合。如果是的话,就为此skb设置PSH标志。 然后跳转到4.2处申请新的skb,来继续填装数据。 4.4.3 从系统层面判断此次分页发送缓存的申请是否合法。 4.4.4 拷贝用户空间的数据到skb的分页中,同时计算校验和。更新skb的长度字段,更新sock的发送队列大小和预分配缓存。 4.4.5 如果把数据追加到最后一个分页了,更新最后一个分页的数据大小。否则初始化新的分页。 4.5 拷贝成功后更新:发送队列的最后一个序号、skb的结束序号、已经拷贝到发送队列的数据量。 4.6 尽可能的将发送队列中的skb发送出去。 参考 
http://blog.csdn.net/zhangskd/article/details/48207553","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"send","slug":"send","permalink":"http://vcpu.me/tags/send/"},{"name":"sendto","slug":"sendto","permalink":"http://vcpu.me/tags/sendto/"},{"name":"sendmsg","slug":"sendmsg","permalink":"http://vcpu.me/tags/sendmsg/"},{"name":"sendmmsg","slug":"sendmmsg","permalink":"http://vcpu.me/tags/sendmmsg/"}]},{"title":"centos环境下脚本执行顺序探究","slug":"centos脚本执行顺序","date":"2017-06-15T07:42:10.000Z","updated":"2017-06-15T07:42:10.000Z","comments":true,"path":"centos脚本执行顺序/","link":"","permalink":"http://vcpu.me/centos脚本执行顺序/","excerpt":"centos脚本执行顺序通用角度分析,centos 7 系统中存在如下以下5种常用的脚本路径/etc/rc.d/rc.local/etc/profile/etc/bashrc~/.bash_profile~/.bashrc 通过在除rc外的脚本中加入echo信息,reboot虚拟机并ssh登陆用户,打印出顺序如下I am etc profilei am etc bashrci am ~ bash rci am ~ bash profile","text":"centos脚本执行顺序通用角度分析,centos 7 系统中存在如下以下5种常用的脚本路径/etc/rc.d/rc.local/etc/profile/etc/bashrc~/.bash_profile~/.bashrc 通过在除rc外的脚本中加入echo信息,reboot虚拟机并ssh登陆用户,打印出顺序如下I am etc profilei am etc bashrci am ~ bash rci am ~ bash profile 脚本执行顺序和执行时机 脚本路径 执行顺序 执行时机 /etc/rc.d/rc.local 1 系统起机执行一次,后续均不执行 /etc/profile 2 ssh/su/界面登陆时执行 /etc/bashrc 3 ssh/su/界面登陆时执行 ~/.bash_profile 4 ssh/su/界面登陆以当前用户身份登陆 ~/.bashrc 5 ssh/su/界面登陆以当前用户身份登陆 脚本含义rc.local脚本centos启动时候执行脚本,可以用作默认启动/etc/profile和/etc/bashrc 属于系统的全局变量设置~/profile和~/bashrc 属于给予某一个用户的变量设置 profile和bashrc区别 profile 是用户唯一用来设置环境变量的地方,因为用户可能有多种shell(bash、sh、zsh),环境变量没有必要在每种shell都初始化,只需要统一初始化就行,很显然,profile就是这样的地方 bashrc 是专门给bash做初始化设置的,相对应来讲,其它shell会有专门的shrc、zshrc文件存放 开机启动脚本其它说明centos7 默认是没有执行权限的,想在此处加执行脚本,执行脚本不会执行到,需要增添执行权限 123456[root@localhost rc.d]# ls -alt rc.local-rw-r--r--. 
1 root root 491 Jun 13 22:24 rc.local[root@localhost rc.d]# chmod +x rc.local[root@localhost qinlong]# ls -alt /etc/rc.d/rc.local-rwxr-xr-x. 1 root root 535 Jun 13 22:48 /etc/rc.d/rc.local","categories":[{"name":"centos","slug":"centos","permalink":"http://vcpu.me/categories/centos/"}],"tags":[{"name":"linux","slug":"linux","permalink":"http://vcpu.me/tags/linux/"},{"name":"centos","slug":"centos","permalink":"http://vcpu.me/tags/centos/"}]},{"title":"f-stack安装运行","slug":"f-stack安装","date":"2017-06-12T09:50:55.000Z","updated":"2018-02-07T13:13:17.000Z","comments":true,"path":"f-stack安装/","link":"","permalink":"http://vcpu.me/f-stack安装/","excerpt":"mac os virtual box跑f-stack环境搭建基本环境kernel版本3.10.0-514.el7.x86_64CentOS-7-x86_64-Minimal-1611.isoMac osxOreacle Virtual Box5.1.22kernel-devel 操作步骤在Oreacle Virtual Box上安装centos 7虚拟机配置开启NAT网卡和桥接网卡 网卡配置芯片类型选择Add two more virtual network adapters with “Intel PRO/1000 MT Server (82545EM)” type in order to provide virtual network hardware to the virtual machine that is supported by Intel DPDK. 上述说明来自于一篇国外文档说明,使用的Inetl 82545EM,但是经过尝试Intel PRO/1000MT桌面(82540EN)也是可以的。所以不要太迷信啦。","text":"mac os virtual box跑f-stack环境搭建基本环境kernel版本3.10.0-514.el7.x86_64CentOS-7-x86_64-Minimal-1611.isoMac osxOreacle Virtual Box5.1.22kernel-devel 操作步骤在Oreacle Virtual Box上安装centos 7虚拟机配置开启NAT网卡和桥接网卡 网卡配置芯片类型选择Add two more virtual network adapters with “Intel PRO/1000 MT Server (82545EM)” type in order to provide virtual network hardware to the virtual machine that is supported by Intel DPDK. 
上述说明来自于一篇国外文档说明,使用的Inetl 82545EM,但是经过尝试Intel PRO/1000MT桌面(82540EN)也是可以的。所以不要太迷信啦。 桥接网卡用来ssh登陆管理串口,NAT网卡用来运行DPDK驱动,跑nginx 在Virtual Box上制作地址映射 centos7虚拟机上网口配置信息 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061[root@localhost ~]# cat /etc/sysconfig/network-scripts/ifcfg-enp0s17TYPE="Ethernet"BOOTPROTO="dhcp"DEFROUTE="yes"PEERDNS="yes"PEERROUTES="yes"IPV4_FAILURE_FATAL="no"IPV6INIT="yes"IPV6_AUTOCONF="yes"IPV6_DEFROUTE="yes"IPV6_PEERDNS="yes"IPV6_PEERROUTES="yes"IPV6_FAILURE_FATAL="no"IPV6_ADDR_GEN_MODE="stable-privacy"NAME="enp0s17"UUID="2ea1ed66-7bcd-4153-a495-39c25d5f0ff9"DEVICE="enp0s17"ONBOOT="yes"[root@localhost ~]# cat /etc/sysconfig/network-scripts/ifcfg-enp0s8TYPE=EthernetBOOTPROTO=dhcpDEFROUTE=yesPEERDNS=yesPEERROUTES=yesIPV4_FAILURE_FATAL=noIPV6INIT=yesIPV6_AUTOCONF=yesIPV6_DEFROUTE=yesIPV6_PEERDNS=yesIPV6_PEERROUTES=yesIPV6_FAILURE_FATAL=noIPV6_ADDR_GEN_MODE=stable-privacyNAME=enp0s8UUID=6c930d05-bc17-4316-998e-f01a7233cbd3DEVICE=enp0s8ONBOOT=yes[root@localhost ~]# ifconfigenp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 192.168.55.183 netmask 255.255.255.0 broadcast 192.168.55.255 inet6 fe80::f97d:539:4010:eaff prefixlen 64 scopeid 0x20<link> ether 08:00:27:d5:ee:00 txqueuelen 1000 (Ethernet) RX packets 521 bytes 58437 (57.0 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 155 bytes 23680 (23.1 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0enp0s17: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 10.0.2.15 netmask 255.255.255.0 broadcast 10.0.2.255 inet6 fe80::a15d:3b87:fec0:f3c1 prefixlen 64 scopeid 0x20<link> ether 08:00:27:28:39:6c txqueuelen 1000 (Ethernet) RX packets 2 bytes 1180 (1.1 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 10 bytes 1308 (1.2 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10<host> loop 
txqueuelen 1 (Local Loopback) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 开启CPU flags(SSE 4.1/SSE 4.2)12VBoxManage setextradata "VM name" VBoxInternal/CPUM/SSE4.1 1VBoxManage setextradata "VM name" VBoxInternal/CPUM/SSE4.2 1 Note: 上述CPU flags默认是不开启的,启动f-stack上ngx会err日志如下 如果不开启sse cpu选项,在启动ngx会报如下问题/usr/local/nginx_fstack/sbin/nginx /data/f-stack/config.ini -c 1 –proc-type=primary –num-procs=1 –proc-id=0ERROR: This system does not support “SSE4_1”.Please check that RTE_MACHINE is set correctly. 做NAT网卡10.0.2.15需要在vbox上做主机地址和端口映射才能访问 安装详细步骤1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253yum install -y git gcc openssl-devel bccd /datayum install kernel-devel-`uname -r` -ymkdir /data/f-stackgit clone https://github.com/F-Stack/f-stack.git /data/f-stack# Compile DPDKcd /data/f-stack/dpdkmake config T=x86_64-native-linuxapp-gccmake# set hugepage echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepagesmkdir /mnt/hugemount -t hugetlbfs nodev /mnt/huge# insmod komodprobe uioinsmod build/kmod/igb_uio.koinsmod build/kmod/rte_kni.ko# set ip addressexport myaddr=`ifconfig enp0s17 | grep "inet" | grep -v ":" | awk -F ' ' '{print $2}'`export mymask=`ifconfig enp0s17 | grep "netmask" | awk -F ' ' '{print $4}'`export mybc=`ifconfig enp0s17 | grep "broadcast" | awk -F ' ' '{print $6}'`export myhw=`ifconfig enp0s17 | grep "ether" | awk -F ' ' '{print $2}'`export mygw=`route -n | grep 0.0.0.0 | grep enp0s17 | grep UG | awk -F ' ' '{print $2}'`sed "s/addr=192.168.1.2/addr=${myaddr}/" -i /data/f-stack/config.inised "s/netmask=255.255.255.0/netmask=${mymask}/" -i /data/f-stack/config.inised "s/broadcast=192.168.1.255/broadcast=${mybc}/" -i /data/f-stack/config.inised "s/gateway=192.168.1.1/gateway=${mygw}/" -i /data/f-stack/config.ini# Compile F-Stack libexport FF_PATH=/data/f-stackexport 
FF_DPDK=/data/f-stack/dpdk/buildcd /data/f-stack/libmake# Compile Nginxcd ../app/nginx-1.11.10./configure --prefix=/usr/local/nginx_fstack --with-ff_modulemakemake install# offload NIC(if there is only one NIC,the follow commands must run in a script)ifconfig enp0s17 downpython /data/f-stack/dpdk/tools/dpdk-devbind.py --bind=igb_uio enp0s17# start Nginxcd ../.../start.sh -b /usr/local/nginx_fstack/sbin/nginx -c config.ini 测试在vbox主机上访问映射地址和端口192.168.55.165:8080 -> 10.0.2.15:80curl http://192.168.55.165:808012345678910111213141516171819202122232425<!DOCTYPE html><html><head><title>Welcome to nginx!</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>Welcome to nginx!</h1><p>If you see this page, the nginx web server is successfully installed andworking. Further configuration is required.</p><p>For online documentation and support please refer to<a href=\"http://nginx.org/\">nginx.org</a>.<br/>Commercial support is available at<a href=\"http://nginx.com/\">nginx.com</a>.</p><p><em>Thank you for using nginx.</em></p></body></html> 附录f-stack ngx配置文件/data/f-stack/config.ini123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869[dpdk]## Hexadecimal bitmask of cores to run on.lcore_mask=3## Port mask, enable and disable ports.## Default: all ports are enabled.#port_mask=1channel=4## Number of ports.nb_ports=1promiscuous=1numa_on=1## TCP segment offload, default: disabled.tso=0## Port config section## According to dpdk.nb_ports: port0, port1...[port0]addr=10.0.2.15netmask=255.255.255.0broadcast=10.0.2.255gateway=10.0.2.2## Packet capture path, this will hurt performance#pcap=./a.pcap## Kni config: if enabled and method=reject,## all packets that do not belong to the following tcp_port and udp_port## will transmit to kernel; if method=accept, all packets that belong to## the following tcp_port and udp_port will transmit to 
kernel.#[kni]#enable=1#method=reject#tcp_port=80,443#udp_port=53## FreeBSD network performance tuning configurations.## Most native FreeBSD configurations are supported.[freebsd.boot]hz=100kern.ipc.maxsockets=262144net.inet.tcp.syncache.hashsize=4096net.inet.tcp.syncache.bucketlimit=100net.inet.tcp.tcbhashsize=65536[freebsd.sysctl]kern.ipc.somaxconn=32768kern.ipc.maxsockbuf=16777216net.inet.tcp.fast_finwait2_recycle=1net.inet.tcp.sendspace=16384net.inet.tcp.recvspace=8192net.inet.tcp.nolocaltimewait=1net.inet.tcp.cc.algorithm=htcpnet.inet.tcp.sendbuf_max=16777216net.inet.tcp.recvbuf_max=16777216net.inet.tcp.sendbuf_auto=1net.inet.tcp.recvbuf_auto=1net.inet.tcp.sendbuf_inc=16384net.inet.tcp.recvbuf_inc=524288net.inet.tcp.inflight.enable=0net.inet.tcp.sack=1net.inet.tcp.blackhole=1net.inet.tcp.msl=2000net.inet.tcp.delayed_ack=0net.inet.udp.blackhole=1net.inet.ip.redirect=0 f-stack ngx正常启动信息[root@localhost f-stack]# ./start.sh -b /usr/local/nginx_fstack/sbin/nginx -c config.ini/usr/local/nginx_fstack/sbin/nginx config.ini -c 1 –proc-type=primary –num-procs=1 –proc-id=0[dpdk]: lcore_mask=3[dpdk]: channel=4[dpdk]: nb_ports=1[dpdk]: promiscuous=1[dpdk]: numa_on=1[dpdk]: tso=0[port0]: addr=10.0.2.15[port0]: netmask=255.255.255.0[port0]: broadcast=10.0.2.255[port0]: gateway=10.0.2.2[freebsd.boot]: hz=100[freebsd.boot]: kern.ipc.maxsockets=262144[freebsd.boot]: net.inet.tcp.syncache.hashsize=4096[freebsd.boot]: net.inet.tcp.syncache.bucketlimit=100[freebsd.boot]: net.inet.tcp.tcbhashsize=65536[freebsd.sysctl]: kern.ipc.somaxconn=32768[freebsd.sysctl]: kern.ipc.maxsockbuf=16777216[freebsd.sysctl]: net.inet.tcp.fast_finwait2_recycle=1[freebsd.sysctl]: net.inet.tcp.sendspace=16384[freebsd.sysctl]: net.inet.tcp.recvspace=8192[freebsd.sysctl]: net.inet.tcp.nolocaltimewait=1[freebsd.sysctl]: net.inet.tcp.cc.algorithm=htcp[freebsd.sysctl]: net.inet.tcp.sendbuf_max=16777216[freebsd.sysctl]: net.inet.tcp.recvbuf_max=16777216[freebsd.sysctl]: 
net.inet.tcp.sendbuf_auto=1[freebsd.sysctl]: net.inet.tcp.recvbuf_auto=1[freebsd.sysctl]: net.inet.tcp.sendbuf_inc=16384[freebsd.sysctl]: net.inet.tcp.recvbuf_inc=524288[freebsd.sysctl]: net.inet.tcp.inflight.enable=0[freebsd.sysctl]: net.inet.tcp.sack=1[freebsd.sysctl]: net.inet.tcp.blackhole=1[freebsd.sysctl]: net.inet.tcp.msl=2000[freebsd.sysctl]: net.inet.tcp.delayed_ack=0[freebsd.sysctl]: net.inet.udp.blackhole=1[freebsd.sysctl]: net.inet.ip.redirect=0EAL: Detected 1 lcore(s)EAL: Probing VFIO support…EAL: PCI device 0000:00:08.0 on NUMA socket -1EAL: probe driver: 8086:100f rte_em_pmdEAL: PCI device 0000:00:11.0 on NUMA socket -1EAL: probe driver: 8086:100f rte_em_pmdcreate mbuf pool on socket 0create ring:arp_ring_0_0 success, 2047 ring entries are now free!Port 0 MAC: 08 00 27 28 39 6cTSO is disabledset port 0 to promiscuous mode ok Checking link status………………..donePort 0 Link Up - speed 1000 Mbps - full-duplexlink_elf_lookup_symbol: missing symbol hash tablelink_elf_lookup_symbol: missing symbol hash tablenetisr_init: forcing maxthreads from 1 to 0Timecounters tick every 10.000 msecTimecounter “ff_clock” frequency 100 Hz quality 1f-stack-0: Ethernet address: 08:00:27:28:39:6c f-stack 环境安装完成后重启后应该重新设置的参数下述命令可放入/etc/rd.c/rd.local启动文件中,在机器重启后执行一次下面环境设置12345678echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepagesmount -t hugetlbfs nodev /mnt/hugemodprobe uioinsmod /data/f-stack/dpdk/build/kmod/igb_uio.koinsmod /data/f-stack/dpdk/build/kmod/rte_kni.koifconfig enp0s17 downpython /data/f-stack/dpdk/tools/dpdk-devbind.py --bind=igb_uio enp0s17/data/f-stack/start.sh -b /usr/local/nginx_fstack/sbin/nginx -c /data/f-stack/config.ini 如果kernel-devel yum无法找到内核对应版本可去centos官网查找下载wget https://buildlogs.centos.org/c7.1511.00/kernel/20151119220809/3.10.0-327.el7.x86_64/kernel-devel-3.10.0-327.el7.x86_64.rpmrpm -ivh 
kernel-devel-3.10.0-327.el7.x86_64.rpm","categories":[{"name":"DPDK","slug":"DPDK","permalink":"http://vcpu.me/categories/DPDK/"}],"tags":[{"name":"DPDK","slug":"DPDK","permalink":"http://vcpu.me/tags/DPDK/"},{"name":"f-stack","slug":"f-stack","permalink":"http://vcpu.me/tags/f-stack/"},{"name":"nginx","slug":"nginx","permalink":"http://vcpu.me/tags/nginx/"}]},{"title":"bind()实现源码分析","slug":"bind","date":"2017-06-12T09:35:01.000Z","updated":"2017-06-12T09:35:01.000Z","comments":true,"path":"bind/","link":"","permalink":"http://vcpu.me/bind/","excerpt":"bind()内核版本:3.10.0-514.16.1.el7.x86_64下述源码分析均以tcp socket为背景 123#include <sys/types.h>#include <sys/socket.h>int bind(int sockfd, struct sockaddr *my_addr, socklen_t addrlen); socket文件描述符 要绑定的承载地址和端口的结构体 struct sockaddr 第二个参数struct sockaddr的长度 该函数负责绑定套接字的地址和端口,按照绑定者身份来分,会存在两种情况 情况1:绑定者为客户端,主动发起请求方,绑定地址和端口成功后,会使用该地址和端口进行发包一般情况下,客户端的地址和端口都是其自动选择的,不需要绑定动作。情况2:绑定者为服务端,被动连接接收方,绑定地址和端口成功后,客户端只能向该地址和端口发送连接请求。服务端往往需要绑定地址和端口。如果服务端存在多网卡情况,其只需要绑定服务端口即可,其目的地址就是客户端访问的目的地址。","text":"bind()内核版本:3.10.0-514.16.1.el7.x86_64下述源码分析均以tcp socket为背景 123#include <sys/types.h>#include <sys/socket.h>int bind(int sockfd, struct sockaddr *my_addr, socklen_t addrlen); socket文件描述符 要绑定的承载地址和端口的结构体 struct sockaddr 第二个参数struct sockaddr的长度 该函数负责绑定套接字的地址和端口,按照绑定者身份来分,会存在两种情况 情况1:绑定者为客户端,主动发起请求方,绑定地址和端口成功后,会使用该地址和端口进行发包一般情况下,客户端的地址和端口都是其自动选择的,不需要绑定动作。情况2:绑定者为服务端,被动连接接收方,绑定地址和端口成功后,客户端只能向该地址和端口发送连接请求。服务端往往需要绑定地址和端口。如果服务端存在多网卡情况,其只需要绑定服务端口即可,其目的地址就是客户端访问的目的地址。 sys_bind12345678910111213141516171819202122SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen){ struct socket *sock; struct sockaddr_storage address; int err, fput_needed; sock = sockfd_lookup_light(fd, &err, &fput_needed); if (sock) { err = move_addr_to_kernel(umyaddr, addrlen, &address); if (err >= 0) { err = security_socket_bind(sock, (struct sockaddr *)&address, addrlen); if (!err) err = sock->ops->bind(sock, (struct sockaddr *) &address, addrlen);//inet_bind } 
fput_light(sock->file, fput_needed); } return err;} sockfd_lookup_light 和move_addr_to_kernel分别为根据fd从当前进程取出socket和把参数从用户空间考入地址空间 bind系统调用最重要函数为sock->ops->bind 在TCP协议情况下inet_stream_ops中bind成员函数为inet_bind 后续为对此函数的分析 inet_bind实现较为复杂,现在版本和原始版本相比,支持端口复用了123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len){ struct sockaddr_in *addr = (struct sockaddr_in *)uaddr; struct sock *sk = sock->sk; struct inet_sock *inet = inet_sk(sk); struct net *net = sock_net(sk); unsigned short snum; int chk_addr_ret; int err; /* If the socket has its own bind function then use it. (RAW) */ /*raw socket才会用到,tcp_proc无此函数*/ if (sk->sk_prot->bind) { err = sk->sk_prot->bind(sk, uaddr, addr_len); goto out; } err = -EINVAL; /*地址长度检验*/ if (addr_len < sizeof(struct sockaddr_in)) goto out; /*bind地址中协议检查,必须是下面两种情况 * 1.绑定的地址协议为AF_INET * 2.绑定协议为0(AF_UNSPEC)同时地址也为0 * 否则直接退出inet_bind ,返回地址不支持错误码 */ if (addr->sin_family != AF_INET) { /* Compatibility games : accept AF_UNSPEC (mapped to AF_INET) * only if s_addr is INADDR_ANY. */ err = -EAFNOSUPPORT; if (addr->sin_family != AF_UNSPEC || addr->sin_addr.s_addr != htonl(INADDR_ANY)) goto out; } /*获取根据IP地址得出地址类型 RTN_LOCAL 本机地址 RTN_MULTICAST 多播 RTN_BROADCAST 广播 RTN_UNICAST */ chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr); /* Not specified by any standard per-se, however it breaks too * many applications when removed. It is unfortunate since * allowing applications to make a non-local bind solves * several problems with systems using dynamic addressing. * (ie. 
your servers still start up even if your ISDN link * is temporarily down) */ err = -EADDRNOTAVAIL; /* 地址类型必须是本机,多播,组播中的一个,否则直接返回,报地址参数异常 * */ if (!net->ipv4_sysctl_ip_nonlocal_bind && !(inet->freebind || inet->transparent) && addr->sin_addr.s_addr != htonl(INADDR_ANY) && chk_addr_ret != RTN_LOCAL && chk_addr_ret != RTN_MULTICAST && chk_addr_ret != RTN_BROADCAST) goto out; snum = ntohs(addr->sin_port); err = -EACCES; /* * 要绑定的端口小于1024时候,要求运行该应用程序的为超级权限 * 否则返回并报权限不运行的错误 */ if (snum && snum < PROT_SOCK && !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE)) goto out; /* We keep a pair of addresses. rcv_saddr is the one * used by hash lookups, and saddr is used for transmit. * * In the BSD API these are the same except where it * would be illegal to use them (multicast/broadcast) in * which case the sending device address is used. */ lock_sock(sk); /* Check these errors (active socket, double bind). */ err = -EINVAL; /*bind动作发生在最初状态,其TCP状态是CLOSE且没有绑定过 * 否则直接判别为异常 */ if (sk->sk_state != TCP_CLOSE || inet->inet_num) goto out_release_sock; /*inet_rcv_saddr 用作hash表查找使用 *inet_saddr作为发包源地址 *当为广播和组播时候发送地址为0 */ inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr; if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST) inet->inet_saddr = 0; /* Use device */ /* Make sure we are allowed to bind here. 
*/ /* TCP时候该函数负责查询该端口是否被使用,没有被使用返回0,否则返回非0 *如果已经被使用,则退出bind函数,并返回地址和端口已经被使用错误-EADDRINUSE *sk->sk_prot->get_port= inet_csk_get_port */ if (sk->sk_prot->get_port(sk, snum)) { inet->inet_saddr = inet->inet_rcv_saddr = 0; err = -EADDRINUSE; goto out_release_sock; } /* * 更新sk->sk_userlocks标记,表明本地地址和端口已经绑定 */ if (inet->inet_rcv_saddr) sk->sk_userlocks |= SOCK_BINDADDR_LOCK; if (snum) sk->sk_userlocks |= SOCK_BINDPORT_LOCK; inet->inet_sport = htons(inet->inet_num); inet->inet_daddr = 0; inet->inet_dport = 0; sk_dst_reset(sk); err = 0;out_release_sock: release_sock(sk);out: return err;}EXPORT_SYMBOL(inet_bind); 绑定地址长度和协议检查 长度异常返回-EINVAL 表示参数异常,协议不支持 -EAFNOSUPPORT 对绑定地址进行类型检查inet_addr_type,必须是本机地址,组播和广播地址类型 -EADDRNOTAVAIL 否则报地址参数异常 如果端口小于1024 ,必须为超级权限ns_capable 否则 err = -EACCES 权限不允许 sk->sk_prot->get_port = inet_csk_get_port 四层端口检查,看是否被使用 更新sk->skuserlocks标记,代表地址和端口已经被绑定 扩展函数: inet_csk_get_port TCP四层端口检查 inet_addr_type 地址类型判别 ns_capable 超级权限检查 inet_csk_get_port123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202int inet_csk_get_port(struct sock *sk, unsigned short snum){ struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo; struct inet_bind_hashbucket *head; struct inet_bind_bucket *tb; int ret, attempts = 5; struct net *net = sock_net(sk); int smallest_size = -1, smallest_rover; kuid_t uid = sock_i_uid(sk); int attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 
1 : 0; /*禁止上下半部,防止进程冲突*/ local_bh_disable(); /* * 如果没有bind端口 */ if (!snum) {/*没有指定端口会自动选择端口*/ int remaining, rover, low, high;again: /*获取端口的取值范围*/ inet_get_local_port_range(net, &low, &high);/*后文辉对其进行分析*/ if (attempt_half) { int half = low + ((high - low) >> 1); if (attempt_half == 1) high = half; else low = half; } /*取值范围内端口数*/ remaining = (high - low) + 1; /*随机选择端口*/ smallest_rover = rover = net_random() % remaining + low; smallest_size = -1; do { /*保留端口检查,服务端可以设置 /proc/sys/net/ipv4/ip_local_reserved_ports */ if (inet_is_reserved_local_port(rover)) goto next_nolock;/*端口加1继续*/ /*根据端口号和HASHsize从确定hash桶,并锁住它,后续便利查找*/ head = &hashinfo->bhash[inet_bhashfn(net, rover, hashinfo->bhash_size)]; spin_lock(&head->lock); inet_bind_bucket_for_each(tb, &head->chain) if (net_eq(ib_net(tb), net) && tb->port == rover) { /*判断端口是否可以复用,如果可以复用即使在链表中也一样复用*/ if (((tb->fastreuse > 0 && sk->sk_reuse && sk->sk_state != TCP_LISTEN) || (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && (tb->num_owners < smallest_size || smallest_size == -1)) { /*记录下端口的使用个数和端口*/ smallest_size = tb->num_owners; smallest_rover = rover; /*系统绑定端口已经超过最大端口数了,要去检查inet_csk_bind_conflict绑定是否存在冲突*/ if (atomic_read(&hashinfo->bsockets) > (high - low) + 1 && !inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) { /*ok,绑定没有冲突*/ snum = smallest_rover; goto tb_found; } } /*端口冲突检查*/ if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) { snum = rover; goto tb_found; } /*此端口在链表中但是不能复用,继续下一个*/ goto next; } break;/*不在bind表中,端口可以使用,直接跳出循环*/ next: spin_unlock(&head->lock); next_nolock: /*已经找到最大端口了,从最小开始找*/ if (++rover > high) rover = low; } while (--remaining > 0);/*en,最大5次查找机会*/ /* Exhausted local port range during search? It is not * possible for us to be holding one of the bind hash * locks if this test triggers, because if 'remaining' * drops to zero, we broke out of the do/while loop at * the top level, not from the 'break;' statement. 
*/ ret = 1; /*没有找到端口,那就最后一次机会*/ if (remaining <= 0) { if (smallest_size != -1) { snum = smallest_rover; goto have_snum; } if (attempt_half == 1) { /* OK we now try the upper half of the range */ attempt_half = 2; goto again; } goto fail; } /* OK, here is the one we will use. HEAD is * non-NULL and we hold it's mutex. */ /*找到可用的端口了*/ snum = rover; } else { /*指定绑定了端口,在绑定的链表中查找,如果查找到,代表已经被绑定*/have_snum: head = &hashinfo->bhash[inet_bhashfn(net, snum, hashinfo->bhash_size)]; spin_lock(&head->lock); inet_bind_bucket_for_each(tb, &head->chain) if (net_eq(ib_net(tb), net) && tb->port == snum) goto tb_found;/*端口已经被绑定*/ } /*在绑定链表中没有发现,后续会创建*/ tb = NULL; goto tb_not_found; tb_found: if (!hlist_empty(&tb->owners)) { /*要bind的sk标记SK_FORCE_REUSE可以强制复用*/ if (sk->sk_reuse == SK_FORCE_REUSE) goto success; if (((tb->fastreuse > 0 && sk->sk_reuse && sk->sk_state != TCP_LISTEN) || (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && smallest_size == -1) { /* 是否可以复用的判别 * fastreuseport Google添加选项&& 已经开启端口复用 && 当前socket uid和查找到的uid相符合 * 当前socket也可以放到bind hash中,后续会将其加入 */ goto success; } else { ret = 1; /*端口绑定冲突,自动分配端口绑定冲突会走到此处,在自动分配端口时候进行了下列类似判别 *所以此判断基本不会执行知道跳到tb_not_found这个时候tb不为null的 */ if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) { if (((sk->sk_reuse && sk->sk_state != TCP_LISTEN) || (tb->fastreuseport > 0 && sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && smallest_size != -1 && --attempts >= 0) { spin_unlock(&head->lock); goto again; } goto fail_unlock; } } }tb_not_found: ret = 1; /*绑定时没有发现过tb,直接创建一个*/ if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep, net, head, snum)) == NULL) goto fail_unlock; if (hlist_empty(&tb->owners)) {/*没有绑定过socket*/ if (sk->sk_reuse && sk->sk_state != TCP_LISTEN) tb->fastreuse = 1; else tb->fastreuse = 0; /*设置了SO_REUSEPORT选项*/ if (sk->sk_reuseport) { tb->fastreuseport = 1; tb->fastuid = uid; } else tb->fastreuseport = 0; } else {/*如果绑定过socket*/ if (tb->fastreuse && (!sk->sk_reuse || sk->sk_state 
== TCP_LISTEN)) tb->fastreuse = 0; if (tb->fastreuseport && (!sk->sk_reuseport || !uid_eq(tb->fastuid, uid))) tb->fastreuseport = 0; }success:/*找到可用端口,添加绑定表*/ if (!inet_csk(sk)->icsk_bind_hash) inet_bind_hash(sk, tb, snum);/*sk被放到tb->owners中*/ WARN_ON(inet_csk(sk)->icsk_bind_hash != tb); ret = 0;fail_unlock: spin_unlock(&head->lock);fail: local_bh_enable(); return ret;} 如果端口为0;则自动选取端口选择过程如下: 先在[low,half] or [half,high]中随机选取一个端口,作为循环获取端口的起始端口,开始以下流程 步骤1: 保留端口检查,不满足,端口加1,重试次数减1,继续从步骤1开始 步骤2: 从当前端口映射的hash桶中取出列表头,遍历检查该端口是否被使用 步骤2-1:没有被使用,直接退出循环,tb为NULL,创建tb,跳转到tb_not_found将该端口连同创建的tb加入该hash桶的链表中,sk也被放到tb->owners中管理,结束退出 步骤2-2: 端口被使用了,检查端口使用是否冲突 步骤2-2-1:没有冲突,推出循环,跳转到tb_found,复用检查成功,sk被放到tb->owners中,结束退出 步骤2-2-2:存在冲突,直接端口+1,继续循环查找 步骤3:如果上半部分已经查找完毕,继续[half,high]中选择一个端口,进行步骤1 attempt_halfsk->sk_reuse == SK_CAN_REUSE 取端口范围 [low ,half]否则 取端口范围 [half,high] 该值会影响上述选择端口的流程从上半端还是从下半端选择端口 如果sk->sk_reuse被置SK_CAN_REUSE标记则先从下半端开始选择端口 否则直接从上半端选择端口 small_size和small_rover what’s the fuck!!! 疑惑了好久small_size和small_rover在3.10的版本中根本就没有使用基本用不到3.10版本的端口查找原则是确定端口查找区间,随机选择端口,只要该端口能复用就直接使用,已经完全去除了优先选择复用端口数较小的端口这一原则了(3.2kernel)So amazing!这两个变量可以去除了 inet_get_local_port_range1234567891011void inet_get_local_port_range(struct net *net, int *low, int *high){ unsigned int seq; do { seq = read_seqbegin(&net->ipv4_sysctl_local_ports.lock); *low = net->ipv4_sysctl_local_ports.range[0]; *high = net->ipv4_sysctl_local_ports.range[1]; } while (read_seqretry(&net->ipv4_sysctl_local_ports.lock, seq));} 12sysctl -a|grep ip_local_port_rangenet.ipv4.ip_local_port_range = 32768 60999 上述读取端口范围是用户态的ip_local_port_range,默认是3w多以后的,可以调整此参数扩大端口范围 上述read_seqbegin这种方式读取数据,是一种顺序锁,适用于读多写少的方式用方式,后续专门处博文研究 tcp端口冲突检查inet_csk(sk)->icsk_af_ops->bind_conflict1234567891011121314151617181920212223242526272829303132333435const struct inet_connection_sock_af_ops ipv4_specific = { .queue_xmit = ip_queue_xmit, .send_check = tcp_v4_send_check, .rebuild_header = inet_sk_rebuild_header, .sk_rx_dst_set = inet_sk_rx_dst_set, .conn_request 
= tcp_v4_conn_request, .syn_recv_sock = tcp_v4_syn_recv_sock, .net_header_len = sizeof(struct iphdr), .setsockopt = ip_setsockopt, .getsockopt = ip_getsockopt, .addr2sockaddr = inet_csk_addr2sockaddr, .sockaddr_len = sizeof(struct sockaddr_in), .bind_conflict = inet_csk_bind_conflict,#ifdef CONFIG_COMPAT .compat_setsockopt = compat_ip_setsockopt, .compat_getsockopt = compat_ip_getsockopt,#endif .mtu_reduced = tcp_v4_mtu_reduced,};static int tcp_v4_init_sock(struct sock *sk){ struct inet_connection_sock *icsk = inet_csk(sk); tcp_init_sock(sk); icsk->icsk_af_ops = &ipv4_specific;#ifdef CONFIG_TCP_MD5SIG tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;#endif return 0;} 从上文得知inet_csk(sk)->icsk_af_ops->bind_conflict 函数是inet_csk_bind_conflict af_ops在tcp_v4_init_sock初始化 inet_csk_bind_conflict分析123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960int inet_csk_bind_conflict(const struct sock *sk, const struct inet_bind_bucket *tb, bool relax){ struct sock *sk2; int reuse = sk->sk_reuse; int reuseport = sk->sk_reuseport; kuid_t uid = sock_i_uid((struct sock *)sk); /* * Unlike other sk lookup places we do not check * for sk_net here, since _all_ the socks listed * in tb->owners list belong to the same net - the * one this bucket belongs to. 
*/ sk_for_each_bound(sk2, &tb->owners) { /*不会冲突情况1:socket绑定设备不同*/ if (sk != sk2 && !inet_v6_ipv6only(sk2) && (!sk->sk_bound_dev_if || !sk2->sk_bound_dev_if || sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) { /* *不会冲突情况2:地址不同 */ if ((!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) && (!reuseport || !sk2->sk_reuseport || (sk2->sk_state != TCP_TIME_WAIT && !uid_eq(uid, sock_i_uid(sk2))))) { /* * 不会冲突情况3: * 条件A: (reuse && sk2->sk_reuse && sk2->sk_state != TCP_LISTEN) * 条件B:(reuseport * && sk2->sk_reuseport * &&(sk2->sk_state == TCP_TIME_WAIT || uid_eq(uid, sock_i_uid(sk2)))) * 条件A和条件B只要有一个成立,就不会冲突 * 条件A成立条件: * 链上sock和待检查sock开启地址复用且链上状态不是监听状态 * 条件B成立条件: * 链上sock和待检查sock开启端口复用且链表上状态为TW * 链上sock和待检查sock开启端口复用且两个sock的uid相同 */ if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr || sk2->sk_rcv_saddr == sk->sk_rcv_saddr) break; } /*没有开启relax,要绑定方不能复用,已绑定方不能复用,以绑定方处理监听状态*/ if (!relax && reuse && sk2->sk_reuse && sk2->sk_state != TCP_LISTEN) { if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr || sk2->sk_rcv_saddr == sk->sk_rcv_saddr) break; } } } return sk2 != NULL;} 在端口自动选择时可以重用端口条件为: a设备不同b绑定ip地址不同c要绑定sock和已绑定sock地址允许重用,且已绑定socket不处于监听状态d 链上sock和待检查sock开启端口复用且链表上状态为TWe 链上sock和待检查sock开启端口复用且两个sock的uid相同 关于条件c的补充条件:即使c满足,也需要看relax的值确定,relax为TRUE时可复用,为fase时候不能复用 自动端口时候relax为false,所以条件c消失,仅仅剩下a、b、d、e四个条件","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"bind","slug":"bind","permalink":"http://vcpu.me/tags/bind/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"}]},{"title":"connect()实现源码分析","slug":"connect","date":"2017-06-09T09:30:34.000Z","updated":"2017-06-26T03:33:43.000Z","comments":true,"path":"connect/","link":"","permalink":"http://vcpu.me/connect/","excerpt":"connect()内核版本:3.10.0-514.16.1.el7.x86_64下述源码分析均以tcp socket为背景 用户态函数int connect(int sockfd, const struct 
sockaddr *addr,socklen_t addrlen);参数: socketfd socket文件描述索引下标addr 要连接的服务端的地址addrlen addr的长度 返回值: -1 失败 strerror(errno)可帮助获取失败原因 常见失败原因有: ETIMEOUT Connection timed out服务端一直未回复syn ack,尝试多次syn后返回 ECONNREFUSED Connection refused 服务端端口没有开启,回复rst EHOSTUNREACH No route to host 服务端在同局域网内arp请求获取办不到对方mac 0 成功 用法:123456struct sockaddr_in remote_addr;memset(&remote_addr,0,sizeof(remote_addr));remote_addr.sin_family=AF_INET;remote_addr.sin_addr.s_addr=inet_addr(\"180.97.33.108\");remote_addr.sin_port = htons(80);connect(fd,(struct sockaddr*)&remote_addr,sizeof(struct sockaddr)","text":"connect()内核版本:3.10.0-514.16.1.el7.x86_64下述源码分析均以tcp socket为背景 用户态函数int connect(int sockfd, const struct sockaddr *addr,socklen_t addrlen);参数: socketfd socket文件描述索引下标addr 要连接的服务端的地址addrlen addr的长度 返回值: -1 失败 strerror(errno)可帮助获取失败原因 常见失败原因有: ETIMEOUT Connection timed out服务端一直未回复syn ack,尝试多次syn后返回 ECONNREFUSED Connection refused 服务端端口没有开启,回复rst EHOSTUNREACH No route to host 服务端在同局域网内arp请求获取办不到对方mac 0 成功 用法:123456struct sockaddr_in remote_addr;memset(&remote_addr,0,sizeof(remote_addr));remote_addr.sin_family=AF_INET;remote_addr.sin_addr.s_addr=inet_addr(\"180.97.33.108\");remote_addr.sin_port = htons(80);connect(fd,(struct sockaddr*)&remote_addr,sizeof(struct sockaddr) 实例:12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849#include <stdio.h>#include <sys/socket.h>#include <unistd.h>#include <sys/types.h>#include <netinet/in.h>#include <stdlib.h>#include <string.h>#include <errno.h>#define SERVER_PORT 20000void usage(char *name){ printf(\"usage: %s IP\\n\", name);}int main(int argc, char **argv){ int server_fd, client_fd, length = 0; struct sockaddr_in server_addr, client_addr; socklen_t socklen = sizeof(server_addr); if(argc < 2) { usage(argv[0]); exit(1); } if((client_fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) { printf(\"create socket error, exit!\\n\"); exit(1); } srand(time(NULL)); bzero(&client_addr, sizeof(client_addr)); client_addr.sin_family = AF_INET; 
client_addr.sin_addr.s_addr = htons(INADDR_ANY); bzero(&server_addr, sizeof(server_addr)); server_addr.sin_family = AF_INET; inet_aton(argv[1], &server_addr.sin_addr); server_addr.sin_port = htons(SERVER_PORT); if(connect(client_fd, (struct sockaddr*)&server_addr, socklen) < 0) { printf(\"can not connect to %s, exit!\\n\", argv[1]); printf(\"%s\\n\", strerror(errno)); exit(1); } return 0;} 运行方法:[root@localhost socketdemo]# gcc connect.c -o connect[root@localhost socketdemo]# ./connect 192.168.55.181 系统调用12345678910111213141516171819202122232425262728293031323334353637383940414243SYSCALL_DEFINE2(socketcall, int, call, unsigned long __user *, args){ unsigned long a[AUDITSC_ARGS]; unsigned long a0, a1; int err; unsigned int len; if (call < 1 || call > SYS_SENDMMSG) return -EINVAL; len = nargs[call]; if (len > sizeof(a)) return -EINVAL; /* copy_from_user should be SMP safe. */ if (copy_from_user(a, args, len)) return -EFAULT; err = audit_socketcall(nargs[call] / sizeof(unsigned long), a); if (err) return err; a0 = a[0]; a1 = a[1]; switch (call) { case SYS_SOCKET: err = sys_socket(a0, a1, a[2]); break; case SYS_BIND: err = sys_bind(a0, (struct sockaddr __user *)a1, a[2]); break; case SYS_CONNECT: err = sys_connect(a0, (struct sockaddr __user *)a1, a[2]); break; ... 
default: err = -EINVAL; break; } return err;} 系统调用sys_socketcall会携带(fd,serveraddr,serveraddrlen)参数 系统调用处理函数sys_socketcall会将参数从用户态拷贝到内核态局部变量a中 调用sys_connect函数 sys_connect(a0, (struct sockaddr __user *)a1, a[2]); sys_connect执行入口分析123456789101112131415161718192021222324SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,int,addrlen){ struct socket *sock; struct sockaddr_storage address; int err, fput_needed; sock = sockfd_lookup_light(fd, &err, &fput_needed); if (!sock) goto out; err = move_addr_to_kernel(uservaddr, addrlen, &address); if (err < 0) goto out_put; err = security_socket_connect(sock, (struct sockaddr *)&address, addrlen); if (err) goto out_put; err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen, sock->file->f_flags);out_put: fput_light(sock->file, fput_needed);out: return err; 根据fd描述符从当前进程current的files指针中的struct fdtable中的fd成员取出file fdt->fd是一个数组,用来管理当前进程的file指针 从file的private_data中获取到socket变量 把connect连接的服务端地址存入内核空间中move_addr_to_kernel sock->ops->connect 以tcp为例,此处会调用inet_stream_connect 函数集合中的inet_stream_connect inet_stream_connect分析12345678910int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr, int addr_len, int flags){ int err; lock_sock(sock->sk); err = __inet_stream_connect(sock, uaddr, addr_len, flags); release_sock(sock->sk); return err;} inet_stream_connect() 为tcp socket时connect动作调用的函数,该函数会调用__inet_stream_connect函数 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr, int addr_len, int flags){ struct sock *sk = sock->sk; int err; long timeo; //socket地址长度检查,不合法返回 if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; // 地址协议族检查,如果不合法则关闭连接 if (uaddr->sa_family == AF_UNSPEC) { err = sk->sk_prot->disconnect(sk, flags); sock->state = err ? 
SS_DISCONNECTING : SS_UNCONNECTED; goto out; } switch (sock->state) { //非法参数 default: err = -EINVAL; goto out; //该socket和对端连接已经建立 case SS_CONNECTED: err = -EISCONN; goto out; //该socket和对端连接建立中 case SS_CONNECTING: err = -EALREADY; /* Fall out of switch with err, set for this state */ break; //该socket和对未连接 case SS_UNCONNECTED: err = -EISCONN; //如果未连接,但是socket还不是TCP_CLOSE状态错误返回 if (sk->sk_state != TCP_CLOSE) goto out; //tcp调用tcp_v4_connect,发送syn err = sk->sk_prot->connect(sk, uaddr, addr_len); if (err < 0) goto out; //发送syn后sock状态从未连接更新为连接中 sock->state = SS_CONNECTING; /* Just entered SS_CONNECTING state; the only * difference is that return value in non-blocking * case is EINPROGRESS, rather than EALREADY. */ err = -EINPROGRESS; break; } //默认情况下未设置非阻塞socket标志,timeo不为0,设置非阻塞,该值为0 timeo = sock_sndtimeo(sk, flags & O_NONBLOCK); //发送syn后等待后续握手完成 /* * 阻塞socket * inet_wait_for_connect 会等待协议栈层的处理 * 1.等待超过timeo,connect返回EINPROGRESS 表明正在处理 * 2.收到信号 * 3.正常完成握手,返回0 * 非阻塞socket * 直接退出connect函数并返回EINPROGRESS,表明协议栈正在处理 */ if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { int writebias = (sk->sk_protocol == IPPROTO_TCP) && tcp_sk(sk)->fastopen_req && tcp_sk(sk)->fastopen_req->data ? 1 : 0; /* Error code is set above */ if (!timeo || !inet_wait_for_connect(sk, timeo, writebias)) goto out; err = sock_intr_errno(timeo); if (signal_pending(current)) goto out; } /* Connection was closed by RST, timeout, ICMP error * or another process disconnected us. */ if (sk->sk_state == TCP_CLOSE) goto sock_error; /* sk->sk_err may be not zero now, if RECVERR was ordered by user * and error was received after socket entered established state. * Hence, it is handled normally after connect() return successfully. */ //TCP握手完成,连接已经建立 sock->state = SS_CONNECTED; err = 0;out: return err;//异常处理,关闭连接sock_error: err = sock_error(sk) ? 
: -ECONNABORTED; sock->state = SS_UNCONNECTED; if (sk->sk_prot->disconnect(sk, flags)) sock->state = SS_DISCONNECTING; goto out;} __inet_stream_connect检查地址长度和协议族 检查sock状态,正常情况下状态为SS_UNCONNECTED sk->sk_prot->connect 即tcp_v4_connect来发送syn 在syn包发完以后会有两种处理情况 情况1:立即返回,针对于非阻塞socket,此时协议栈正在处理握手,connect会返回-EINPROGRESS 情况2:阻塞运行 阻塞时间超时后,connect返回-EINPROGRESS;收到信号,connect返回-ERESTARTSYS,-EINTR inet_wait_for_connect函数分析1234567891011121314151617181920212223242526272829303132333435363738static long inet_wait_for_connect(struct sock *sk, long timeo, int writebias){ //初始化等待队列链表,设置队列唤醒后回调函数autoremove_wake_function DEFINE_WAIT(wait); prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE); sk->sk_write_pending += writebias; /* Basic assumption: if someone sets sk->sk_err, he _must_ * change state of the socket from TCP_SYN_*. * Connect() does not allow to get error notifications * without closing the socket. */ while ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { release_sock(sk);/*等下要睡眠了,释放sk锁*/ timeo = schedule_timeout(timeo); /* * 调用schedule_timeout sleep until timeout * 收到信号后,timeout值返回剩余等待时间 * 超时timeout后,返回0 */ /*进程被唤醒后重新加上sk锁*/ lock_sock(sk); /*进程有待处理信号,或者睡眠超时,退出循环*/ if (signal_pending(current) || !timeo) break; prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE); } /*等待结束后,将进程从等待队列删除,标记为TASK_RUNNING*/ finish_wait(sk_sleep(sk), &wait); sk->sk_write_pending -= writebias; return timeo;} DEFINE_WAIT宏很重要,其设置了唤醒时删除队列成员调用的回调函数autoremove_wake_function 睡眠前进程被设置成TASK_INTERRUPTIBLE状态 SO_SNDTIMEO选项对上述的睡眠非常重要 SO_SNDTIMEO被设置,则睡眠时间会按照设置值 SO_SNDTIMEO没有被设置,则在没有收到信号前一直阻塞 睡眠结束,进程从睡眠队列中删除,并标记为TASK_RUNNING prepare_to_wait实现分析1234567891011void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state){ unsigned long flags; wait->flags &= ~WQ_FLAG_EXCLUSIVE; spin_lock_irqsave(&q->lock, flags); if (list_empty(&wait->task_list)) __add_wait_queue(q, wait); set_current_state(state); spin_unlock_irqrestore(&q->lock, flags);} prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE); 
把wait放入q队列中,设置当前进程状态为TASK_INTERRUPTIBLE TASK_INTERRUPTIBLE 是一种睡眠信号 标记TASK_INTERRUPTIBLE的信号会被唤醒并处理信号 阻塞socket唤醒机制[root@localhost stp]# stap bt.stp sock_def_wakeup WARNING: Missing unwind data for a module, rerun with ‘stap -d e1000’—————-START————————-In process [swapper/2]RIP: ffffffff81558150RSP: ffff88003fd03970 EFLAGS: 00000246RAX: 0000000000004308 RBX: ffff88003a82a6c0 RCX: 0000000000000000RDX: 0000000050000000 RSI: 0000000000ca00c8 RDI: ffff88003a82a6c0RBP: ffff88003fd03988 R08: ffff88003db89708 R09: ffff88003e001800R10: ffffffff815dabca R11: 0000000000000000 R12: ffff88001bfa3700R13: ffff880002db6762 R14: 0000000000000218 R15: ffff880002db675aFS: 0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003bCR2: 00007ffaf3049072 CR3: 000000003b0b7000 CR4: 00000000000406e0 0xffffffff81558150 : sock_def_wakeup+0x0/0x40 [kernel] 0xffffffff815cbc09 : tcp_finish_connect+0xc9/0x120 [kernel] 0xffffffff815cc297 : tcp_rcv_state_process+0x637/0xf20 [kernel] 0xffffffff815d5ffb : tcp_v4_do_rcv+0x17b/0x340 [kernel] 0xffffffff815d76d9 : tcp_v4_rcv+0x799/0x9a0 [kernel] 0xffffffff815b1094 : ip_local_deliver_finish+0xb4/0x1f0 [kernel] 0xffffffff815b1379 : ip_local_deliver+0x59/0xd0 [kernel] 0xffffffff815b0d1a : ip_rcv_finish+0x8a/0x350 [kernel] 0xffffffff815b16a6 : ip_rcv+0x2b6/0x410 [kernel] 0xffffffff815700d2 : netif_receive_skb_core+0x582/0x800 [kernel] 0xffffffff81570368 : netif_receive_skb+0x18/0x60 [kernel] 0xffffffff815703f0 : netif_receive_skb_internal+0x40/0xc0 [kernel] 0xffffffff81571578 : napi_gro_receive+0xd8/0x130 [kernel] 0xffffffffa00472fc [e1000]—————-END————————- 12345678910111213141516171819202122232425262728293031323334353637383940void tcp_finish_connect(struct sock *sk, struct sk_buff *skb){ struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); tcp_set_state(sk, TCP_ESTABLISHED); if (skb != NULL) { icsk->icsk_af_ops->sk_rx_dst_set(sk, skb); 
security_inet_conn_established(sk, skb); } /* Make sure socket is routed, for correct metrics. */ icsk->icsk_af_ops->rebuild_header(sk); tcp_init_metrics(sk); tcp_init_congestion_control(sk); /* Prevent spurious tcp_cwnd_restart() on first data * packet. */ tp->lsndtime = tcp_time_stamp; tcp_init_buffer_space(sk); if (sock_flag(sk, SOCK_KEEPOPEN)) inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp)); if (!tp->rx_opt.snd_wscale) __tcp_fast_path_on(tp, tp->snd_wnd); else tp->pred_flags = 0; if (!sock_flag(sk, SOCK_DEAD)) { /*握手完成唤醒所有进程*/ sk->sk_state_change(sk); sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT); }} sock_def_wakeup ->wake_up_interruptible_all 上述过程发声在三次握手完成后,TCP从syn send或者syn rcv切换到establish状态时候发生 tcp_finish_connect->sk->sk_state_change[sock_def_wakeup] 此次唤醒是全部唤醒sk上等待队列的进程","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"socket","slug":"socket","permalink":"http://vcpu.me/tags/socket/"}]},{"title":"socket()实现源码分析","slug":"socket","date":"2017-06-09T09:22:34.000Z","updated":"2017-06-22T10:13:08.000Z","comments":true,"path":"socket/","link":"","permalink":"http://vcpu.me/socket/","excerpt":"socket()内核版本:3.10.0-514.16.1.el7.x86_64 1234#include <sys/types.h> /* See NOTES */#include <sys/socket.h>int socket(int domain, int type, int protocol);fd=socket(PF_INET,SOCK_STREAM,0","text":"socket()内核版本:3.10.0-514.16.1.el7.x86_64 1234#include <sys/types.h> /* See NOTES */#include <sys/socket.h>int socket(int domain, int type, int protocol);fd=socket(PF_INET,SOCK_STREAM,0 (1).接口说明:按照顺序可传入如下参数: PF_INEAT SOCK_STREAM,SOCK_DGRAM,SOCK_RAW IPPROTO_TCP,IPPROTO_UDP,IPPROTO_IP 返回值说明 EAFNOSUPPORT 不支持地址类型 EMFILE 进程文件表溢出 ENFILE 核心内存不足无法建立新的socket EINVAL 参数domain/type/protocol不合法 EACCES 权限不允许 ENOBUFS/ENOMEM 内存不足 
EPROTONOSUPPORT domain指定的类型不支持参数type或者protocol (2).内核调用栈 (3).结构体说明 struct socket 面向用户态的结构体基于虚拟文件系统创建创建socket时最先创建的结构体 struct sock 网络层socket struct inet_sock INET域socket表示提供INET域的一些属性,TTL、 组播、 地址 、端口 struct raw_socket、struct udp—sock、 struct inet_connection_sock 是对struct inet_sock的扩展struct raw_socket要处理ICMPstruct udp_sock udp协议socketstruct inet_connection_sock面向连接socketstruct tcp_sock TCP协议socket ,对inet_connection_sock扩展,增加了滑动窗口等拥塞控制属性struct inet_timewait_sock网络层超时控制使用struct tcp_timewait_sock TCP协议超时控制使用 (4).struct socket创建源码分析(4.1).sock_alloc函数123456789101112131415161718192021static struct socket *sock_alloc(void){ struct inode *inode; struct socket *sock; inode = new_inode_pseudo(sock_mnt->mnt_sb); if (!inode) return NULL; sock = SOCKET_I(inode); kmemcheck_annotate_bitfield(sock, type); inode->i_ino = get_next_ino(); inode->i_mode = S_IFSOCK | S_IRWXUGO; inode->i_uid = current_fsuid(); inode->i_gid = current_fsgid(); inode->i_op = &sockfs_inode_ops; this_cpu_add(sockets_in_use, 1); return sock;} 一起申请两块内存struct socket和struct inode 两块内存用struct socket_alloc联系起来 inode是linux用来刻画一个存放在内存中的文件的 socket是一种网络文件类型,可以通过文件描述符使用read和write等文件操作函数操作socket 有了inode就支持了虚拟文件系统的操作 (4.2).sock_alloc->new_inode_pseudo->alloc_inode12345678910111213141516171819202122232425262728293031323334struct inode *new_inode_pseudo(struct super_block *sb){ struct inode *inode = alloc_inode(sb); if (inode) { spin_lock(&inode->i_lock); inode->i_state = 0; spin_unlock(&inode->i_lock); INIT_LIST_HEAD(&inode->i_sb_list); } return inode;}static struct inode *alloc_inode(struct super_block *sb){ struct inode *inode; if (sb->s_op->alloc_inode) inode = sb->s_op->alloc_inode(sb); else inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL); if (!inode) return NULL; if (unlikely(inode_init_always(sb, inode))) { if (inode->i_sb->s_op->destroy_inode) inode->i_sb->s_op->destroy_inode(inode); else kmem_cache_free(inode_cachep, inode); return NULL; } return inode;} alloc_inode获取内存有两种方式 1.通过自己alloc_inode分配 2.从高速缓存中分配 
(4.3).alloc_inode -> sock_alloc_inode12345678910111213141516171819202122232425static struct inode *sock_alloc_inode(struct super_block *sb){ struct socket_alloc *ei; struct socket_wq *wq; ei = kmem_cache_alloc(sock_inode_cachep, GFP_KERNEL); if (!ei) return NULL; wq = kmalloc(sizeof(*wq), GFP_KERNEL); if (!wq) { kmem_cache_free(sock_inode_cachep, ei); return NULL; } init_waitqueue_head(&wq->wait); wq->fasync_list = NULL; RCU_INIT_POINTER(ei->socket.wq, wq); ei->socket.state = SS_UNCONNECTED; ei->socket.flags = 0; ei->socket.ops = NULL; ei->socket.sk = NULL; ei->socket.file = NULL; return &ei->vfs_inode;} socket结构体最终会调用上述函数申请内存 该函数会在sock_init中被注册和挂载到系统上 (4.4).sock_init 中sock_allok_inode挂载过程123456789101112131415161718192021222324err = register_filesystem(&sock_fs_type); if (err) goto out_fs; sock_mnt = kern_mount(&sock_fs_type); if (IS_ERR(sock_mnt)) { err = PTR_ERR(sock_mnt); goto out_mount; ... static struct file_system_type sock_fs_type = { .name = \"sockfs\", .mount = sockfs_mount, .kill_sb = kill_anon_super,};static struct dentry *sockfs_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data){ return mount_pseudo(fs_type, \"socket:\", &sockfs_ops, &sockfs_dentry_operations, SOCKFS_MAGIC);}static const struct super_operations sockfs_ops = { .alloc_inode = sock_alloc_inode, .destroy_inode = sock_destroy_inode, .statfs = simple_statfs,}; sock_init -> register mount -> sock_fs_type->sockfs_mount->sockfs_ops->sock_alloc_node (4.5).pf->create 即TCP/IP协议族的创建函数inet_create初始化步骤(4.5.1).PF_INET协议族的create函数inet_create会被组册1234567(void)sock_register(&inet_family_ops);static const struct net_proto_family inet_family_ops = { .family = PF_INET, .create = inet_create, .owner = THIS_MODULE,}; (4.5.2).注册过程123456789101112131415161718192021int sock_register(const struct net_proto_family *ops){ int err; if (ops->family >= NPROTO) { printk(KERN_CRIT \"protocol %d >= NPROTO(%d)\\n\", ops->family, NPROTO); return -ENOBUFS; } spin_lock(&net_family_lock); if 
(rcu_dereference_protected(net_families[ops->family], lockdep_is_held(&net_family_lock))) err = -EEXIST; else { rcu_assign_pointer(net_families[ops->family], ops); err = 0; } spin_unlock(&net_family_lock); printk(KERN_INFO \"NET: Registered protocol family %d\\n\", ops->family); return err;} 协议族选项ops会根基协议族类型PF_INET被放置到net_families系统全局变量中 (4.5.3).__sock_create使用过程1234567891011121314151617181920socket.c/__sock_create...rcu_read_lock(); pf = rcu_dereference(net_families[family]); err = -EAFNOSUPPORT; if (!pf) goto out_release; /* * We will call the ->create function, that possibly is in a loadable * module, so we have to bump that loadable module refcnt first. */ if (!try_module_get(pf->owner)) goto out_release; /* Now protected by module ref count */ rcu_read_unlock(); err = pf->create(net, sock, protocol, kern); if (err < 0) goto out_module_put; 根据socket传输过来的协议族PF_INET查找全局变量net_families获取ops 通过ops->create调用inet_create根据具体协议创建网络层socket struct sock (4.6).inet_create都干了什么?123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140static int inet_create(struct net *net, struct socket *sock, int protocol, int kern){ struct sock *sk; struct inet_protosw *answer; struct inet_sock *inet; struct proto *answer_prot; unsigned char answer_flags; int try_loading_module = 0; int err; if (protocol < 0 || protocol >= IPPROTO_MAX) return -EINVAL; sock->state = SS_UNCONNECTED;//步骤1:设置socket状态SS_UNCONNECTED /* Look for the requested type/protocol pair. */lookup_protocol: err = -ESOCKTNOSUPPORT; rcu_read_lock();//步骤2:根据socket协议找到inet处理函数 connect、bind、accept、listen、等 list_for_each_entry_rcu(answer, &inetsw[sock->type], list) { err = 0; /* Check the non-wild match. 
*/ if (protocol == answer->protocol) { if (protocol != IPPROTO_IP) break; } else { /* Check for the two wild cases. */ if (IPPROTO_IP == protocol) { protocol = answer->protocol; break; } if (IPPROTO_IP == answer->protocol) break; } err = -EPROTONOSUPPORT; } if (unlikely(err)) { if (try_loading_module < 2) { rcu_read_unlock(); /* * Be more specific, e.g. net-pf-2-proto-132-type-1 * (net-pf-PF_INET-proto-IPPROTO_SCTP-type-SOCK_STREAM) */ if (++try_loading_module == 1) request_module(\"net-pf-%d-proto-%d-type-%d\", PF_INET, protocol, sock->type); /* * Fall back to generic, e.g. net-pf-2-proto-132 * (net-pf-PF_INET-proto-IPPROTO_SCTP) */ else request_module(\"net-pf-%d-proto-%d\", PF_INET, protocol); goto lookup_protocol; } else goto out_rcu_unlock; } err = -EPERM; if (sock->type == SOCK_RAW && !kern && !ns_capable(net->user_ns, CAP_NET_RAW)) goto out_rcu_unlock;//step 3: assign the protocol's inet ops set to the socket's ops sock->ops = answer->ops; answer_prot = answer->prot; answer_flags = answer->flags; rcu_read_unlock(); WARN_ON(answer_prot->slab == NULL); err = -ENOBUFS; //step 4: allocate struct sock and assign the protocol ops set to it //sk->sk_prot = sk->sk_prot_creator = protocol ops set; sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot); if (sk == NULL) goto out; err = 0; if (INET_PROTOSW_REUSE & answer_flags) sk->sk_reuse = SK_CAN_REUSE;//step 5: initialize the related inet_sock fields inet = inet_sk(sk); inet->is_icsk = (INET_PROTOSW_ICSK & answer_flags) != 0; inet->nodefrag = 0; if (SOCK_RAW == sock->type) { inet->inet_num = protocol; if (IPPROTO_RAW == protocol) inet->hdrincl = 1; } if (net->sysctl_ip_no_pmtu_disc) inet->pmtudisc = IP_PMTUDISC_DONT; else inet->pmtudisc = IP_PMTUDISC_WANT; inet->inet_id = 0; sock_init_data(sock, sk); sk->sk_destruct = inet_sock_destruct; sk->sk_protocol = protocol; sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv; inet->uc_ttl = -1; inet->mc_loop = 1; inet->mc_ttl = 1; inet->mc_all = 1; inet->mc_index = 0; inet->mc_list = NULL; inet->rcv_tos = 0; sk_refcnt_debug_inc(sk); if (inet->inet_num) { /* It assumes that 
any protocol which allows * the user to assign a number at socket * creation time automatically * shares. */ inet->inet_sport = htons(inet->inet_num); /* Add to protocol hash chains. */ sk->sk_prot->hash(sk); }//step 6: call the protocol-level init function tcp_v4_init_sock() if (sk->sk_prot->init) { err = sk->sk_prot->init(sk); if (err) sk_common_release(sk); }out: return err;out_rcu_unlock: rcu_read_unlock(); goto out;} Set the socket state to SS_UNCONNECTED. Look up the concrete operation sets for the protocol type, e.g. the protocol handler set tcp_prot and the inet-layer handler set inet_stream_ops. socket->ops receives the protocol ops set inet_stream_ops. Allocate the sock and assign tcp_prot to it: sk->sk_prot = sk->sk_prot_creator = tcp_prot. Initialize the newly allocated sock and inet_sock. sk->sk_prot->init(sk) calls into tcp_prot for the deeper TCP-specific initialization. Although the flow mainly does the above, a few questions deserve closer study: a. How is inet_protosw initialized? b. What is the relationship between inet_sock and sock? c. Which structures record and use the prot and ops taken from inet_protosw? (4.6.1).How is inet_protosw initialized? static struct inet_protosw inetsw_array[] ={ { .type = SOCK_STREAM, .protocol = IPPROTO_TCP, .prot = &tcp_prot, .ops = &inet_stream_ops, .flags = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK, }, { .type = SOCK_DGRAM, .protocol = IPPROTO_UDP, .prot = &udp_prot, .ops = &inet_dgram_ops, .flags = INET_PROTOSW_PERMANENT, }, { .type = SOCK_DGRAM, .protocol = IPPROTO_ICMP, .prot = &ping_prot, .ops = &inet_dgram_ops, .flags = INET_PROTOSW_REUSE, }, { .type = SOCK_RAW, .protocol = IPPROTO_IP, /* wild card */ .prot = &raw_prot, .ops = &inet_sockraw_ops, .flags = INET_PROTOSW_REUSE, }};//inet_init for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q) inet_register_protosw(q); //each inet_protosw is put under the global inetsw list void inet_register_protosw(struct inet_protosw *p){ struct list_head *lh; struct inet_protosw *answer; int protocol = p->protocol; struct list_head *last_perm; spin_lock_bh(&inetsw_lock); if (p->type >= SOCK_MAX) goto out_illegal; /* If we are trying to override a permanent protocol, bail. 
*/ answer = NULL; last_perm = &inetsw[p->type]; list_for_each(lh, &inetsw[p->type]) { answer = list_entry(lh, struct inet_protosw, list); /* Check only the non-wild match. */ if (INET_PROTOSW_PERMANENT & answer->flags) { if (protocol == answer->protocol) break; last_perm = lh; } answer = NULL; } if (answer) goto out_permanent; /* Add the new entry after the last permanent entry if any, so that * the new entry does not override a permanent entry when matched with * a wild-card protocol. But it is allowed to override any existing * non-permanent entry. This means that when we remove this entry, the * system automatically returns to the old behavior. */ list_add_rcu(&p->list, last_perm);out: spin_unlock_bh(&inetsw_lock); return;out_permanent: pr_err(\"Attempt to override permanent protocol %d\\n\", protocol); goto out;out_illegal: pr_err(\"Ignoring attempt to register invalid socket type %d\\n\", p->type); goto out;} inet_init puts each inet_protosw into inetsw. inet_protosw matters: it holds the protocol's concrete operation functions such as tcp_close, tcp_v4_connect and tcp_recvmsg. It also contains the inet-layer operation functions such as inet_bind, inet_accept and inet_listen. (4.6.2). 
What is the relationship between inet_sock and sock? struct sock *sk_alloc(struct net *net, int family, gfp_t priority, struct proto *prot){ struct sock *sk; sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family); if (sk) { sk->sk_family = family; /* * See comment in struct sock definition to understand * why we need sk_prot_creator -acme */ sk->sk_prot = sk->sk_prot_creator = prot; sock_lock_init(sk); sock_net_set(sk, get_net(net)); atomic_set(&sk->sk_wmem_alloc, 1); sock_update_classid(sk); sock_update_netprioidx(sk); } return sk;}static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family){ struct sock *sk; struct kmem_cache *slab; slab = prot->slab; if (slab != NULL) { sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO); if (!sk) return sk; if (priority & __GFP_ZERO) { if (prot->clear_sk) prot->clear_sk(sk, prot->obj_size); else sk_prot_clear_nulls(sk, prot->obj_size); } } else sk = kmalloc(prot->obj_size, priority);//the allocation size is prot's obj_size if (sk != NULL) { kmemcheck_annotate_bitfield(sk, flags); if (security_sk_alloc(sk, family, priority)) goto out_free; if (!try_module_get(prot->owner)) goto out_free_sec; sk_tx_queue_clear(sk); } return sk;out_free_sec: security_sk_free(sk);out_free: if (slab != NULL) kmem_cache_free(slab, sk); else kfree(sk); return NULL;} From the above: sk_alloc -> sk_prot_alloc -> obj_size struct proto tcp_prot = { .name = \"TCP\", .owner = THIS_MODULE, .close = tcp_close, .connect = tcp_v4_connect, .disconnect = tcp_disconnect, .accept = inet_csk_accept, .ioctl = tcp_ioctl, .init = tcp_v4_init_sock, .destroy = tcp_v4_destroy_sock, .shutdown = tcp_shutdown, .setsockopt = tcp_setsockopt, .getsockopt = tcp_getsockopt, .recvmsg = tcp_recvmsg, .sendmsg = tcp_sendmsg, .sendpage = tcp_sendpage, .backlog_rcv = tcp_v4_do_rcv, .release_cb = tcp_release_cb, .hash = 
inet_hash, .unhash = inet_unhash, .get_port = inet_csk_get_port, .enter_memory_pressure = tcp_enter_memory_pressure, .stream_memory_free = tcp_stream_memory_free, .sockets_allocated = &tcp_sockets_allocated, .orphan_count = &tcp_orphan_count, .memory_allocated = &tcp_memory_allocated, .memory_pressure = &tcp_memory_pressure, .sysctl_wmem = sysctl_tcp_wmem, .sysctl_rmem = sysctl_tcp_rmem, .max_header = MAX_TCP_HEADER, .obj_size = sizeof(struct tcp_sock), .slab_flags = SLAB_DESTROY_BY_RCU, .twsk_prot = &tcp_timewait_sock_ops, .rsk_prot = &tcp_request_sock_ops, .h.hashinfo = &tcp_hashinfo, .no_autobind = true,#ifdef CONFIG_COMPAT .compat_setsockopt = compat_tcp_setsockopt, .compat_getsockopt = compat_tcp_getsockopt,#endif#ifdef CONFIG_MEMCG_KMEM .init_cgroup = tcp_init_cgroup, .destroy_cgroup = tcp_destroy_cgroup, .proto_cgroup = tcp_proto_cgroup,#endif}; struct tcp_sock contains struct inet_sock, which in turn contains struct sock; the three structures nest one inside another. When the sock is allocated, the allocation size is actually sizeof(struct tcp_sock), which means all three structures are born together. (4.6.3). 
Which structures record and use the prot and ops obtained from inet_protosw? struct socket obtains ops in inet_create: sock->ops = answer->ops; struct sock obtains prot in sk_alloc: sk->sk_prot = sk->sk_prot_creator = prot; (5).socket and the file system. A socket is associated with the file system via sock_map_fd, in the following steps: 1: get an fd via get_unused_fd_flags, which obtains an available fd from the files table managed by the current process; 2: allocate a file via sock_alloc_file and keep the struct socket in the file's private_data: file->private_data = sock; 3: install the file into current->files under that fd. files holds a pointer fdt; fdt->fd is an array of file pointers indexed by fd; rcu_assign_pointer(fdt->fd[fd], file); installs the file into current->files with fd as the array index.","categories":[{"name":"socket","slug":"socket","permalink":"http://vcpu.me/categories/socket/"}],"tags":[{"name":"tcp/ip","slug":"tcp-ip","permalink":"http://vcpu.me/tags/tcp-ip/"},{"name":"kernel3.10.0-514.16.1","slug":"kernel3-10-0-514-16-1","permalink":"http://vcpu.me/tags/kernel3-10-0-514-16-1/"},{"name":"socket","slug":"socket","permalink":"http://vcpu.me/tags/socket/"}]},{"title":"systemtap使用调试记录(一)","slug":"systemtap使用调试记录(一)","date":"2017-06-05T10:15:52.000Z","updated":"2017-06-05T10:15:52.000Z","comments":true,"path":"systemtap使用调试记录(一)/","link":"","permalink":"http://vcpu.me/systemtap使用调试记录(一)/","excerpt":"systemtap usage and debugging notes (1). 1. The debugging environment: Linux 3.10.0-514.16.1.el7.x86_64; kernel-devel-3.10.0-514.16.1.el7.x86_64.rpm, the development headers for the same version; kernel-debuginfo-common-x86_64-3.10.0-514.16.1.el7.x86_64.rpm and kernel-debuginfo-3.10.0-514.16.1.el7.x86_64.rpm, the debuginfo packages for the same version; linux-3.10.0-514.16.1.el7.tar.xz, the source for the same version. Download links: kernel development headers; kernel debuginfo package; kernel debuginfo-common package; how to obtain the kernel source matching the current VM","text":"systemtap usage and debugging notes (1). 1. The debugging environment: Linux 3.10.0-514.16.1.el7.x86_64; kernel-devel-3.10.0-514.16.1.el7.x86_64.rpm, the development headers for the same version; kernel-debuginfo-common-x86_64-3.10.0-514.16.1.el7.x86_64.rpm and kernel-debuginfo-3.10.0-514.16.1.el7.x86_64.rpm, the debuginfo packages for the same version; linux-3.10.0-514.16.1.el7.tar.xz, the source for the same version. Download links: kernel development headers; kernel debuginfo package; kernel debuginfo-common package; how to obtain the kernel source matching the current VM. 2. Installation on CentOS 7: yum install *.rpm installs the 3 rpm packages above (debuginfo, devel, debuginfo-common); then yum install systemtap; stap -ve 'probe begin { log("hello world") exit() }' A successful test looks like: [root@localhost qinlong]# stap -ve 'probe begin { log("hello 
world") exit() }'Pass 1: parsed user script and 120 library scripts using 227352virt/40488res/3260shr/37400data kb, in 260usr/30sys/338real ms.Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 228540virt/41804res/3420shr/38588data kb, in 10usr/0sys/6real ms.Pass 3: translated to C into "/tmp/stap5CqHmN/stap_f7a5084b8a638f5ce64a31271684ef1f_1133_src.c" using 228672virt/42408res/3996shr/38720data kb, in 0usr/0sys/0real ms.Pass 4: compiled C into "stap_f7a5084b8a638f5ce64a31271684ef1f_1133.ko" in 1000usr/330sys/1247real ms.Pass 5: starting run.hello worldPass 5: run completed in 10usr/40sys/362real ms. 3. Common examples 1. Printing a function's call stack [root@localhost stp]# cat bt.stp probe kernel.function(@1){ print("----------------START-------------------------\\n") printf("In process [%s]\\n", execname()) print_regs() print_backtrace() print("----------------END-------------------------\\n") exit() } Print the call stack of a kernel function: [root@localhost stp]# stap bt.stp tcp_sendmsg----------------START-------------------------In process [sshd]RIP: ffffffff815c1ee0RSP: ffff88003d217d28 EFLAGS: 00000202RAX: ffffffff81aa20a0 RBX: ffff88003d217e38 RCX: 0000000000000024RDX: ffff88003d217da8 RSI: ffff88003b3b87c0 RDI: ffff88003d217e38RBP: ffff88003d217d50 R08: 0000000000000000 R09: 0000000000000000R10: ffff88003d217da8 R11: 0000000000000000 R12: ffff88003d217e38R13: 0000000000000001 R14: ffff88003d217e28 R15: ffff8800274d3480FS: 00007f03e5514840(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033CR2: 00007f19c6dc8000 CR3: 0000000035a5c000 CR4: 00000000000406e0 0xffffffff815c1ee0 : tcp_sendmsg+0x0/0xc40 [kernel] 0xffffffff815ed254 : inet_sendmsg+0x64/0xb0 [kernel] 0xffffffff81554e07 : sock_aio_write+0x157/0x180 [kernel] 0xffffffff811fdf3d : do_sync_write+0x8d/0xd0 [kernel] 0xffffffff811fe8a5 : vfs_write+0x1b5/0x1e0 [kernel] 0xffffffff811ff2cf : sys_write+0x7f/0xe0 [kernel] 0xffffffff81697189 : system_call_fastpath+0x16/0x1b [kernel]----------------END------------------------- 
2. Tracing how functions call each other [root@localhost stp]# cat socket-trace.stpprobe kernel.function("*@net/socket.c").call{ printf("%s -> %s\\n",thread_indent(1),ppfunc())}probe kernel.function("*@net/socket.c").return{ printf("%s<-%s\\n",thread_indent(-1),ppfunc())} thread_indent(1) prints the program name (thread id); ppfunc() prints the symbol of the function being executed. kernel.function("*@net/socket.c").call fires its print body whenever a function in net/socket.c is called; kernel.function("*@net/socket.c").return fires after such a function returns. [root@localhost stp]# stap socket-trace.stp 0 dndX11(3295): -> SyS_recvmsg 0 dndX11(3295): -> sys_recvmsg 0 dndX11(3295): -> sockfd_lookup_light 0 dndX11(3295):<-sockfd_lookup_light 1 dndX11(3295): -> _sys_recvmsg 3 dndX11(3295): -> sock_recvmsg 7 dndX11(3295):<-sock_recvmsg 8 dndX11(3295):<-_sys_recvmsg 9 dndX11(3295):<-sys_recvmsg 10 dndX11(3295):<-SyS_recvmsg25274 dndX11(3295): -> SyS_recvmsg25279 dndX11(3295): -> sys_recvmsg25281 dndX11(3295): -> sockfd_lookup_light25284 dndX11(3295):<-sockfd_lookup_light25285 dndX11(3295): -> _sys_recvmsg25288 dndX11(3295): -> sock_recvmsg25291 dndX11(3295):<-sock_recvmsg 3. Printing data at a specific line of a protocol-stack function. The relevant source in /home/qinlong/rpmbuild/SOURCES/linux-3.10.0-514.16.1.el7/net/ipv4/tcp.c: 1065 int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,1066 size_t size)1067 {1068 struct iovec *iov;1069 struct tcp_sock *tp = tcp_sk(sk);1070 struct sk_buff *skb;1071 int iovlen, flags, err, copied = 0;1072 int mss_now = 0, size_goal, copied_syn = 0, offset = 0;1073 bool sg;1074 long timeo;10751076 lock_sock(sk);10771078 flags = msg->msg_flags; [root@localhost ~]# stap -L 'kernel.statement("*@net/ipv4/tcp.c:1078")'kernel.statement("tcp_sendmsg@net/ipv4/tcp.c:1078") $iocb:struct kiocb* $sk:struct sock* $msg:struct msghdr* $size:size_t $copied:int $mss_now:int $size_goal:int $copied_syn:int $offset:int $timeo:long int Running the command above confirms the local variables available at that line: $iocb:struct kiocb* $sk:struct sock* $msg:struct msghdr* $size:size_t $copied:int $mss_now:int $size_goal:int $copied_syn:int $offset:int 
$timeo:long int Using these variables, print the value of size: [root@localhost ~]# stap -e 'probe kernel.statement("*@net/ipv4/tcp.c:1078") {printf("size %d \\n",$size)}'size 36size 44","categories":[{"name":"linux kernel","slug":"linux-kernel","permalink":"http://vcpu.me/categories/linux-kernel/"}],"tags":[{"name":"systemtap","slug":"systemtap","permalink":"http://vcpu.me/tags/systemtap/"}]}]}