openstack-issue
Notes on problems encountered in an OpenStack production environment.
1. Horizon shows "router_gateway DOWN"
The Horizon router page shows "router_gateway DOWN": Router -> Interfaces -> External Gateway -> Status: Down, Admin State: Up.
Initial assessment: this is an OpenStack bug; although the port is reported as DOWN, it does not affect actual usage.
Troubleshooting
- 1.1 From the dashboard, note the fixed IP address of the external gateway port shown as "Down", filter it out with the command below, and its status comes back as DOWN:
[root@controller ~(keystone_admin)]# neutron port-list | grep 192.168.2.100
| 857833bb-ae82-40c2-a6db-498b71660f17 | | fa:16:3e:20:b8:b2 | {"subnet_id": "e380ee29-c00f-485b-9ab2-7da3466cb9e1", "ip_address": "192.168.2.100"} |
- 1.2 Show the port itself: its status is DOWN, while the router it belongs to is ACTIVE:
[root@controller ~(keystone_admin)]# neutron port-show 857833bb-ae82-40c2-a6db-498b71660f17 | grep status
| status | DOWN
[root@controller ~(keystone_admin)]# neutron router-show router|grep status
| status | ACTIVE
[root@controller ~(keystone_admin)]# neutron help | grep route
l3-agent-list-hosting-router List L3 agents hosting a router.
l3-agent-router-add Add a router to a L3 agent.
l3-agent-router-remove Remove a router from a L3 agent.
net-gateway-connect Add an internal network interface to a router.
router-create Create a router for a given tenant.
router-delete Delete a given router.
router-gateway-clear Remove an external network gateway from a router.
router-gateway-set Set the external network gateway for a router.
router-interface-add Add an internal network interface to a router.
router-interface-delete Remove an internal network interface from a router.
router-list List routers that belong to a given tenant.
router-list-on-l3-agent List the routers on a L3 agent.
router-port-list List ports that belong to a given tenant, with specified router.
router-show Show information of a given router.
router-update Update router's information.
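Since the gateway port's DOWN state does not affect traffic, one extra sanity check (a sketch, assuming the router is simply named "router" as above) is to confirm that an L3 agent is actually hosting it, using the l3-agent-list-hosting-router sub-command listed in the help output:
[root@controller ~(keystone_admin)]# neutron l3-agent-list-hosting-router router
If the listed agent shows alive and admin_state_up, the DOWN reported for the gateway port can be treated as a cosmetic reporting issue rather than a broken gateway.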
2. Resizing an instance flavor caused the data migration to fail
OpenStack fault: "Authentication required 500"; the dashboard shows the error happening during a resize.
Resizing an instance creates a new instance, migrates the old data to it, and then completes a final confirmation/merge step. Because the physical host was under too much load, the migration failed, and afterwards the backend showed two virtual machines, both in the shut-off state.
After manually starting the instance and checking the VNC console from the dashboard, the instance booted with no problems and is fully usable; the only annoyance is the error status left behind in the dashboard.
[root@openstack-controller ~]# virsh list --all
Id Name State
----------------------------------------------------
136 instance-0000003a running
- instance-0000003c shut off
- instance-00000079 shut off
[root@controller ~]# virsh start instance-0000003c
Domain instance-0000003c started
[root@openstack-controller ~]# virsh list --all
Id Name State
----------------------------------------------------
136 instance-0000003a running
137 instance-0000003c running
- instance-00000079 shut off
Although the cloud instance itself works normally, the dashboard can no longer manage it. The plan is to migrate the data and services off it first, then fix the database state separately so that the instance can be managed again and restored to normal use.
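One way to bring the instance back under Nova's management without touching the data is to reset its state first. A minimal sketch, where <instance-uuid> is a placeholder for the UUID shown by nova list:
[root@controller ~(keystone_admin)]# nova reset-state --active <instance-uuid>
If the dashboard still disagrees with the real libvirt state, the record can be corrected in the nova database instead, in the same spirit as the cinder fix further below (column names assumed to match this Nova release; back up the database first):
MariaDB [nova]> update instances set vm_state='active', task_state=NULL, power_state=1 where uuid='<instance-uuid>';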
3. Instances stuck in the deleting state; the dashboard hangs; front-end and back-end data are inconsistent
Monitoring found that some instances stayed in the deleting state and were never released.
After sourcing keystone_admin to get admin privileges, checking the instance status shows them stuck in deleting:
[root@controller ~(keystone_admin)]# nova list|grep deleting
| c6eb16d4-d607-4c98-b329-9defc4ba7586 | RedHadoop03 | ACTIVE | deleting | Running | private=172.16.1.242 |
| 227a1b8b-93cf-4e55-af73-db0d60e80a69 | RedHadoop04 | ACTIVE | deleting | Running | private=172.16.1.243 |
Solution: first reset-state the instance, then run
nova delete <instance ID>
to delete it.
Reset the instance:
[root@controller nova(keystone_admin)]# nova reset-state --active c6eb16d4-d607-4c98-b329-9defc4ba7586
Reset state for server c6eb16d4-d607-4c98-b329-9defc4ba7586 succeeded; new state is active
Check the instance status:
[root@controller nova(keystone_admin)]# nova show c6eb16d4-d607-4c98-b329-9defc4ba7586
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-000001ea |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2016-07-25T07:51:36.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2016-07-25T07:50:57Z |
| flavor | m1.xlarge (5) |
| hostId | 02a55d5ba7a2ec4a8af2d59c8267befc2378810113c2e6532fe0417a |
| id | c6eb16d4-d607-4c98-b329-9defc4ba7586 |
| image | demo (a304281c-7524-444e-ab39-29a9658b8944) |
| key_name | jin |
| metadata | {} |
| name | RedHadoop03 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| status | ACTIVE |
| tenant_id | d62eac96a5294336b1f508b5757088f6 |
| updated | 2016-07-26T08:55:54Z |
| user_id | b113bbc4934e4a71882206e516f2931e |
+--------------------------------------+----------------------------------------------------------+
Delete the instance:
[root@controller nova(keystone_admin)]# nova delete c6eb16d4-d607-4c98-b329-9defc4ba7586
Request to delete server c6eb16d4-d607-4c98-b329-9defc4ba7586 has been accepted.
You can then use
virsh list --all
to check whether the instance has been removed cleanly, together with
nova list
to check whether any instances are still stuck in the deleting state.
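If several instances are stuck at once, the reset-state and delete steps can be combined in a small loop. This is only a sketch; it assumes the admin credentials are already sourced and that the grep really matches only the instances you intend to remove, so review the nova list output before running it:
for id in $(nova list | grep deleting | awk '{print $2}'); do
  nova reset-state --active "$id"
  nova delete "$id"
done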
4. Cinder volumes in error, error_deleting, and deleting states
The root cause: the machine room lost power. After the restart, one server in the OpenStack cluster came back with its network cable not properly connected, so the host's IP was unreachable. Not yet realizing that this server had never rejoined the cluster, I deleted a cloud volume backed by the GlusterFS cluster, and it promptly got stuck in a deleting state.
Check the volume status with cinder:
[root@openstack-controller ~(keystone_openstack-cloud)]# cinder list
+--------------------------------------+----------------+----------------------------+------+-------------+----------+--------------------------------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+----------------+----------------------------+------+-------------+----------+--------------------------------------+
| 11cb3510-533e-4db4-9999-ff2c7b72695a | error | x86-build-glusterfs-1 | 100 | GlusterFS | true | cbc4f7ab-c167-4478-9982-64909f55e52e |
| 3d07bbcd-c648-4bda-a0ee-96b9877b7a4b | in-use | gitlib-server-glusterfs-1 | 1000 | GlusterFS | false | b232560d-af7a-47f0-80c6-f708b9bda45e |
| 751b6804-ce57-48d1-bbce-c0b9d4d10a97 | error_deleting | rancher-server-glusterfs-1 | 20 | GlusterFS | false | |
+--------------------------------------+----------------+----------------------------+------+-------------+----------+--------------------------------------+
Log in to the database and switch to the cinder schema with "use cinder;".
Then query the volumes table:
MariaDB [cinder]> select deleted,status,deleted_at from volumes where id='751b6804-ce57-48d1-bbce-c0b9d4d10a97';
+---------+----------------+------------+
| deleted | status | deleted_at |
+---------+----------------+------------+
| 0 | error_deleting | NULL |
+---------+----------------+------------+
1 row in set (0.01 sec)
Many things can push a volume into the error or error_deleting state, after which the delete operation can no longer be performed. Broadly there are two cases (a quick check is sketched after this list):
- 1. When cinder delete runs, cinder cannot reach the GlusterFS volume server, so the delete fails.
- 2. When cinder delete runs, cinder cannot reach the database. Since there is no transaction keeping the two sides in sync, the storage backend has already removed the corresponding image, but the state is never written back to the database, leaving the volume in available or deleting. If delete is executed again, cinder can no longer find the backing image, throws an exception, and puts the volume into error_deleting, from which it cannot be recovered even with reset-state.
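Before touching the database it is worth ruling out case 1, i.e. the volume backend host simply being unreachable. A sketch of the check, using the standard cinder CLI: the service-list output shows whether the cinder-volume service for the GlusterFS backend host is up and enabled:
[root@openstack-controller ~(keystone_openstack-cloud)]# cinder service-list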
Try the following commands:
$ cinder reset-state volume_id
$ cinder delete volume_id
If that still fails:
[root@openstack-controller ~(keystone_openstack-cloud)]# cinder delete 751b6804-ce57-48d1-bbce-c0b9d4d10a97
Delete for volume 751b6804-ce57-48d1-bbce-c0b9d4d10a97 failed: Invalid volume: Volume status must be available or error or error_restoring or error_extending and must not be migrating, attached, belong to a consistency group or have snapshots. (HTTP 400) (Request-ID: req-1769e1bc-b0bc-4688-bff7-fc88e8785cb0)
ERROR: Unable to delete any of the specified volumes.
This volume cannot be deleted through the cinder CLI (error_deleting is not one of the statuses allowed in the error above); the only option is to edit the database and mark the volume as deleted (mark the record, rather than removing the row directly):
mysql -uroot -pxxx
MariaDB [cinder]> use cinder;
MariaDB [cinder]> update volumes set deleted=1 where id = "751b6804-ce57-48d1-bbce-c0b9d4d10a97";
MariaDB [cinder]> update volumes set status='deleted' where id = "751b6804-ce57-48d1-bbce-c0b9d4d10a97";
MariaDB [cinder]> update volumes set deleted_at='now()' where id = "751b6804-ce57-48d1-bbce-c0b9d4d10a97";
MariaDB [cinder]> select deleted,status,deleted_at from volumes where id='751b6804-ce57-48d1-bbce-c0b9d4d10a97';
+---------+---------+---------------------+
| deleted | status | deleted_at |
+---------+---------+---------------------+
| 1 | deleted | 0000-00-00 00:00:00 |
+---------+---------+---------------------+
1 row in set (0.00 sec)
As shown above, only the volumes table in the cinder database needs to be changed: set the deleted field to 1 (the integer 1, not a string) and the status field to 'deleted'. The deleted_at field can be left untouched; note that setting it to the quoted string 'now()' as above does not call the function and ends up as the zero date 0000-00-00 00:00:00, so use an unquoted now() if you want the real deletion time recorded.
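A compact equivalent of the three statements above, as a sketch against the same volume ID, with now() unquoted so that deleted_at holds the actual timestamp:
MariaDB [cinder]> update volumes set deleted=1, status='deleted', deleted_at=now() where id='751b6804-ce57-48d1-bbce-c0b9d4d10a97';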
Checking the cinder volume list again, nothing is left in the error_deleting state:
[root@openstack-controller ~(keystone_openstack-cloud)]# cinder list|grep error_deleting|wc -l
0
Finally, remove the corresponding volume file under the GlusterFS directory:
[root@openstack-controller (keystone_openstack-cloud)]# ls /var/lib/cinder/glusterfs/c44e08da624e18a5d328451413fd4c27
volume-11cb3510-533e-4db4-9999-ff2c7b72695a volume-3d07bbcd-c648-4bda-a0ee-96b9877b7a4b volume-751b6804-ce57-48d1-bbce-c0b9d4d10a97
[root@openstack-controller (keystone_openstack-cloud)]# du -s -h /var/lib/cinder/glusterfs/c44e08da624e18a5d328451413fd4c27/volume-751b6804-ce57-48d1-bbce-c0b9d4d10a97
527M volume-751b6804-ce57-48d1-bbce-c0b9d4d10a97
[root@openstack-controller (keystone_openstack-cloud)]# rm -f /var/lib/cinder/glusterfs/c44e08da624e18a5d328451413fd4c27/volume-751b6804-ce57-48d1-bbce-c0b9d4d10a97
Only after all of that is the problem really solved. Complicated...
5. OpenStack cinder volumes in error state
Here is what happened: the machine room needed a power cut to install air conditioning. We shut down all OpenStack instances, then the operating systems. After powering back on, all the cloud volumes were in the error state. This does not affect normal use; it just looks ugly in the dashboard.
# cinder list|grep error
| 11cb3510-533e-4db4-9999-ff2c7b72695a | error | x86-build-glusterfs-1 | 100 | GlusterFS | true | cbc4f7ab-c167-4478-9982-64909f55e52e |
| 3d07bbcd-c648-4bda-a0ee-96b9877b7a4b | error | gitlib-server-glusterfs-1 | 1000 | GlusterFS | false | b232560d-af7a-47f0-80c6-f708b9bda45e |
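Since the volumes are actually healthy, the error status can simply be reset so the dashboard matches reality. A sketch, assuming the installed cinderclient supports the --state flag and that both volumes really are attached and working (use available instead of in-use for a detached volume):
# cinder reset-state --state in-use 11cb3510-533e-4db4-9999-ff2c7b72695a
# cinder reset-state --state in-use 3d07bbcd-c648-4bda-a0ee-96b9877b7a4b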