Trafodian installation keeps on doing SQ registry key SQ_TXNSVC_READY is not set

Asked by Mohan

I have completed all the steps till 11 successfully. But step 12 is taking very long time morethan 1Hour,

Step number 12.

Run the trafodion_installer script to complete the Trafodion installation by providing the following parameters. It is recommended for a small demo cluster to use two DCS servers per node. A DCS server is needed for each client connection to the Trafodion cluster. In the example below, the --dcs_servers 12 would allow for a maximum of 12 client connections. NOTE: To configure and use the Database Connectivity Services (DCS) in the Trafodion system, see the Trafodion Database Connectivity Services Reference Guide on the Documentation page.

***********************************************************
 Updating Authentication Configuration
***********************************************************
Creating folders for storing certificates

Copying the log4j and log4cpp configuration files
***INFO: Start of DCS install
***INFO: untarring build file /home/trafodion/dcs_build.tgz
***INFO: modifying /home/trafodion/trafodion/dcs-0.8.3/conf/dcs-env.sh
***INFO: modifying /home/trafodion/trafodion/dcs-0.8.3/conf/dcs-site.xml
***INFO: creating /home/trafodion/trafodion/dcs-0.8.3/conf/servers file
***INFO: End of DCS install.
***INFO: starting Trafodion instance
Checking orphan processes.
Removing old mpijob* files from /home/trafodion/trafodion/tmp

Removing old monitor.port* files from /home/trafodion/trafodion/tmp

Executing sqipcrm (output to sqipcrm.out)
Starting the SQ Environment (Executing /home/trafodion/trafodion/sql/scripts/gomon.cold)
Background SQ Startup job (pid: 12465)

# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.

Please let me what should I do here

Question information

Language:
English Edit question
Status:
Solved
For:
Trafodion Edit question
Assignee:
No assignee Edit question
Solved by:
Mohan
Solved:
Last query:
Last reply:
Revision history for this message
Chirag Bhalgami (chirag-bhalgami) said :
#1

You can abort the running script.

Before you follow these steps, make sure your HBase environment is up and hbase shell can come up. You can execute, "hbase shell to verify it. Once HBase is up and running, follow these steps:
  sqstop
  ckillall (and type y on the prompt)
  start a new shell (or "ssh localhost")
  source in sqenv.sh
  rm $MY_SQROOT/etc/ms.env
  rm $MY_SQROOT/conf/hbase-site.xml
  sqgen
  start a new shell (or "ssh localhost")
  source in sqenv.sh
  sqstart

This should resolve the issue.

If sqstart still goes in infinite loop then it can happen because of several other reasons. You can verify below items:
1) Make sure all nodes are up by executing "sqcheck" from trafodion user

2) hbase-transaction jar file for Trafodion is not installed properly (make sure $MY_SQROOT/export/lib/hbase-* and /usr/share/cmf/lib/plugins/hbase-* are same)

3) Make sure if zookeeper quorum is set properly for DCS environment. The property for zookeeper quorum (dcs.zookeeper.quorum) should exist in <dcs-install-dir>/conf/dcs-site.xml file:
-----
  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>Specify the host names comma separated</value> (e.g. <value>zknode1,zknode2</value>)
  </property>
-----

4) Look at possible log files under $MY_SQROOT/logs

5) Analyze the core files on the system

Revision history for this message
Mohan (mohantnr) said :
#2

This is the error I get in a logfile

This is the error I observe in log file.

 cd $MY_SQROOT/sql/scripts/trafodion.dtm.log

2014-08-13 09:16:32,194 INFO ipc.HBaseRPC: Problem connecting to server: cdhvm.example.com/10.0.2.15:60020
2014-08-13 09:16:32,194 DEBUG client.HConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=-ROOT-, metaLocation={region=-ROOT-,,0.70236052, hostname=cdhvm.example.com, port=60020}, attempt=8 of 10 failed; retrying after sleep of 16131 because: Call to cdhvm.example.com/10.0.2.15:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:33268 remote=cdhvm.example.com/10.0.2.15:60020]
2014-08-13 09:16:32,197 DEBUG zookeeper.ZKUtil: hconnection-0x147ce0bb0c500b0 Retrieved 37 byte(s) of data from znode /hbase/root-region-server and set watcher; cdhvm.example.com,60020,14079...
2014-08-13 09:16:32,197 DEBUG client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@5be5ccd4; serverName=cdhvm.example.com,60020,1407911227023

3) Make sure if zookeeper quorum is set properly for DCS environment. The property for zookeeper quorum (dcs.zookeeper.quorum) should exist in <dcs-install-dir>/conf/dcs-site.xml file:

Content of the file ./trafodion/dcs-0.8.3/conf/dcs-site.xml.

  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>22241</value>
  </property>

What should I do now..

Revision history for this message
Chirag Bhalgami (chirag-bhalgami) said :
#3

I assume you have followed the steps I mentioned at the beginning of my response (sqstop, ckillall, etc.) and you are able to query from "hbase shell". If you are getting these messages AFTER following above steps then your HBase environment seems to be unstable. Do you see any error message in hbase-regionserver log?

Make sure you are looking at the logs (both hbase-regionserver and trafodion-dtm logs) from the relevant timestamp.

Note: Trafodion currently supports HBase 0.94 and it will soon support 0.98. Which HBase version are you using?

Revision history for this message
Chirag Bhalgami (chirag-bhalgami) said :
#4

Also, the value of zookeeper quorum in dcs-site.xml should contain hostname/s of your zookeeper node/s.

Example:

  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>zknode1,zknode2</value>
  </property>

Revision history for this message
Mohan (mohantnr) said :
#5

Please find hbase shell output.

[root@cdhvm hbase]# hbase shell
14/08/13 09:05:12 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.6-cdh4.5.0, rUnknown, Wed Nov 20 15:46:19 PST 2013

hbase(main):001:0> scan 'TRAFODION._DTM_.TLOG1_LOG_0'
ROW COLUMN+CELL
0 row(s) in 0.1370 seconds

hbase(main):002:0>

./dcs-0.8.3/conf/dcs-site.xml Contains below entry -- I have added entry upon your request.

<configuration>
  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>22241</value>
  </property>
  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>cdhvm.example.com</value>
  </property>
</configuration>

I have followed your steps as you mentioned . But still through same error

Removing old monitor.port* files from /home/trafodion/trafodion/tmp

Executing sqipcrm (output to sqipcrm.out)
Starting the SQ Environment (Executing /home/trafodion/trafodion/sql/scripts/gomon.cold)
Background SQ Startup job (pid: 21614)

# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set
# of Transactions being recovered: 0 Checking.....
SQ registry key SQ_TXNSVC_READY is not set

Hbase version

Version 0.94.6-cdh4.5.0,

Revision history for this message
Mohan (mohantnr) said :
#6

Why do I get the below error when I start "sqstart" in log file cd $MY_SQROOT/sql/scripts/trafodion.dtm.log

2014-08-13 19:03:04,538 DEBUG client.HConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=-ROOT-, metaLocation={region=-ROOT-,,0.70236052, hostname=cdhvm.example.com, port=60020}, attempt=5 of 10 failed; retrying after sleep of 4003 because: Call to cdhvm.example.com/10.0.2.15:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:43674 remote=cdhvm.example.com/10.0.2.15:60020]

Revision history for this message
Mohan (mohantnr) said :
#7

iled on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:44208 remote=cdhvm.example.com/10.0.2.15:60020]
2014-08-13 19:07:40,259 DEBUG dtm.TmAuditTlog: Creating new Tlog table TRAFODION._DTM_.TLOG0_LOG_0
2014-08-13 19:07:40,259 DEBUG dtm.TmAuditTlog: Creating the table TRAFODION._DTM_.TLOG0_LOG_0

2014-08-13 19:08:37,246 WARN client.HBaseAdmin: Creating TRAFODION._DTM_.TLOG1_LOG_0 took too long
java.net.SocketTimeoutException: Call to cdhvm.example.com/10.0.2.15:60000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:35117 remote=cdhvm.example.com/10.0.2.15:60000]
        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1035)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1008)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
        at com.sun.proxy.$Proxy7.createTable(Unknown Source)
        at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:526)
        at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:412)
        at org.apache.hadoop.hbase.client.HBaseAdmin.createTable(HBaseAdmin.java:348)
        at org.trafodion.dtm.TmAuditTlog.<init>(TmAuditTlog.java:364)
        at org.trafodion.dtm.HBaseTxClient.init(HBaseTxClient.java:225)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:35117 remote=cdhvm.example.com/10.0.2.15:60000]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:373)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:652)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:586)
2014-08-13 19:08:37,288 DEBUG client.MetaScanner: Scanning .META. starting at row=TRAFODION._DTM_.TLOG1_LOG_0,,00000000000000 for max=2147483647 rows using org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@350004ad
2014-08-13 19:08:37,319 DEBUG zookeeper.ZKUtil: hconnection-0x147d0b045e6000f Retrieved 37 byte(s) of data from znode /hbase/root-region-server and set watcher; cdhvm.example.com,60020,14079...
2014-08-13 19:08:37,319 DEBUG client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@350004ad; serverName=cdhvm.example.com,60020,1407956134285

Revision history for this message
Chirag Bhalgami (chirag-bhalgami) said :
#8

From your Cloudera Manager, can you verify the value of RPC Timeout (hbase.rpc.timeout)? Seems like it is set to the default value of 1 minute (60000 mili-seconds). On your network, it is taking more time for channel to be ready for read (most probably caused by low network latency). Please increase this value, restart HBase, and perform following steps:

  sqstop
  ckillall (and type y on the prompt)
  start a new shell (or "ssh localhost")
  source in sqenv.sh
  rm $MY_SQROOT/etc/ms.env
  rm $MY_SQROOT/conf/hbase-site.xml
  sqgen
  start a new shell (or "ssh localhost")
  source in sqenv.sh
  sqstart

Revision history for this message
Mohan (mohantnr) said :
#9

I have increased to 120000 millis timeout and did the same steps ... but its not working

But this file is not present. Please can you give me contents of this file rm $MY_SQROOT/conf/hbase-site.xml

Revision history for this message
Chirag Bhalgami (chirag-bhalgami) said :
#10

You have to remove $MY_SQROOT/conf/hbase-site.xml file in case if it exist. You are already using Cloudera so hbase-site.xml file will not exist (you can do all the HBase related configuration through Cloduera Manager). You can just ignore the step of deleting hbase-site.xml file. Can you try increasing the value of hbase.rpc.timeout to 5 minutes and retry. Let us know if you still have issues after increasing this value.

Revision history for this message
Marvin Anderson (marvin-anderson) said :
#11

Please return the timeout value to the default setting of 60000 millis. Also, the file $MY_SQROOT/conf/hbase-site.xml should not be present on your system (this is normal), the previous instructions to remove it were incorrect. These are not a solution for this problem. Please send us the trafodion.dtm.log file, I've contacted you directly with this request.

Revision history for this message
Mohan (mohantnr) said :
#12

Hi,Kindly fine the attached /home/trafodion/trafodion/sql/scripts/trafodion.dtm.log file

> To: <email address hidden>
> From: <email address hidden>
> Subject: Re: [Question #253027]: Trafodian installation keeps on doing SQ registry key SQ_TXNSVC_READY is not set
> Date: Wed, 13 Aug 2014 20:31:49 +0000
>
> Your question #253027 on Trafodion changed:
> https://answers.launchpad.net/trafodion/+question/253027
>
> Marvin Anderson posted a new comment:
> Please return the timeout value to the default setting of 60000 millis.
> Also, the file $MY_SQROOT/conf/hbase-site.xml should not be present on
> your system (this is normal), the previous instructions to remove it
> were incorrect. These are not a solution for this problem. Please send
> us the trafodion.dtm.log file, I've contacted you directly with this
> request.
>
> --
> You received this question notification because you asked the question.

Revision history for this message
Marvin Anderson (marvin-anderson) said :
#13

This problem was resolved by a conference call with Mohan. User now has Trafodion up and running.

Revision history for this message
Mohan (mohantnr) said :
#14

Great Help by Marvin and Team

Thanks

Mohan

Revision history for this message
li hao (lihao49) said :
#15

i met the same problem, can you show me how to solve this problem? part of trafodion.dmt.log is as following:

2014-09-02 05:31:34,891 INFO ipc.HBaseRPC: Problem connecting to server: cloudera/192.168.0.83:60020
2014-09-02 05:31:34,892 DEBUG client.HConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=-ROOT-, metaLocation={region=-ROOT-,,0.70236052, hostname=cloudera, port=60020}, attempt=4 of 10 failed; retrying after sleep of 2019 because: Call to cloudera/192.168.0.83:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.0.83:60639 remote=cloudera/192.168.0.83:60020]
2014-09-02 05:31:34,894 DEBUG zookeeper.ZKUtil: hconnection-0x14833e84b240207 Retrieved 28 byte(s) of data from znode /hbase/root-region-server and set watcher; cloudera,60020,1409633907003
2014-09-02 05:31:34,895 DEBUG client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@21432a60; serverName=cloudera,60020,1409633907003
2014-09-02 05:31:36,879 DEBUG zookeeper.ZKUtil: hconnection-0x14833e84b240209 Retrieved 28 byte(s) of data from znode /hbase/root-region-server and set watcher; cloudera,60020,1409633907003
2014-09-02 05:31:36,879 DEBUG client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@149873b7; serverName=cloudera,60020,1409633907003
2014-09-02 05:31:36,900 DEBUG zookeeper.ZKUtil: hconnection-0x14833e84b240207 Retrieved 28 byte(s) of data from znode /hbase/root-region-server and set watcher; cloudera,60020,1409633907003
2014-09-02 05:31:36,900 DEBUG client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@21432a60; serverName=cloudera,60020,1409633907003
2014-09-02 05:32:36,965 INFO ipc.HBaseRPC: Problem connecting to server: cloudera/192.168.0.83:60020
2014-09-02 05:32:36,965 DEBUG client.HConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=-ROOT-, metaLocation={region=-ROOT-,,0.70236052, hostname=cloudera, port=60020}, attempt=5 of 10 failed; retrying after sleep of 4002 because: Call to cloudera/192.168.0.83:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.0.83:60723 remote=cloudera/192.168.0.83:60020]
2014-09-02 05:32:36,968 DEBUG zookeeper.ZKUtil: hconnection-0x14833e84b240209 Retrieved 28 byte(s) of data from znode /hbase/root-region-server and set watcher; cloudera,60020,1409633907003
2014-09-02 05:32:36,969 DEBUG client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@149873b7; serverName=cloudera,60020,1409633907003

it come into a loop of retrying, and my .META. table becomes unreachable after short time. please help me check where the problem is ,thanks a lot!