Updated: April 23, 2024 11:25am

Troubleshooting Replication

This topic collects information that can help when troubleshooting replication issues:

  • Replication UI elements
  • Show Successful Messages
  • Reprocess Errors
  • Failed or Stopped Initialization, Replication
  • Pause/Resume Initialization
  • GuaranteedInitMessageDelivery setting
  • Pending Tab
  • Limiting the Number of Rows Fetched by GET Requests
  • Restart PrismMQ after Deleting Replication Records


Replication UI Elements

Element | Description
to server batches | Records that have been sent in to this server. Click the link to display a list of the records sent in to this server.
from server batches | Records that have been sent out from this server. Click the link to display a list of the records sent out from this server.
error count | Errors. Click the link to display a list of the errors.
filter installation name | When several servers are available, you can filter the list. Type the name of the machine. Click the icon to filter for only servers with errors.
total link count | Shows the total number of links sent to this server and the number of errors.
batches | Displays the number of batches included in the replication results.


Initialization Hierarchy
You can navigate the various levels of the initialization using the buttons at the top of the UI: Server, Store, Resource, SID.

Element | Description
server_list | Takes you to the top level of initialization batches, showing a list of servers.
link to server resource list | Displays that server's list of resources sent in the most recent initialization batch for the server. In this case, the server is "HQ".
vendor resource link | When you drill down to the individual record level, use this button to go back to the list of records sent for the selected resource. In this case, the list of vendor records will be displayed.
resource link | Link for an individual resource record.

Navigate between batches
When viewing the list of initialization batches for a server, use the arrow buttons to navigate back and forth among the batches. To identify which batch you are viewing, check the date and time the batch launched.

Show Successful Messages
By default, the successful records sent during replication are not preserved; only errors are preserved. This helps keep hard drives from filling up with files. During troubleshooting, however, you may want to change this setting so that you can drill down and see individual records.
The default value for Show Successful Messages is False. Keeping the default is important for customers with large data sets: if the value is set to True, large quantities of unneeded data (sometimes millions of rows) could be preserved, which has the potential to overwhelm the system memory. When you run initialization with the default value and everything is successful, you will see only the completed message in the Status screen.
To show success records:

  1. Edit the PrismMQService.ini file so that the PreserveSuccessRecords setting is set to TRUE.
  2. Select the Show Successful Messages checkbox on the Prism Dashboard > Initializations screen.
  3. Initialize the server. Important! You must set PreserveSuccessRecords to TRUE in the .ini file BEFORE initializing. If you did, then selecting the Show Successful Messages checkbox displays a list of successfully replicated resources.
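
Before initializing, it can help to confirm that the .ini edit actually took effect. The following is a minimal Python sketch, not a Retail Pro tool. It assumes PrismMQService.ini can be read by Python's standard configparser module, and because this topic does not say which section holds PreserveSuccessRecords, the function searches every section. The [PRISM] section in the sample text is an assumption used only for illustration.

```python
import configparser


def preserve_success_enabled(ini_text):
    """Return True if any section sets PreserveSuccessRecords to a true value."""
    # interpolation=None keeps '%' characters in other values from raising errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    for section in parser.sections():
        # Option names are matched case-insensitively by configparser.
        if parser.has_option(section, "PreserveSuccessRecords"):
            return parser.getboolean(section, "PreserveSuccessRecords")
    # Setting not present: the documented default is False.
    return False


# Hypothetical file content; the section name is an assumption.
sample = "[PRISM]\nPreserveSuccessRecords=TRUE\n"
print(preserve_success_enabled(sample))
```

To check a real file, read its text first (for example with open(path).read()) and pass the result to the function.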


Reprocess Errors
If any errors occur during initialization, you can correct the problem and then reprocess the specific links that failed. Click a server's link to drill down, then click the Failed column header to sort the list by that column. This brings the errors to the top of the list. Next, go into the resource with the errors. Select the individual elements you want to reprocess and then click the Reprocess Selected button. To see details about the selected resource, click the View button. This screen contains many columns, so scroll to the right to display more of them.

If Replication, Initialization Fails or is Stopped
If replication services are interrupted (for example, by a reboot or a computer freeze), Day-to-Day replication will resume where it left off. Initialization can take a long time for larger databases and, unfortunately, can sometimes fail to complete successfully. If an initialization fails or is stopped, do this:
Create a new Sender profile that starts from the resource after the last COMPLETED resource. For example, if the initialization was in the middle of the Inventory resource when the failure occurred, the new Sender profile should include Inventory and the rest of the resources to the bottom of the list. Run initialization again using the new Sender profile.
The entire resource in which the failure occurred must be sent again, so it may take a while to process that first resource. This is because the program must do a slower UPDATE operation on each of the resource's records that are already in the tables. Once the program finishes the updates and reaches the unprocessed records for the resource, it switches to the much faster INSERT operation.
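
The rule for choosing the resources in the new Sender profile can be sketched in a few lines of Python. This is only an illustration of the selection logic: the resource order shown is hypothetical, and the real order is whatever appears in your Sender profile list.

```python
def resources_to_resend(ordered_resources, interrupted_resource):
    """Return the interrupted resource and every resource after it."""
    # index() raises ValueError if the name is not in the list.
    start = ordered_resources.index(interrupted_resource)
    return ordered_resources[start:]


# Hypothetical resource order, for illustration only.
profile_order = ["Vendor", "Inventory", "Customer", "Document"]
print(resources_to_resend(profile_order, "Inventory"))
```

Everything before the interrupted resource completed successfully, so it is left out of the new profile.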

Guaranteed Init Message Delivery
This setting, when used in combination with the RESUMEINITONSTARTUP setting, ensures that if initialization fails, no messages are lost and initialization automatically resumes at the point of failure.

Three entries in the PrismMQService.ini file relate to this feature. Both INITGUARANTEEDMESSAGEDELIVERY and RESUMEINITONSTARTUP are set to True by default. You can find both settings in the [PRISM] section of the PrismMQService.ini file. INITGUARANTEEDMESSAGEDELIVERY also appears in the [RIL] section of the file.

[PRISM]
INITGUARANTEEDMESSAGEDELIVERY=True
RESUMEINITONSTARTUP=True
[RIL]
INITGUARANTEEDMESSAGEDELIVERY=True

When RESUMEINITONSTARTUP is set to true, a consumer that goes down during an initialization will resume initialization automatically when it is restarted. In tandem with this property, you must also set INITGUARANTEEDMESSAGEDELIVERY to true on the sending server for each initialization type for which you want to guarantee that no messages are lost if RabbitMQ goes down.
Here's a typical use case: Let's say you start an initialization and realize the consumer's 20-thread default is too low for the power of the machine. You can pause that consumer, change the thread count, and then resume the consumer. Initialization will pick up where it left off with the new thread count. If guaranteed message delivery (GMD) is set to False, some messages may be lost if there is a RabbitMQ failure (not just a PMQ failure). In such a case, even a restart of initialization will likely get stuck and not finish, and the initialization will have to be restarted from the last completed resource. Note that turning on GMD for a system that has both sender and consumer on the same system (i.e., RIL and Prism on the same system) will slow down initialization for that system and for any others included in an initialization batch with it.
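
If you want to verify these entries without opening the file by hand, a small script can report any that have been turned off. The sketch below assumes the file parses with Python's standard configparser module. It treats a missing entry as True because that is the default stated above. The function name is made up for this example.

```python
import configparser

# Section and key names are taken from the PrismMQService.ini excerpt above.
EXPECTED_TRUE = {
    "PRISM": ("INITGUARANTEEDMESSAGEDELIVERY", "RESUMEINITONSTARTUP"),
    "RIL": ("INITGUARANTEEDMESSAGEDELIVERY",),
}


def find_disabled_settings(ini_text):
    """List '[SECTION] KEY' entries that are explicitly set to a false value."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    disabled = []
    for section, keys in EXPECTED_TRUE.items():
        for key in keys:
            # fallback=True: a missing section or key counts as the default (True).
            if not parser.getboolean(section, key, fallback=True):
                disabled.append(f"[{section}] {key}")
    return disabled


sample = (
    "[PRISM]\n"
    "INITGUARANTEEDMESSAGEDELIVERY=True\n"
    "RESUMEINITONSTARTUP=True\n"
    "[RIL]\n"
    "INITGUARANTEEDMESSAGEDELIVERY=False\n"
)
print(find_disabled_settings(sample))
```

An empty list means nothing has been explicitly disabled.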

Pause/Resume Initialization
You can pause and resume the initialization process. You cannot pause the Sender, but each downstream system that is consuming the resources being sent can be paused and resumed.

  1. Start initialization. The Server List is displayed.
  2. Click a server.
  3. Click the Pause button. The initialization consumer will stop consuming messages (THIS WILL NOT STOP THE SENDER), and the button caption changes to Resume.
  4. Click the button again and the initialization consumer will resume.

Cancel or Delete Batch
You can cancel or delete a batch. Select the batch and then click the Cancel or Delete button as needed.
If you cancel an initialization batch, all running initializations will be stopped. If you delete an initialization batch, all messages currently in the queue will be lost.

Pending Tab
On the Day to Day tab of both the V9 Dashboard and Prism Dashboard is a pair of tabs: Completed and Pending. By default, the Completed tab is selected, showing information for completed Day-to-Day replication records. The Pending tab, on the other hand, can be useful for verifying the messages that are about to be put on the RabbitMQ bus.
The screen shows summary information (for all connections):

  • Total number of new messages pending
  • Messages being processed on the RabbitMQ bus (i.e. "in process")
  • Messages placed on the RabbitMQ bus (i.e. completed)

Limiting the Number of Rows Fetched by GET Requests
You can set up a row limit for a resource by adding a [GETLIMIT] section to either the PrismBackOffice.ini or the PrismCommon.ini file (depending on which Windows Service the resource is namespaced to).
[GETLIMIT]
# >0: limit number of rows fetched
#  0: no limit (default)
# -1: fetch but log a warning
# -2: do not fetch and log a warning
# -3: error out
customer=100
document=100

Example
Adding this to the PrismBackOffice.ini file will limit the TransferSlip top-level resource.
[GETLIMIT]
# Limits slip GET to max of 100 slips
Transferslip=100
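
The meaning of each [GETLIMIT] value can be summarized with a short Python sketch. This is a reading aid based only on the comments in the section shown above, not the way Prism itself evaluates the setting. The function names are made up for this example, and the sketch assumes the section parses with Python's standard configparser module.

```python
import configparser


def describe_get_limit(value):
    """Translate a [GETLIMIT] value into the behavior described in this topic."""
    if value > 0:
        return f"limit number of rows fetched to {value}"
    meanings = {
        0: "no limit (default)",
        -1: "fetch but log a warning",
        -2: "do not fetch and log a warning",
        -3: "error out",
    }
    if value not in meanings:
        raise ValueError(f"undocumented GETLIMIT value: {value}")
    return meanings[value]


def read_get_limits(ini_text):
    """Return {resource name: description} for every entry under [GETLIMIT]."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section("GETLIMIT"):
        return {}
    # configparser lowercases option names, so 'Transferslip' becomes 'transferslip'.
    return {name: describe_get_limit(int(raw)) for name, raw in parser.items("GETLIMIT")}


sample = "[GETLIMIT]\ncustomer=100\ndocument=-2\n"
print(read_get_limits(sample))
```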

Restart PrismMQ after Deleting Replication Records
The key service involved with replication, PrismMQ, caches records aggressively to improve performance. If the record is in the cache, it is assumed that the record exists and PrismMQ doesn't need to confirm its existence in the database. This is usually not a problem; however, it can become a problem if you have deleted replication-related records (e.g. replication_status) from tables while troubleshooting. The record you removed as part of your troubleshooting efforts may still be in the cache. Therefore, you should restart PrismMQ as soon as possible after deleting replication-related records. Restarting PrismMQ will clear the cache.

Common Replication Issues
The following replication troubleshooting table is from the 2022 Retail Pro Prism 2 Workshop. Each entry lists the issue, the potential root cause, how to identify it, and the solution.

Issue: Missing/stuck data
Potential root cause: Initialization process is running/stuck
Identify:
  • Look at the Connection Manager to identify where the data is (sent? processing?) and verify the producer/consumer cache tables for data.
  • Check the replication status table on the store. See if the init session is in progress, paused, or canceled.
Solution:
  • If data is still moving and processing, you will need to wait. Init takes priority over D2D.
  • If data isn't moving/processing, determine why if possible.
  • Cancel initialization if it is paused or can't resume.

Issue: Missing/stuck data
Potential root cause: Error - constraint or potentially DB optimistic lock, if it exceeds the defined retry count
Identify:
  • Enable log level 3 and resend the document to capture additional details in order to determine what exactly the constraint is.
  • Possible server performance issue.
  • Optimistic lock is usually the same record sent more than once (ex: same customer being resent over and over).
Solution: Correct the issue on the document or resend the data.

Issue: Missing/stuck data
Potential root cause: Backlog of data on the POA/RIL producer tables
Identify: Identify the root cause: performance, init session in progress, locked queue cycling.
Solution:
  • Depends on the root cause.
  • If the locking queue is the issue, then address this. Avoiding this is key.
  • Init session in progress or stuck. See the comments on init priority and issues with stuck queues above.

Issue: Missing/stuck data
Potential root cause: Malformed custom JSON file (integrations)
Identify:
  • Identify in the PrismMQ logs.
    Example: !Error | Data was not readable, likely a serializer error. Cannot report replication status details
Solution: Contact the developer.

Issue: Missing/stuck data
Potential root cause: DB data file size capacity (OS file size limit)
Identify: Seen in DB logs (Oracle alert_rproods.log file) and possibly in PrismMQ logs.
Solution: Add an additional data file (see RIL TTK).

Issue: Missing/stuck data
Potential root cause: RabbitMQ Mnesia DB files corrupt
Identify: Logging into the RMQ management console gives an error, even after restarting the RabbitMQ service.
Solution:
  • Delete the RabbitMQ queues in the queue folder until you find the corrupt queue/queues.
  • C:\ProgramData\RetailPro\Server\RabbitMQ\db\rabbit@"hostname"mnesia\msg_stores\vhosts\628WB79CIFDYO9LJI6DKMI09L\queues

Issue: Process out of memory
Potential root cause: Memory limitation: 32bit memory address limits - 1.8+ GB max
Identify: Task manager (details - peak or current memory usage), noting memory usage for PMQ processes.
Solution: Depends on which process. Restart the service in most cases.

Issue: Process out of memory
Potential root cause:
  • Customer UDF
  • Extremely large document
Identify: Memory limit: Identify message size in the producer cache/consumer cache table.
Solution: Reduce the number of consuming threads. Don't send the offending data.

Issue: RabbitMQ lost connections (repeatedly). Possible lost messages or initialization failures.
Potential root cause:
  • Known issue (fixed in the latest release of Retail Pro Prism 1.14.7)
  • RabbitMQ queue setup: Heartbeat check is out of sync with connection timeout.
Identify:
  • Seen in RMQ logs every couple of seconds. This occurs over and over on all the connected systems.
  • Client unexpectedly closed TCP connection
Solution: Upgrade to Retail Pro Prism 1.14.7.2153 or later.

Issue: Preferences overwritten
Potential root cause: Core resources replicated from store to POA
Identify:
  • New store was published before joining the enterprise.
  • Changes made at the store replicate to the POA.
  • Scheduler will trigger core resources to be sent with some tasks (Update active season).
Solution:
  • The Retail Pro Prism 2.1 release has the ability to turn off core resources (in the PMQ config file).
  • Disable the scheduler task "update active season" (set active = 0) on all store servers (ideally before joining the enterprise).
  • Clean out producer_cache before joining the enterprise.

Issue: Data stuck in RabbitMQ on the sending side
Potential root cause: Firewall
Identify: See if you can establish telnet or verify that ports are open.
Solution: Correct the firewall setup.

Issue: Data stuck in RabbitMQ on the sending side
Potential root cause: Store server networking issue
Identify: Ping or attempt to establish any connection to the store/receiving server.
Solution: Correct the network issue.

Issue: Join Enterprise Error - Invalid controller Data
Potential root cause: Invalid controller data (restoring or reinstalling a system previously joined)
Identify: Likely the controller table has a record of this system with a different SID or the same SID and controller ID. Other possible issues could also exist (see KB).
Solution: KB: Resolving Invalid Controller Data Error in RP Prism's Enterprise Manager

Issue: Join Enterprise Error - Invalid controller Data
Potential root cause: Init of core resources failed/stuck
Identify:
  • Join fails or gets stuck on the last step, where it is initializing the core resources.
  • Check the replication_status table and producer/consumer cache tables to determine if data is really stuck or has completed and is just missing the end of init message.
Solution: Kill the TTK session and clean up the init session if it remains. Then initialize the core resources manually.

Retail Pro Prism Replication Tables

* Replace rps with rpsods on MySQL.

rps.pub_dataevent_queue

Staging location for D2D data before messages are posted to the producer_cache table with the message (payload).

You can get an idea of what is pending by querying the table to see which messages are being updated and which are waiting to be published to the rps.producer_cache table.

Select * from rps.pub_dataevent_queue;

rps.producer_cache

Producer_cache: 

Contains the message which each subscriber will get.

It notes whether the message is an initialization and the status of that message (locked or not).

Producer_cache_destination:

Contains the relationship of which queues (stores) the related message will be delivered to.

* Status: 0 = normal, 1 = locked | INIT: 0 = D2D, 1 = Initialization

Select pc.resource_name, pc.status, pc.init, count(*)
from RPS.producer_cache pc
group by pc.resource_name, pc.status, pc.init;

Select pd.to_server, pc.resource_name, pc.status, pc.init, count(*)
from RPS.producer_cache pc, RPS.producer_cache_destination pd
where pc.sid = pd.producer_cache_sid
group by pd.to_server, pc.resource_name, pc.status, pc.init;

rps.initialization_status_header

Identify and monitor init process.

The following will give you a list of init sessions and their state.

You may need to change the state value if the init process is stuck.

The Connection Manager UI gets the init status and progress details from these tables.

Those updates for records processed come from the store (replication_status table) via mgmt queue.

* Status: 0 = canceled, 1 = in progress, 2 = complete

Select ic.sid, ic.created_datetime, c.controller_name, ic.total_processed, ic.total_failed, ic.status
from RPS.remote_connection rc, RPS.init_status_connection ic, RPS.controller c
where ic.remote_connection_sid = rc.sid and rc.remote_controller_sid = c.sid
order by ic.created_datetime desc;

rps.replication_status

D2D sessions. 

Used to see and monitor those sessions. Connection Manager UI gets data from here.

This will give a general idea of the data here.

Select c.controller_name, rs.session_type, rs.state, rs.init_status, rs.messages_expected,
rs.messages_received, rs.messages_sent, rs.messages_failed
from RPS.remote_connection rc, RPS.replication_status rs, RPS.controller c
where rs.remote_connection_sid = rc.sid and rc.remote_controller_sid = c.sid;

Session: (0 = V9 init, 2 = POA init, 1 = V9 D2D, 3 = POA D2D)

State: (0 = canceled, 1 = in progress, 2 = complete, 4 = paused)

Init_status: (1 = end of init message processed)
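
When reading many rows from this query, it can be easier to translate the numeric codes into text. The following Python sketch uses only the code values listed above. The function name is made up for this example, and any value not listed above is reported as unknown rather than guessed.

```python
# Code values as listed for rps.replication_status in this topic.
SESSION_TYPES = {0: "V9 init", 1: "V9 D2D", 2: "POA init", 3: "POA D2D"}
STATES = {0: "canceled", 1: "in progress", 2: "complete", 4: "paused"}


def describe_replication_row(session_type, state, init_status):
    """Build a one-line description from the three coded columns."""
    session = SESSION_TYPES.get(session_type, f"unknown session type {session_type}")
    state_text = STATES.get(state, f"unknown state {state}")
    if init_status == 1:
        end_of_init = "end of init message processed"
    else:
        # Only the value 1 is documented in this topic.
        end_of_init = "end of init message not recorded as processed"
    return f"{session}: {state_text}; {end_of_init}"


print(describe_replication_row(2, 1, 1))
print(describe_replication_row(3, 4, 0))
```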

rps.consumer_cache

consumer_cache contains the message received from POA and Prism Stores.

* INIT: 0 = D2D, 1 = Initialization

Select resource_name, status, init, count(*) from rps.consumer_cache
group by resource_name, status, init;

You can filter and look for messages from a particular connection (from_server) or for a particular message type (resource_name) to determine if the messages are just pending processing.

Messages are processed based on the "process_order" value (lowest to highest) and by oldest to newest messages (created_datetime).
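
The processing order described above amounts to a two-level sort. The Python sketch below illustrates it on made-up rows. It assumes each message carries both a process_order value and a created_datetime value; in the actual schema the process order may need to be looked up from rps.prism_resource by resource name.

```python
from datetime import datetime


def processing_order(messages):
    """Sort by process_order (lowest first), then by created_datetime (oldest first)."""
    return sorted(messages, key=lambda m: (m["process_order"], m["created_datetime"]))


# Hypothetical rows, for illustration only.
pending = [
    {"resource_name": "document", "process_order": 50, "created_datetime": datetime(2024, 4, 1, 9, 0)},
    {"resource_name": "customer", "process_order": 10, "created_datetime": datetime(2024, 4, 1, 9, 5)},
    {"resource_name": "customer", "process_order": 10, "created_datetime": datetime(2024, 4, 1, 8, 0)},
]
for message in processing_order(pending):
    print(message["resource_name"], message["created_datetime"])
```

A message that appears stuck may simply be waiting behind messages with a lower process order or an earlier creation time.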

You can also filter for a specific message (resource_data). If you know what the document SID is or some unique value for that Document you could filter for that document by using a "like" statement.

If there is a lot of data here, this could be slow.

EXAMPLE: Select * from rps.consumer_cache where resource_data like '%123%'; -- where 123 is the sid value

rps.prism_resource

Provides you with a list of the resources and the process order.

EXAMPLE: Select resource_name, process_order from RPS.prism_resource
where process_order > 0
order by process_order;

rps.replication_locked_queue

Lists those queues that are locked.

EXAMPLE: Select * from RPS.replication_locked_queue;