Troubleshooting Replication
This topic covers information that can help when troubleshooting replication issues:
- Replication UI elements
- Show Successful Messages
- Reprocess Errors
- Failed or Stopped Initialization, Replication
- Pause/Resume Initialization
- GuaranteedInitMessageDelivery setting
- Pending Tab
- Limiting the Number of Rows Fetched by GET Requests
- Restart PrismMQ after Deleting Replication Records
Replication UI Elements
Element | Description |
---|---|
(image) | Records that have been sent in to this server. Click the link to display a list of the records sent in to this server. |
(image) | Records that have been sent out from this server. Click the link to display a list of the records sent out from this server. |
(image) | Errors. Click the link to display a list of the errors. |
(image) | When several servers are available, you can filter the list. Type the name of the machine. Click the icon to filter for only servers with errors. |
(image) | Shows the total number of links sent to this server and the number of errors. |
(image) | Displays the number of batches included in the replication results. |
Initialization Hierarchy
You can navigate the various levels of the initialization using the buttons at the top of the UI: Server, Store, Resource, SID.
Element | Description |
---|---|
(image) | This will take you to the top level of initialization batches, showing a list of servers. |
(image) | This button displays that server's list of resources sent in the most recent initialization batch for the server. In this case, the server is "HQ". |
(image) | When you drill down to the individual record label, you can use this button to go back to the list of records sent for the selected resource. In this case, the list of vendor records will be displayed. |
(image) | Link for an individual resource record. |
Navigate between batches
When viewing the list of initialization batches for a server, use the arrow buttons to navigate back and forth among the batches. To identify which batch you are viewing, check the date and time the batch launched.
Show Successful Messages
By default, the successful records sent during replication are not preserved. This keeps hard drives from filling up with the files. During troubleshooting, however, you may want to change this setting so that you can drill down and see individual records.
The default value for Show Successful Messages is False, meaning that successful records are not preserved; only errors are preserved. Keeping the default value of False is important for customers with large data sets. If the value is set to True, large quantities of unneeded data (sometimes millions of rows) could be preserved, which has the potential to overwhelm the system memory. When initialization completes successfully with the default setting, the Status screen shows the completed message and nothing more.
For troubleshooting, you can change the setting to preserve success records.
To show success records:
- Edit the PrismMQService.ini file so that the PreserveSuccessRecords setting is set to TRUE.
- Select the Show Successful Messages checkbox on the Prism Dashboard > Initializations screen.
- Initialize the server. Important! You must set PreserveSuccessRecords to TRUE in the .ini file BEFORE initializing. If you edited the .ini file before initialization, then when you select the Show Successful Messages checkbox, a list of successfully replicated resources will be displayed.
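Before initializing, it can help to confirm what the .ini file actually contains. The following is a minimal sketch, not an official tool: it assumes the file can be parsed as a standard INI file, and because this topic does not say which section holds PreserveSuccessRecords, it searches every section. The `[PRISM]` section in the sample text is an assumption for illustration only.

```python
import configparser


def find_setting(ini_text, key):
    """Return (section, value) pairs for every section that defines `key`."""
    # strict=False tolerates duplicate keys; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    # configparser lowercases option names, so the lookup is case-insensitive.
    return [
        (section, parser[section][key])
        for section in parser.sections()
        if key in parser[section]
    ]


def is_enabled(value):
    """Treat any capitalization of 'true' as enabled."""
    return value.strip().lower() == "true"


# Hypothetical file content; the section name is an assumption.
sample = """
[PRISM]
PreserveSuccessRecords=TRUE
"""

for section, value in find_setting(sample, "PreserveSuccessRecords"):
    print(section, value, is_enabled(value))
```

To check a real file, read its text first (for example with `open(path).read()`) and pass it to `find_setting`.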
Reprocess Errors
If any errors occur during initialization, you can correct the problem and then reprocess the specific links that failed. Click a server's link to drill down. Click the Failed column header to sort the list by that column, which brings the errors to the top of the list. Next, go into the resource with the errors. Select the individual elements you want to reprocess and then click the Reprocess Selected button. To see details about the selected resource, click the View button. This screen has many columns, so scroll to the right to display more.
If Replication or Initialization Fails or Is Stopped
If replication services are interrupted (for example, by a reboot or a computer freeze), Day-to-Day replication will resume where it left off. Initialization can take a long time for larger databases and can sometimes fail to complete successfully. If an initialization fails or is stopped, do this:
Create a new Sender profile that starts from the resource after the last COMPLETED resource. For example, if the initialization was in the middle of the Inventory resource when the failure occurred, the new Sender profile should include Inventory and the rest of the resources to the bottom of the list. Run initialization again using the new Sender profile.
It may take a while to process the first resource (the resource that was being initialized when the failure occurred). This is because the program must do a slower UPDATE operation on each of the resource's records that are already in the tables. Once the program finishes the updates and reaches the unprocessed records for the resource, it switches to the much faster INSERT operation. The entire resource in which the failure occurred must be sent again.
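The resource selection for the new Sender profile can be sketched as a simple list slice. This is only an illustration of the rule described above; the resource names and their order are hypothetical, and the profile itself is still created in the product UI.

```python
def resources_for_new_profile(ordered_resources, failed_resource):
    """Return the resource that was in progress at failure and all that follow it."""
    start = ordered_resources.index(failed_resource)  # raises ValueError if absent
    return ordered_resources[start:]


# Hypothetical resource order for illustration only.
resources = ["Vendor", "Customer", "Inventory", "Document"]
print(resources_for_new_profile(resources, "Inventory"))
# prints ['Inventory', 'Document']
```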
Guaranteed Init Message Delivery
This setting, when used in combination with the RESUMEINITONSTARTUP setting, ensures that if initialization fails, no messages are lost and initialization automatically resumes at the point of failure.
Two properties in the PrismMQService.ini file control this feature: INITGUARANTEEDMESSAGEDELIVERY and RESUMEINITONSTARTUP. Both are set to True by default. You can find both settings in the [PRISM] section of the PrismMQService.ini file. INITGUARANTEEDMESSAGEDELIVERY also appears in the [RIL] section of the PrismMQService.ini file.
[PRISM]
INITGUARANTEEDMESSAGEDELIVERY=True
RESUMEINITONSTARTUP=True
[RIL]
INITGUARANTEEDMESSAGEDELIVERY=True
When RESUMEINITONSTARTUP is set to True, a consumer that goes down during an initialization will resume initialization automatically when it is restarted. In tandem with this property, the user must also set INITGUARANTEEDMESSAGEDELIVERY to True on the sending server for whichever initialization type should be guaranteed not to lose messages if RabbitMQ goes down.
Here's a typical use case: Let's say you start an initialization and realize the consumer's default of 20 threads is too low for the power of the machine. You can pause that consumer, change the thread count, and then resume the consumer. Initialization will pick up where it left off with the new thread count. If Guaranteed Message Delivery (GMD) is set to False, some messages may be lost if there is a RabbitMQ failure (not just a PMQ failure). In such a case, even a restart of initialization will likely get stuck and not finish, and the initialization will have to be restarted from the last completed resource. Note that turning on GMD for a system that has both sender and consumer on the same system (i.e., RIL and Prism on the same system) will slow down initialization for that system and for any others that might be included in an initialization batch with it.
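As a quick sanity check, the three entries listed above can be verified with a short script. This is a sketch under the assumption that PrismMQService.ini parses as a standard INI file; the function name `check_gmd_settings` is made up for this example.

```python
import configparser

# Section -> keys that this topic says should be True.
EXPECTED = {
    "PRISM": ["INITGUARANTEEDMESSAGEDELIVERY", "RESUMEINITONSTARTUP"],
    "RIL": ["INITGUARANTEEDMESSAGEDELIVERY"],
}


def check_gmd_settings(ini_text):
    """Return a list of problems; an empty list means every entry is True."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    problems = []
    for section, keys in EXPECTED.items():
        for key in keys:
            # fallback=None covers both a missing section and a missing key.
            value = parser.get(section, key, fallback=None)
            if value is None:
                problems.append(f"[{section}] {key} is missing")
            elif value.strip().lower() != "true":
                problems.append(f"[{section}] {key} is {value!r}, expected True")
    return problems


good = """
[PRISM]
INITGUARANTEEDMESSAGEDELIVERY=True
RESUMEINITONSTARTUP=True
[RIL]
INITGUARANTEEDMESSAGEDELIVERY=True
"""
print(check_gmd_settings(good))  # prints []
```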
Pause/Resume Initialization
You can pause and resume the initialization process. You cannot pause the Sender, but each downstream system that is consuming the resources being sent can be paused and resumed.
- Start initialization. The Server List is displayed.
- Click on a server.
- Click the Pause button. The initialization consumer stops consuming messages. (THIS WILL NOT STOP THE SENDER.) The button caption changes to Resume.
- Click the button again to resume the initialization consumer.
Cancel or Delete Batch
You can cancel or delete a batch. Select the batch and then click the Cancel or Delete button as needed.
If you cancel an initialization batch, all running initializations will be stopped. If you delete an initialization batch, all messages currently in the queue will be lost.
Pending Tab
On the Day to Day tab of both the V9 Dashboard and Prism Dashboard is a pair of tabs: Completed and Pending. By default, the Completed tab is selected, showing information for completed Day-to-Day replication records. The Pending tab, on the other hand, can be useful for verifying the messages that are about to be put on the RabbitMQ bus.
The screen shows summary information (for all connections):
- Total number of new messages pending
- Messages being processed on the RabbitMQ bus (i.e. "in process")
- Messages placed on the RabbitMQ bus (i.e. completed)
Limiting the Number of Rows Fetched by GET Requests
You can set up a row limit for a resource by adding a section to either the PrismBackOffice.ini or the PrismCommon.ini file (depending on which Windows Service the resource is namespaced to).
[GETLIMIT]
# >0: limit number of rows fetched
# 0: no limit (default)
# -1: fetch but log a warning
# -2: do not fetch and log a warning
# -3: error out
customer=100
document=100
Example
Adding this to the PrismBackOffice.ini file will limit the TransferSlip top-level resource.
[GETLIMIT]
# Limits slip GET to a maximum of 100 slips
Transferslip=100
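To make the value semantics concrete, here is a minimal sketch that reads a [GETLIMIT] section and describes each entry according to the comment legend above. It is an illustration only, not how the Prism services read the file; the function names are invented for the example, and values below -3 are treated as invalid because the legend does not define them.

```python
import configparser


def read_get_limits(ini_text):
    """Return {resource_name: limit} from the [GETLIMIT] section, if present."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)  # lines starting with '#' are skipped as comments
    if not parser.has_section("GETLIMIT"):
        return {}
    # Note: configparser lowercases the resource names.
    return {name: int(raw) for name, raw in parser.items("GETLIMIT")}


def describe_limit(value):
    """Translate a GETLIMIT value into the behavior listed in the legend."""
    if value > 0:
        return f"fetch at most {value} rows"
    meanings = {
        0: "no limit (default)",
        -1: "fetch but log a warning",
        -2: "do not fetch and log a warning",
        -3: "error out",
    }
    if value not in meanings:
        raise ValueError(f"undocumented GETLIMIT value: {value}")
    return meanings[value]


sample = """
[GETLIMIT]
# 0: no limit (default)
customer=100
document=0
Transferslip=-2
"""

for resource, limit in read_get_limits(sample).items():
    print(f"{resource}: {describe_limit(limit)}")
```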
Restart PrismMQ after Deleting Replication Records
The key service involved with replication, PrismMQ, caches records aggressively to improve performance. If the record is in the cache, it is assumed that the record exists and PrismMQ doesn't need to confirm its existence in the database. This is usually not a problem; however, it can become a problem if you have deleted replication-related records (e.g. replication_status) from tables while troubleshooting. The record you removed as part of your troubleshooting efforts may still be in the cache. Therefore, you should restart PrismMQ as soon as possible after deleting replication-related records. Restarting PrismMQ will clear the cache.
Common Replication Issues
This table of replication troubleshooting tips comes from the 2022 Retail Pro Prism 2 Workshop.
ISSUE | POTENTIAL ROOT CAUSE | IDENTIFY | SOLUTION |
---|---|---|---|
Missing/stuck data | Initialization process is running/stuck | | |
Missing/stuck data | Error - constraint or potentially DB optimistic lock, if exceeds defined retry count | | Correct issue on document or resend data |
Missing/stuck data | Backlog of data on the POA/RIL producer tables | Identify root cause: Performance, init session in progress, locked queue cycling. | |
Missing/stuck data | Malformed custom JSON file (integrations) | | Contact the developer. |
Missing/stuck data | DB Data file size capacity (OS file size limit) | Seen in DB logs (Oracle alert_rproods.log file) and possibly in PrismMQ logs. | Add additional data file (see RIL TTK) |
Missing/stuck data | RabbitMQ Mnesia DB files corrupt | Logging into the RMQ management console gives an error, even after restarting RabbitMQ service. | rabbit@"hostname"mnesia \msg_stores\vhosts\628WB 79CIFDYO9LJI6DKMI09L\queues |
Process out of memory | Memory limitation: 32bit memory address limits - 1.8+ GB max | Task manager (details - peak or current memory usage), noting memory usage for PMQ processes | Depending on what process. Restart service in most cases. |
Process out of memory | | Memory limit: Identify message size in producer cache/consumer cache table | Reduce number of consuming threads. Don't send the offending data. |
RabbitMQ Lost connections (repeatedly). Possible lost messages or initialization failures. | | | Upgrade to latest Retail Pro Prism 1.14.7.2153 or later. |
Preferences overwritten | Core resources replicated from store to POA | | |
Data stuck in RabbitMQ on the sending side | Firewall | See if you can establish telnet or verify that ports are open. | Correct firewall setup |
Data stuck in RabbitMQ on the sending side | Store server networking issue | Ping or attempt to establish any connection to the store/receiving server | Correct network issue |
Join Enterprise Error - Invalid controller Data | Invalid controller data (restoring or reinstalling a system previously joined) | Likely the controller table has a record of this system with a different SID or same SID and controller ID. Other possible issues could also exist (see KB). | KB: Resolving Invalid Controller Data Error in RP Prism's Enterprise Manager |
Join Enterprise Error - Invalid controller Data | Init of core resources failed/stuck | | Kill the TTK session and clean up the init session if it remains. Then initialize the core resources manually. |
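For the firewall and networking rows in the table above, a TCP connection attempt gives the same information as a telnet test. The sketch below uses only the Python standard library. The host name in the usage comment is hypothetical, and 5672 is the RabbitMQ default AMQP port, which is an assumption here; confirm the port your installation actually uses.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical host; 5672 is RabbitMQ's default AMQP port):
# print(port_is_open("store-server-01", 5672))
```

A False result only shows that the connection could not be made from this machine; it does not say whether the cause is a firewall rule, a stopped service, or a routing problem.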
Retail Pro Prism Replication Tables
*replace rps with rpsods on MySQL
Staging location for D2D data before messages are posted to the producer_cache table with the message (payload). You can get some idea of how and what is pending by querying the table, to see the messages which are being updated and which are waiting to be published to the rps.producer_cache table.
Select * from rps.pub_dataevent_queue;
Producer_cache: Contains the message which each subscriber will get. It notes whether the message is an initialization message and the status of that message (locked or not).
Producer_cache_destination: Contains the relationship of which queues (stores) the related message will be delivered to.
* Status: 0 = normal, 1 = locked
* INIT: 0 = D2D, 1 = Initialization
Select pc.resource_name, pc.status, pc.init, count(*) from RPS.producer_cache pc group by pc.resource_name, pc.status, pc.init;
Select pd.to_server, pc.resource_name, pc.status, pc.init, count(*) from RPS.producer_cache pc, RPS.producer_cache_destination pd where pc.sid = pd.producer_cache_sid group by pd.to_server, pc.resource_name, pc.status, pc.init;
Identify and monitor the init process. The following query gives you a list of init sessions and their state. You may need to change the state value if the init process is stuck. The Connection Manager UI gets the init status and progress details from these tables. The updates for records processed come from the store (replication_status table) via the mgmt queue.
* Status: 0 = canceled, 1 = in progress, 2 = complete
Select ic.sid, ic.created_datetime, c.controller_name, ic.total_processed, ic.total_failed, ic.status from RPS.remote_connection rc, RPS.init_status_connection ic, RPS.controller c where ic.remote_connection_sid = rc.sid and rc.remote_controller_sid = c.sid order by ic.created_datetime desc;
D2D sessions. Used to see and monitor those sessions. Connection Manager UI gets data from here. This will give a general idea of the data here.
Select c.controller_name, rs.session_type, rs.state, rs.init_status, rs.messages_expected, rs.messages_received, rs.messages_sent, rs.messages_failed from RPS.remote_connection rc, RPS.replication_status rs, RPS.controller c where rs.remote_connection_sid = rc.sid and rc.remote_controller_sid = c.sid;
* Session: 0 = V9 init, 2 = POA init, 1 = V9 D2D, 3 = POA D2D
* State: 0 = canceled, 1 = in progress, 2 = complete, 4 = paused
* Init_status: 1 = end of init message processed
consumer_cache contains the messages received from POA and Prism Stores.
* INIT: 0 = D2D, 1 = Initialization
Select resource_name, status, init, count(*) from rps.consumer_cache group by resource_name, status, init;
You can filter and look for messages from a particular connection (from_server) or for a particular message type (resource_name) to determine if the messages are just pending processing. Messages are processed based on the "process_order" value (lowest to highest) and by oldest to newest messages (created_datetime). You can also filter for a specific message (resource_data). If you know the document SID or some unique value for that document, you can filter for that document by using a "like" statement. If there is a lot of data here, this could be slow.
EXAMPLE: Select * from rps.consumer_cache where resource_data like '%123%'; -- where 123 is the sid value
Provides you with a list of the resources and the process order.
EXAMPLE: Select resource_name, process_order from RPS.prism_resource where process_order > 0 order by process_order;
Lists those queues that are locked.
EXAMPLE: Select * from RPS.replication_locked_queue;
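When reading query results from the tables above, the numeric codes are easier to follow once translated. The sketch below copies the code legends from this section into lookup tables; the function names and the sample row are hypothetical, and the row is assumed to be a dictionary keyed by the column names used in the replication_status query.

```python
# Code-to-label maps transcribed from the legends in this section.
PRODUCER_CACHE_STATUS = {0: "normal", 1: "locked"}
INIT_FLAG = {0: "D2D", 1: "Initialization"}
INIT_SESSION_STATUS = {0: "canceled", 1: "in progress", 2: "complete"}
SESSION_TYPE = {0: "V9 init", 1: "V9 D2D", 2: "POA init", 3: "POA D2D"}
SESSION_STATE = {0: "canceled", 1: "in progress", 2: "complete", 4: "paused"}


def label(mapping, code):
    """Look up a code, flagging values the legends do not define."""
    return mapping.get(code, f"unknown ({code})")


def describe_replication_status(row):
    """Summarize one replication_status row (dict keyed by column name)."""
    session = label(SESSION_TYPE, row["session_type"])
    state = label(SESSION_STATE, row["state"])
    return f"{row['controller_name']}: {session}, {state}"


# Hypothetical row for illustration only.
row = {"controller_name": "HQ", "session_type": 3, "state": 4}
print(describe_replication_status(row))
# prints HQ: POA D2D, paused
```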