Documentation | My Retail Pro

Updated: April 23, 2024 11:25am

Troubleshooting Replication

This topic covers the following areas, which can help when troubleshooting replication issues:

Replication UI elements
Show Successful Messages
Reprocess Errors
Failed or Stopped Initialization, Replication
Pause/Resume Initialization
GuaranteedInitMessageDelivery setting
Pending Tab
Limiting the Number of Rows Fetched by GET Requests
Restart PrismMQ after Deleting Replication Records

Replication UI Elements

Element	Description
	Records that have been sent in to this server. Click the link to display a list of the records sent in to this server.
	Records that have been sent out from this server. Click the link to display a list of the records sent out from this server.
	Errors. Click the link to display a list of the errors.
	When several servers are available, you can filter the list. Type the name of the machine. Click the icon to filter for only servers with errors.
	Shows the total number of links sent to this server and the number of errors.
	Displays the number of batches included in the replication results.

Initialization Hierarchy
You can navigate the various levels of the initialization using the buttons at the top of the UI: Server, Store, Resource, SID.

Element	Description
	This will take you to the top level of initialization batches, showing a list of servers.
	This button displays that server's list of resources sent in the most recent initialization batch for the server. In this case, the server is "HQ".
	When you drill down to the individual record label, you can use this button to go back to the list of records sent for the selected resource. In this case, the list of vendor records will be displayed.
	Link for an individual resource record.

Navigate between batches
When viewing the list of initialization batches for a server, use the arrow buttons to navigate back and forth among the batches. To identify which batch you are viewing, check the date and time the batch launched.

Show Successful Messages
By default, successful records sent during replication are not preserved, which keeps hard drives from filling up with files. During troubleshooting, however, you may want to change this setting so that you can drill down and see individual records.
The default value for Show Successful Messages is False, meaning that only errors are preserved. Keeping the default is important for customers with large data sets: if the value is set to True, large quantities of unneeded data (sometimes millions of rows) could be preserved, which has the potential to overwhelm system memory. When you run initialization with the property set to False (the default) and everything succeeds, the Status screen shows only the completed message.
To preserve and show success records for troubleshooting:

Edit the PrismMQService.ini file so that the PreserveSuccessRecords setting is set to TRUE.
Select the Show Successful Messages checkbox on the Prism Dashboard > Initializations screen.
Initialize the server. Important! You must set PreserveSuccessRecords to TRUE in the .ini file BEFORE initializing. If you edited the .ini file before initialization, then selecting the Show Successful Messages checkbox displays a list of successfully replicated resources.
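Before initializing, it can help to confirm that the setting was actually saved. The following is a minimal Python sketch, assuming PrismMQService.ini parses as a standard INI file. The helper names `find_setting` and `is_enabled` are illustrative and not part of Retail Pro Prism. Because this topic does not say which section holds PreserveSuccessRecords, the sketch searches every section, and the section name in the sample text is made up.

```python
import configparser


def find_setting(ini_text, key):
    """Return {section: value} for every section that defines `key`.

    Hypothetical helper: key matching is case-insensitive, and the INI
    layout is an assumption, not a documented Prism format.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    found = {}
    for section in parser.sections():
        if parser.has_option(section, key):
            found[section] = parser.get(section, key)
    return found


def is_enabled(value):
    """Treat TRUE/True/true (ignoring surrounding spaces) as enabled."""
    return value.strip().lower() == "true"


# Made-up file contents; the [PRISM] section name here is an assumption.
sample = "[PRISM]\nPreserveSuccessRecords=TRUE\n"
settings = find_setting(sample, "PreserveSuccessRecords")
print({section: is_enabled(value) for section, value in settings.items()})
```

To check a real file, read its contents into a string first and pass that string to `find_setting`.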


Reprocess Errors
If any errors occur during initialization, you can correct the problem and then reprocess the specific links that failed. Click a server's link to drill down, then click the Failed column header to sort the list by that column, which brings the errors to the top of the list. Next, go into the resource with the errors. Select the individual elements you want to reprocess and then click the Reprocess Selected button. To see details about the selected resource, click the View button. This screen contains many columns, so scroll to the right to display more.

If Replication, Initialization Fails or is Stopped
If replication services are interrupted (for example, by a reboot or a computer freeze), Day-to-Day replication will resume where it left off. Initialization can take a long time for larger databases and, unfortunately, can sometimes fail to complete successfully. If an initialization fails or is stopped, do this:
Create a new Sender profile that starts from the resource after the last COMPLETED resource. For example, if the initialization was in the middle of the Inventory resource when the failure occurred, the new Sender profile should include Inventory and the rest of the resources to the bottom of the list. Run initialization again using the new Sender profile.
It may take a while to process the first resource (the resource that was being initialized when the failure occurred). This is because the program must do a slower UPDATE operation on each of the resource's records that are already in the tables. Once the program finishes the updates and reaches the unprocessed records for the resource, it switches to the much faster INSERT operation. The entire resource in which the failure occurred must be sent again.
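The rule for choosing which resources go into the new Sender profile can be sketched in a few lines of Python. This is only an illustration: the function name `resources_to_resend` is hypothetical, and the resource names and their order are placeholders rather than the real Prism resource list.

```python
def resources_to_resend(ordered_resources, failed_resource):
    """Return the resource that was in progress at failure and all after it."""
    try:
        start = ordered_resources.index(failed_resource)
    except ValueError:
        raise ValueError(f"{failed_resource!r} is not in the resource list") from None
    return ordered_resources[start:]


# Placeholder names and order, for illustration only.
profile_order = ["Customer", "Vendor", "Inventory", "Document", "TransferSlip"]
print(resources_to_resend(profile_order, "Inventory"))
```

The resource that was being initialized when the failure occurred is included in the result, since the entire resource must be sent again.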

Guaranteed Init Message Delivery
This setting, when used in combination with the RESUMEINITONSTARTUP setting, ensures that if initialization fails, no messages are lost and initialization automatically resumes at the point of failure.

Three entries in the PrismMQService.ini file relate to this feature. Both INITGUARANTEEDMESSAGEDELIVERY and RESUMEINITONSTARTUP are set to True by default. You can find both settings in the [PRISM] section of the PrismMQService.ini file; INITGUARANTEEDMESSAGEDELIVERY also appears in the [RIL] section.

[PRISM]
INITGUARANTEEDMESSAGEDELIVERY=True
RESUMEINITONSTARTUP=True
[RIL]
INITGUARANTEEDMESSAGEDELIVERY=True

When RESUMEINITONSTARTUP is set to True, a consumer that goes down during an initialization resumes the initialization automatically when it is restarted. In tandem with this property, the user must also set INITGUARANTEEDMESSAGEDELIVERY to True on the sending server for whichever initialization type should be guaranteed not to lose messages if RabbitMQ goes down.
Here's a typical use case: Let's say you start an initialization and realize the consumer's default of 20 threads is too low for the power of the machine. You can pause that consumer, change the thread count, and then resume the consumer. Initialization will pick up where it left off with the new thread count. If Guaranteed Message Delivery (GMD) is set to False, some messages may be lost if there is a RabbitMQ failure (not just a PMQ failure). In such a case, even a restart of initialization will likely get stuck and not finish, and the initialization will have to be restarted from the last completed resource. Note that turning on GMD for a system that has both the sender and the consumer on the same system (i.e., RIL and Prism on the same system) will slow down initialization for this system and any others that might be included in an initialization batch with this system.
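A quick way to review the three entries described in this section is to read them from the file contents. The following Python sketch assumes PrismMQService.ini parses as a standard INI file; the function name `gmd_status` is hypothetical. A result of None means the entry is absent, in which case the default described in this section would presumably apply.

```python
import configparser

# The (section, key) pairs named in this section.
ENTRIES = [
    ("PRISM", "INITGUARANTEEDMESSAGEDELIVERY"),
    ("PRISM", "RESUMEINITONSTARTUP"),
    ("RIL", "INITGUARANTEEDMESSAGEDELIVERY"),
]


def gmd_status(ini_text):
    """Map 'SECTION.KEY' to True, False, or None (entry not present)."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    status = {}
    for section, key in ENTRIES:
        name = f"{section}.{key}"
        if parser.has_section(section) and parser.has_option(section, key):
            status[name] = parser.get(section, key).strip().lower() == "true"
        else:
            status[name] = None
    return status


# Sample text copied from the settings shown in this section.
sample = (
    "[PRISM]\n"
    "INITGUARANTEEDMESSAGEDELIVERY=True\n"
    "RESUMEINITONSTARTUP=True\n"
    "[RIL]\n"
    "INITGUARANTEEDMESSAGEDELIVERY=True\n"
)
for name, value in gmd_status(sample).items():
    print(name, value)
```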

Pause/Resume Initialization
You can pause/resume the initialization process. You cannot pause the Sender, but each downstream system that is consuming the resources being sent can be paused and/or resumed. Start initialization. The Server List is displayed. Click on a server. Click the Pause button. The Initialization consumer will stop consuming messages. (THIS WILL NOT STOP THE SENDER). The button caption changes to Resume. Click it again and the initialization consumer will Resume.

Cancel or Delete Batch
You can cancel or delete a batch. Select the batch and then click the Cancel or Delete button as needed.
If you cancel an initialization batch, all running initializations will be stopped. If you delete an initialization batch, all messages currently in the queue will be lost.

Pending Tab
On the Day to Day tab of both the V9 Dashboard and Prism Dashboard is a pair of tabs: Completed and Pending. By default, the Completed tab is selected, showing information for completed Day-to-Day replication records. The Pending tab, on the other hand, can be useful for verifying the messages that are about to be put on the RabbitMQ bus.
The screen shows summary information (for all connections):

Total number of new messages pending
Messages being processed on the RabbitMQ bus (i.e., "in process")
Messages placed on the RabbitMQ bus (i.e., completed)

Limiting the Number of Rows Fetched by GET Requests
You can set up limiting for a resource by adding a section to either the PrismBackOffice.ini or the PrismCommon.ini file (depending on which Windows service the resource is namespaced to).
[GETLIMIT]
# >0: limit number of rows fetched
# 0: no limit (default)
# -1: fetch but log a warning
# -2: do not fetch and log a warning
# -3: error out
customer=100
document=100

Example
Adding this to the backoffice.ini file will limit the TransferSlip top-level resource.
[GETLIMIT]
# Limits slip GET to a max of 100 slips
Transferslip=100
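The value meanings listed in the [GETLIMIT] comments can be summarized in a short Python sketch. It only restates the documented meanings; it does not show how Prism enforces them, and the function name `interpret_get_limit` is hypothetical.

```python
def interpret_get_limit(value):
    """Describe a [GETLIMIT] value using the meanings listed in this section."""
    if value > 0:
        return f"limit to {value} rows"
    meanings = {
        0: "no limit (default)",
        -1: "fetch but log a warning",
        -2: "do not fetch and log a warning",
        -3: "error out",
    }
    # Anything else is not covered by the comments in this section.
    return meanings.get(value, "not a documented value")


# Entries taken from the examples in this section.
limits = {"customer": 100, "document": 100, "Transferslip": 100}
for resource, value in limits.items():
    print(f"{resource}: {interpret_get_limit(value)}")
```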

Restart PrismMQ after Deleting Replication Records
The key service involved with replication, PrismMQ, caches records aggressively to improve performance. If the record is in the cache, it is assumed that the record exists and PrismMQ doesn't need to confirm its existence in the database. This is usually not a problem; however, it can become a problem if you have deleted replication-related records (e.g. replication_status) from tables while troubleshooting. The record you removed as part of your troubleshooting efforts may still be in the cache. Therefore, you should restart PrismMQ as soon as possible after deleting replication-related records. Restarting PrismMQ will clear the cache.

Common Replication Issues
This table of replication troubleshooting tips comes from the 2022 Retail Pro Prism 2 Workshop.

ISSUE	POTENTIAL ROOT CAUSE	IDENTIFY	SOLUTION
Missing/stuck data	Initialization process is running/stuck	Look at the Connection Manager to identify where the data is (sent? processing?) and check the producer/consumer cache tables for data. Check the replication status table on the store to see whether the init session is in progress, paused, or canceled.	If data is still moving and processing, you will need to wait. Init takes priority over D2D. If data isn't moving/processing, determine why if possible. Cancel initialization if it is paused or can't resume.
Missing/stuck data	Error: constraint or potentially DB optimistic lock, if it exceeds the defined retry count	Enable log level 3 and resend the document to capture additional details and determine exactly what the constraint is. Possible server performance issue. An optimistic lock is usually the same record sent more than once (ex: same customer being resent over and over).	Correct the issue on the document or resend the data.
Missing/stuck data	Backlog of data on the POA/RIL producer tables	Identify root cause: performance, init session in progress, locked queue cycling.	Depends on root cause. If a locked queue is the issue, then address this. Avoiding this is key. If an init session is in progress or stuck, see the comments on init priority and issues with stuck queues above.
Missing/stuck data	Malformed custom JSON file (integrations)	Identify in the PrismMQ logs. Example: !Error | Data was not readable, likely a serializer error. Cannot report replication status details	Contact the developer.
Missing/stuck data	DB Data file size capacity (OS file size limit)	Seen in DB logs (Oracle alert_rproods.log file) and possibly in PrismMQ logs.	Add additional data file (see RIL TTK)
Missing/stuck data	RabbitMQ Mnesia DB files corrupt	Logging into the RMQ management console gives an error, even after restarting the RabbitMQ service.	Delete the RabbitMQ queues in the queue folder until you find the corrupt queue/queues. C:\ProgramData\RetailPro\Server\RabbitMQ\db\rabbit@"hostname"-mnesia\msg_stores\vhosts\628WB79CIFDYO9LJI6DKMI09L\queues
Process out of memory	Memory limitation: 32-bit memory address limits (1.8+ GB max)	Task Manager (Details: peak or current memory usage), noting memory usage for PMQ processes	Depends on the process. Restart the service in most cases.
Process out of memory	Customer UDF. Extremely large document.	Memory limit: identify message size in the producer cache/consumer cache table	Reduce the number of consuming threads. Don't send the offending data.
RabbitMQ Lost connections (repeatedly). Possible lost messages or initialization failures.	Known issue (fixed in the latest release of Retail Pro Prism 1.14.7). RabbitMQ queue setup: heartbeat check is out of sync with connection timeout.	Seen in RMQ logs every couple of seconds, over and over on all the connected systems: "Client unexpectedly closed TCP connection"	Upgrade to Retail Pro Prism 1.14.7.2153 or later.
Preferences overwritten	Core resources replicated from store to POA	New store was published before joining the enterprise. Changes made at the store replicate to the POA. Scheduler will trigger core resources to be sent with some tasks (Update active season).	The Retail Pro Prism 2.1 release has the ability to turn off core resources (in the PMQ config file). Disable the scheduler task "update active season" (set active = 0) on all store servers (ideally before joining the enterprise). Clean out producer_cache before joining the enterprise.
Data stuck in RabbitMQ on the sending side	Firewall	See if you can establish telnet or verify that ports are open.	Correct firewall setup
Data stuck in RabbitMQ on the sending side	Store server networking issue	Ping or attempt to establish any connection to the store/receiving server	Correct network issue
Join Enterprise Error - Invalid controller Data	Invalid controller data (restoring or reinstalling a system previously joined)	Likely the controller table has a record of this system with a different SID or same SID and controller ID. Other possible issues could also exist (see KB).	KB: Resolving Invalid Controller Data Error in RP Prism's Enterprise Manager
Join Enterprise Error - Invalid controller Data	Init of core resources failed/stuck	Join fails or gets stuck on the last step, where it is initializing the core resources. Check the replication_status table and the producer/consumer cache tables to determine whether data is really stuck or has completed and is just missing the end of init message.	Kill the TTK session and clean up the init session if it remains. Then initialize the core resources manually.

Retail Pro Prism Replication Tables

*replace rps with rpsods on MySQL

	Staging location for D2D data before messages are posted to the producer_cache table with the message (payload). You can get some idea of how and what is pending by querying the table, to see the messages which are being updated and which are waiting to be published to the rps.producer_cache table. Select * from rps.pub_dataevent_queue;
	Producer_cache: Contains the message which each subscriber will get. It notes if the message is an initialization and the status of that message (locked or not). Producer_cache_destination: Contains the relationship of which queues (stores) the related message will be delivered to. * Status: 0 = normal, 1 = locked | INIT: 0 = D2D, 1 = Initialization Select pc.resource_name, pc.status, pc.init, count(*) from RPS.producer_cache pc group by pc.resource_name, pc.status, pc.init; Select pd.to_server, pc.resource_name, pc.status, pc.init, count(*) from RPS.producer_cache pc, RPS.producer_cache_destination pd where pc.sid = pd.producer_cache_sid group by pd.to_server, pc.resource_name, pc.status, pc.init;
	Identify and monitor the init process. The following will give you a list of init sessions and their state. You may need to change the state value if the init process is stuck. The Connection Manager UI gets the init status and progress details from these tables. Those updates for records processed come from the store (replication_status table) via the mgmt queue. * Status: 0 = canceled, 1 = in progress, 2 = complete Select ic.sid, ic.created_datetime, c.controller_name, ic.total_processed, ic.total_failed, ic.status from RPS.remote_connection rc, RPS.init_status_connection ic, RPS.controller c where ic.remote_connection_sid = rc.sid and rc.remote_controller_sid = c.sid order by ic.created_datetime desc;
	D2D sessions. Used to see and monitor those sessions. Connection Manager UI gets data from here. This will give a general idea of the data here. Select c.controller_name, rs.session_type, rs.state, rs.init_status, rs.messages_expected, rs.messages_received, rs.messages_sent, rs.messages_failed from RPS.remote_connection rc, RPS.replication_status rs, RPS.controller c where rs.remote_connection_sid = rc.sid and rc.remote_controller_sid = c.sid; Session: (0 = V9 init, 2 = POA init, 1 = V9 D2D, 3 = POA D2D) State: (0 = canceled, 1 = in progress, 2 = complete, 4 = paused) Init_status: (1 = end of init message processed)
	consumer_cache contains the messages received from POA and Prism Stores. * INIT: 0 = D2D, 1 = Initialization Select resource_name, status, init, count(*) from rps.consumer_cache group by resource_name, status, init; You can filter and look for messages from a particular connection (from_server) or for a particular message type (resource_name) to determine if the messages are just pending processing. Messages are processed based on the "process_order" value (lowest to highest) and by oldest to newest messages (created_datetime). You can also filter for a specific message (resource_data). If you know the document SID or some unique value for that document, you could filter for that document by using a "like" statement. If there is a lot of data here, this could be slow. EXAMPLE: Select * from rps.consumer_cache where resource_data like '%123%'; -- where 123 is the sid value
	Provides you with a list of the resources and the process order. EXAMPLE: Select resource_name, process_order from RPS.prism_resource where process_order > 0 order by process_order
	Lists those queues that are locked. EXAMPLE: Select * from RPS.replication_locked_queue;
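The processing order described for consumer_cache above (process_order lowest to highest, then created_datetime oldest to newest) can be illustrated with a small Python sketch. All rows and process_order values below are made up, and the function name `processing_order` is hypothetical. The sketch supplies process_order as a lookup keyed by resource name, which is an assumption made for illustration only.

```python
from datetime import datetime

# Made-up process_order values keyed by resource name.
PROCESS_ORDER = {"customer": 10, "document": 30}

# Made-up pending messages; only the fields needed for ordering are shown.
messages = [
    {"resource_name": "document", "created_datetime": datetime(2024, 4, 1, 9, 0)},
    {"resource_name": "customer", "created_datetime": datetime(2024, 4, 1, 9, 5)},
    {"resource_name": "customer", "created_datetime": datetime(2024, 4, 1, 8, 55)},
]


def processing_order(rows, order_by_resource):
    """Sort by process_order (lowest first), then created_datetime (oldest first)."""
    return sorted(
        rows,
        key=lambda row: (order_by_resource[row["resource_name"]], row["created_datetime"]),
    )


for row in processing_order(messages, PROCESS_ORDER):
    print(row["resource_name"], row["created_datetime"].isoformat())
```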
