How can I make my Java matrix-multiplication code more fail-safe?

Problem description:

I am working on a project where I was given a Java matrix-multiplication program that runs on a distributed system. It is run like so:

usage: java Coordinator maxtrix-dim number-nodes coordinator-port-num

For example:

java blockMatrixMultiplication.Coordinator 25  25 54545

Here is a snapshot of the output:

I want to extend this code with some kind of fail-safe ability, and I am curious how I would create checkpoints within a running matrix-multiplication calculation. The general idea is to recover to where it was in a computation (it doesn't need to be fine-grained; recovering to the beginning, i.e. row 0, column 0, is enough).

My first idea is to use log files (for example with Apache log4j), where I would log the relevant matrix status. Then, if the app is forcibly shut down in the middle of a calculation, we could recover to a reasonable checkpoint.

Should I use MySQL for such a task (or maybe a more lightweight database)? Or would a basic log file (plus some useful Apache libraries) be good enough? Any tips appreciated, thanks.

Source code:

MatrixMultiple

Coordinator

Connection

DataIO

Worker

If I understand the problem correctly, all you need to do is recover your place in a single matrix calculation if the application crashes or is quit halfway through.

The simplest approach would be to recover just the two matrices you were actively multiplying, but none of your progress, and multiply them from the beginning the next time you load the application.

Process:

  1. At the beginning of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, create a file, let's call it recovery_data.txt, containing the state of the two arrays being multiplied (parameters a and b). Alternatively, you could use a simple database for this.
  2. At the end of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, right before you return, clear the contents of the file, or wipe your database.
  3. When the program is initially run, most likely near the beginning of main(String[] args), check whether the text file is non-empty. If it is, multiply the matrices stored in the file and display the output; otherwise proceed as usual.
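The three steps above could be sketched roughly as follows. This is only an illustration: the RecoveryFile class, its method names, and the row-per-line layout with a blank line between the two matrices are assumptions, not part of the original program.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the class and method names are not part of the original program. */
final class RecoveryFile {

    /** Step 1: store both inputs, one matrix row per line, blank line between matrices. */
    static void save(Path file, int[][] a, int[][] b) {
        List<String> lines = new ArrayList<>();
        appendRows(lines, a);
        lines.add("");
        appendRows(lines, b);
        try {
            Files.write(file, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Step 2: remove the file once the product has been computed. */
    static void clear(Path file) {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Step 3: returns {a, b} if an unfinished multiplication was saved, otherwise null. */
    static int[][][] load(Path file) {
        if (!Files.exists(file)) {
            return null;
        }
        List<String> lines;
        try {
            lines = Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<int[]> a = new ArrayList<>();
        List<int[]> b = new ArrayList<>();
        List<int[]> current = a;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                current = b; // the blank line marks the start of the second matrix
                continue;
            }
            String[] parts = line.trim().split("\\s+");
            int[] row = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                row[i] = Integer.parseInt(parts[i]);
            }
            current.add(row);
        }
        if (a.isEmpty() || b.isEmpty()) {
            return null; // nothing usable was saved
        }
        return new int[][][] { a.toArray(new int[0][]), b.toArray(new int[0][]) };
    }

    private static void appendRows(List<String> lines, int[][] m) {
        for (int[] row : m) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < row.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(row[i]);
            }
            lines.add(sb.toString());
        }
    }
}
```

Under these assumptions, multiplyMatrix would call RecoveryFile.save(Paths.get("recovery_data.txt"), a, b) first and RecoveryFile.clear(...) just before returning, while main would call RecoveryFile.load(...) and, if the result is not null, multiply the two recovered matrices before doing anything else.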

Implementation notes:

  • Whether to use a simple text file or a full-fledged relational database is a decision you will have to make, mostly based on real-world data that only you know, but in my mind a text file wins out in most situations, and here is why. You are going to read the data sequentially to rebuild your matrices, so a relational model is not that useful. Databases are harder to work with than a text file (not too hard, but there is no question which is simpler), and since you would make little use of querying, that extra effort isn't balanced out by the ways a database usually makes a programmer's life easier.
  • Consider how you are going to store your arrays. In a text file you have several options. My recommendation would be to store each matrix row on one line of text, with the values separated by spaces, commas, or some other character, and then put an extra blank line before the second matrix. I think a similar approach is used in crAlexander's answer here, but I have not tested his code. Alternatively, you could use something more complicated like JSON, but I think that would be too heavy-handed to justify. If you are using a database, the relational structure should make several logical arrangements for your data apparent as well.

You expressed interest in saving some calculations by taking advantage of the possibility that some of them were already handled the last time the program ran. Let's first look at the pros and cons of adding checkpoints after every row has been processed, as best I can see them.

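Pros:

  • Saves computation time the next time the program is run if the system was shut down.

Cons:

  • Either the extra writes take up more processes spread across more nodes (more on this later), or they increase the overall latency of your calculations, because you now have to do a database write for every checkpoint.
  • More complicated to implement (but probably not by much).
  • If my comments on the "minimum viable solution" about getting away with a text file convinced you that you would not need an RDBMS, I take back the part about not making use of queries and about everything being accessed sequentially, so a database is now perhaps a more reasonable choice.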

I'm not saying that checkpoints are definitely not the better solution, just that I don't know whether they are worth it. Here is what I would consider:

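  • Do you expect people to quit halfway through calculations often, relative to the total number of calculations they run? If you think this feature will be heavily used, the benefit of adding checkpoints carries more weight relative to the drawback of slowing down the overall calculation.
  • Does a typical calculation that people give the program take a long time to finish? If so, the added latency I mentioned in the cons is smaller as a percentage and so may be more tolerable; on the other hand, users are already less satisfied with performance, which offsets some of that effect. A long running time also strengthens the argument for checkpoints, because there is more time to potentially save.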

And so I would only recommend checkpointing like this if you expect this to happen relatively often, and if a calculation takes a relatively long time to complete.

If you decide to go with checkpoints, then modify the approach as follows:

  • After each row of the result has been processed, write the content of that row to your database or, if you use a text file, append it at the end of the file, after another empty line to separate it from the last matrix.
  • On startup, if you need to finish a calculation that has already begun, compute and distribute only the rows that have not yet been processed, and retrieve the content of the other rows from your database or text file.
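As a rough illustration of row-level checkpoints, the sketch below records each finished result row and, on a later run, recomputes only the rows that are missing. It makes several assumptions that are not in the original program: a single-process multiplication instead of the distributed one, a separate checkpoint file rather than appending to recovery_data.txt, and a made-up "index:v0 v1 ...;" line format whose trailing semicolon marks a completely written line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of row-level checkpointing; not code from the original program. */
final class RowCheckpoint {

    /** Multiplies a by b, reusing any result rows already recorded in the checkpoint file. */
    static int[][] multiplyResumable(int[][] a, int[][] b, Path checkpoint) {
        int cols = b[0].length;
        int[][] result = new int[a.length][];
        restoreRows(result, cols, checkpoint);
        for (int i = 0; i < a.length; i++) {
            if (result[i] != null) {
                continue; // this row was finished in an earlier run
            }
            int[] row = new int[cols];
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < b.length; k++) {
                    row[j] += a[i][k] * b[k][j];
                }
            }
            result[i] = row;
            appendRow(i, row, checkpoint);
        }
        return result;
    }

    /** Appends one finished row as "index:v0 v1 ...;" (assumed format). */
    static void appendRow(int index, int[] row, Path checkpoint) {
        StringBuilder sb = new StringBuilder().append(index).append(':');
        for (int j = 0; j < row.length; j++) {
            if (j > 0) sb.append(' ');
            sb.append(row[j]);
        }
        sb.append(';'); // end marker, so a line cut short by a crash can be detected
        try {
            Files.write(checkpoint, Collections.singletonList(sb.toString()),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fills result with every complete row found in the file; damaged lines are skipped. */
    static void restoreRows(int[][] result, int cols, Path checkpoint) {
        if (!Files.exists(checkpoint)) {
            return;
        }
        List<String> lines;
        try {
            lines = Files.readAllLines(checkpoint);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String line : lines) {
            if (!line.endsWith(";")) {
                continue; // incomplete write
            }
            String[] halves = line.substring(0, line.length() - 1).split(":");
            if (halves.length != 2) continue;
            String[] parts = halves[1].trim().split("\\s+");
            if (parts.length != cols) continue;
            try {
                int index = Integer.parseInt(halves[0].trim());
                if (index < 0 || index >= result.length) continue;
                int[] row = new int[cols];
                for (int j = 0; j < cols; j++) row[j] = Integer.parseInt(parts[j]);
                result[index] = row;
            } catch (NumberFormatException e) {
                // Not a valid row; leave it unset so it is recomputed.
            }
        }
    }
}
```

The checkpoint file only makes sense for the same pair of input matrices, so it should be deleted together with the saved inputs once the full product has been returned.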

A quick point on implementing frequent checkpoints: you could greatly reduce the extra latency that frequent checkpoints add by pushing the write out to an additional thread. Doing this would use more processes or threads, and there is always some latency in actually spawning a process or thread, but you do not have to wait for the entire write operation to complete before proceeding.
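One way to move checkpoint writes off the calculation thread is a single-threaded executor, sketched below. The AsyncCheckpointWriter class and its methods are hypothetical names for illustration; the 30-second wait in close() is an arbitrary assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical background writer; the names are illustrative only. */
final class AsyncCheckpointWriter implements AutoCloseable {
    private final Path file;
    // A single worker thread keeps the writes in submission order.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    AsyncCheckpointWriter(Path file) {
        this.file = file;
    }

    /** Queues one line and returns immediately; the write happens on the worker thread. */
    void submit(String line) {
        writer.execute(() -> {
            try {
                Files.write(file, Collections.singletonList(line),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /** Waits for queued writes to finish; call this once the calculation is done. */
    @Override
    public void close() {
        writer.shutdown();
        try {
            if (!writer.awaitTermination(30, TimeUnit.SECONDS)) {
                writer.shutdownNow();
            }
        } catch (InterruptedException e) {
            writer.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off is that rows still waiting in the queue when the program dies are lost and have to be recomputed on the next run, which is usually acceptable for a checkpoint.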

If there is an unchecked edge case in which some sort of invalid matrix crashes the program, this fail-safe now bricks the program entirely by retrying the same multiplication on every start. To combat this, I see some obvious solutions, but perhaps a bit of thought would let you modify my approaches into something you prefer:

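  • Use plenty of try and catch statements, and if you catch an error that appears to be caused by malformed data, wipe the recovery file, or modify it to add a note telling the program to treat it as a special case. A good way to handle this special case might be to display the two matrices at startup with a message explaining that your program was unable to multiply them, possibly because of malformed content.
  • Add data to the file/database about how many times the program has exited while working on the current problem, and if this is not the first recovery, treat it like the special case in the option above.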
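The second option, counting recovery attempts, could look something like the sketch below. The RecoveryAttempts class, the separate counter file, and the limit of one recovery attempt are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical guard against retrying a recovery that keeps crashing. */
final class RecoveryAttempts {
    static final int MAX_ATTEMPTS = 1; // assumed limit: try to recover only once

    /** Records one more attempt and reports whether recovery should still be tried. */
    static boolean registerAttempt(Path counterFile) {
        int attempts = read(counterFile) + 1;
        try {
            Files.write(counterFile,
                    Integer.toString(attempts).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return attempts <= MAX_ATTEMPTS;
    }

    /** Call after a multiplication finishes normally, or after the special case is handled. */
    static void reset(Path counterFile) {
        try {
            Files.deleteIfExists(counterFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static int read(Path counterFile) {
        if (!Files.exists(counterFile)) {
            return 0;
        }
        try {
            String text = new String(Files.readAllBytes(counterFile),
                    StandardCharsets.UTF_8).trim();
            return Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NumberFormatException e) {
            return MAX_ATTEMPTS; // unreadable counter: be cautious and refuse another retry
        }
    }
}
```

On startup, if saved matrices exist, the program would call registerAttempt before multiplying them. When it returns false, the program would skip the multiplication, show the two matrices with an explanation, and then wipe both the recovery data and the counter.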

I hope this gives you enough information to implement your fail-safe in the way that makes the most sense for the realistic use you expect. Note that there may be other ways to approach this problem as well, and each would have its own pros and cons to take into consideration.