Let’s assume that you are working on an application that requires all entries in a table (let’s call it entries) to be unique. If we are writing our application with PHP/Laravel, the migration of the table (more precisely, its up() method) would look like this:
public function up()
{
    Schema::create('entries', function (Blueprint $table) {
        $table->increments('id');
        $table->string('parameter_001')->nullable();
        $table->string('parameter_002')->nullable();
        $table->string('parameter_003')->nullable();
        $table->string('parameter_004')->nullable();
        $table->timestamps();
    });
}
One way to solve this issue (if not the most obvious one that comes to mind) is the following: use a simple check. Create a method called isDuplicated($entry) that searches for the entry in the table; if it doesn’t exist, insert it, otherwise throw an exception. This method could look something like this (it should be added to the Entry model class):
public static function isDuplicated($inputs)
{
    $exists = self::where('parameter_001', $inputs['parameter_001'])
        ->where('parameter_002', $inputs['parameter_002'])
        ->where('parameter_003', $inputs['parameter_003'])
        ->where('parameter_004', $inputs['parameter_004'])
        ->count();

    return $exists ? true : false;
}
PS: we could add a listener to the insertion and run the check automatically whenever we use the Entry::create() method, but for the sake of simplicity we won’t do that in this tutorial.
How could this happen, you might ask, since we are already checking the table before inserting? It turns out the problem arises when the application tries to insert the same entry twice at the exact same moment. Since both operations check at the same time whether the entry already exists in the table via the isDuplicated($entry) method, it returns false for both of them, and we end up inserting the same entry twice.
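A better approach is to let MySQL enforce the uniqueness itself, for instance with a composite unique index over the four parameter columns. The exact definition below is a sketch and an assumption (the original snippet is not shown); it would go inside the same up() method:

// Sketch (assumption): a composite unique index over the four nullable columns.
// Note that the combined key length has to fit within MySQL's maximum index size.
$table->unique(['parameter_001', 'parameter_002', 'parameter_003', 'parameter_004']);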
With such an index in place, if we try to insert the following entry twice:
['parameter_001' => 'value_001', 'parameter_002' => 'value_002', 'parameter_003' => 'value_003', 'parameter_004' => 'value_004'];
As expected, the first one will get inserted, and the second one will get rejected (also as expected) due to a violation of the index/key we are using.
But if we try to insert the following entry twice:
['parameter_001' => 'value_001', 'parameter_002' => 'value_002', 'parameter_003' => 'value_003'];
it will get inserted twice without any problem. But why would this happen?
Remember that the fields in this table are nullable. This means that when we try to insert a new entry, our “global” index only guards against an entry that has all sub-keys present and that was already inserted. But if a sub-key is omitted, its column is NULL, and since MySQL treats NULL values as distinct in unique indexes, the check fails and we end up with a duplicate entry.
Even though this solution is not the right one, it gets us close to what we should do, in the sense that we should let MySQL handle the issue instead of doing it in the code. We can use the same concept without compromising performance and without failing when a “sub-key” is omitted.
The solution to this problem is to create an additional column that represents the hash of the entry (let’s call it hash), and this hash plays the role of a unique key. When we try to insert the entry, we compute its hash, attach it, and attempt the insert; if the operation goes through, i.e. MySQL doesn’t throw any exception, then the entry is indeed unique. Otherwise, MySQL will throw an exception, since it can’t insert two rows with the same unique key.
First we need to add this line to the up()
method of the Entry table migration:
$table->string('hash')->unique();
Now, there is one more thing we need to take care of in this situation: we have to make sure that the hash is unique and that there are no collisions. In other words, we shouldn’t rely on one-way hashing like md5, since two different entries could in theory end up with the same key, but a reversible encoding like base64 (or even just concatenating all the fields) would work.
PS: don’t forget to add the fillable property to the model class:
protected $fillable = ['parameter_001', 'parameter_002', 'parameter_003', 'parameter_004', 'hash'];
For example, for the following entry:
['parameter_001' => 'value_001', 'parameter_002' => 'value_002', 'parameter_003' => 'value_003'];
we should generate the hash from a string like value_001-value_002-value_003, with a separator between the fields.
PS: do not use a simple implode() on the input array here, since that would leave no trace of the fields that are not present in the entry. Take, for example, the following two entries:
['parameter_001' => 'value_001', 'parameter_002' => 'value_002', 'parameter_003' => 'value_003'];
and
['parameter_001' => 'value_001', 'parameter_002' => 'value_002', 'parameter_004' => 'value_003'];
If we concatenate without any separator, we end up hashing value_001value_002value_003 in both cases, whereas when we add a separator (keeping a slot for every column), we will be hashing value_001-value_002-value_003 for the first entry, and value_001-value_002--value_003 (notice the double - sign where parameter_003 is missing) for the second one, which produces a different hash from the first.
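To make this concrete, here is one possible way to build such a string (a sketch only; the makeHash() name, the fixed column list, and the use of base64_encode() are assumptions, not something prescribed above). Note that implode() is only used after filling a slot for every column, which avoids the pitfall just described:

// Sketch: build the unique hash for an entry (could live in the Entry model)
public static function makeHash(array $inputs)
{
    $columns = ['parameter_001', 'parameter_002', 'parameter_003', 'parameter_004'];

    $parts = [];
    foreach ($columns as $column) {
        // Keep a slot for every column so an omitted field still changes the result
        $parts[] = $inputs[$column] ?? '';
    }

    // A reversible encoding (no collisions), as suggested above
    return base64_encode(implode('-', $parts));
}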
As you can see, when we try to insert a duplicate entry, MySQL throws an exception with SQL state 23000 and error code 1062, so we can write a method to check for this:
private function isDuplicateEntryException(QueryException $e)
{
    $sqlState = $e->errorInfo[0];
    $errorCode = $e->errorInfo[1];

    if ($sqlState === "23000" && $errorCode === 1062) {
        return true;
    }

    return false;
}
And then we can add this check in our code as follows:
try {
    // insert the entry
} catch (QueryException $e) {
    if ($this->isDuplicateEntryException($e)) {
        throw new DuplicateEntryException('Duplicate Entry');
    }

    throw $e;
}
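Putting everything together, a hypothetical createUnique() method on the Entry model could tie these pieces up. This is a sketch only: the method name is invented, makeHash() refers to the sketch above, and isDuplicateEntryException() is assumed here to be declared static on the model:

// Sketch: insert an entry and rely on the unique `hash` column to reject duplicates.
// Assumes `use Illuminate\Database\QueryException;` at the top of the model file.
public static function createUnique(array $inputs)
{
    try {
        return self::create(array_merge($inputs, [
            'hash' => self::makeHash($inputs),
        ]));
    } catch (QueryException $e) {
        if (self::isDuplicateEntryException($e)) {
            throw new DuplicateEntryException('Duplicate Entry');
        }

        throw $e;
    }
}

Callers can then catch DuplicateEntryException and decide whether to ignore the duplicate or report it.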