2010-09-19 7 views
3

Removing duplicates from a large MySQL table

I have a question about MySQL. I have a table with 7,479,194 records. Some records are duplicated. I would like to do:

insert into new_table 
    select * 
    from old_table 
group by old_table.a, old_table.b 

so that I would get rid of the duplicate entries ... but the problem is that this is a large amount of data. The table is MyISAM.

Here is some sample data; I would like to group by city, short_ccode ...

id   city  post_code  short_ccode 
---------------------------------------------------- 
4732875  Celje  3502    si 
4733306  Celje  3502    si 
4734250  Celje  3502    si 

I suppose I have to modify the my.ini file to give the GROUP BY statement more memory ... which settings are responsible for that?

I have a machine with 3 GB of RAM and a 2 GHz processor.

My ini file:


# aaaMySQL Server Instance Configuration File 
# ---------------------------------------------------------------------- 
# Generated by the MySQL Server Instance Configuration Wizard 
# 
# 
# Installation Instructions 
# ---------------------------------------------------------------------- 
# 
# On Linux you can copy this file to /etc/my.cnf to set global options, 
# mysql-data-dir/my.cnf to set server-specific options 
# (@localstatedir@ for this installation) or to 
# ~/.my.cnf to set user-specific options. 
# 
# On Windows you should keep this file in the installation directory 
# of your server (e.g. C:\Program Files\MySQL\MySQL Server 4.1). To 
# make sure the server reads the config file use the startup option 
# "--defaults-file". 
# 
# To run run the server from the command line, execute this in a 
# command line shell, e.g. 
# mysqld --defaults-file="C:\Program Files\MySQL\MySQL Server 4.1\my.ini" 
# 
# To install the server as a Windows service manually, execute this in a 
# command line shell, e.g. 
# mysqld --install MySQL41 --defaults-file="C:\Program Files\MySQL\MySQL Server 4.1\my.ini" 
# 
# And then execute this in a command line shell to start the server, e.g. 
# net start MySQL41 
# 
# 
# Guildlines for editing this file 
# ---------------------------------------------------------------------- 
# 
# In this file, you can use all long options that the program supports. 
# If you want to know the options a program supports, start the program 
# with the "--help" option. 
# 
# More detailed information about the individual options can also be 
# found in the manual. 
# 
# 
# CLIENT SECTION 
# ---------------------------------------------------------------------- 
# 
# The following options will be read by MySQL client applications. 
# Note that only client applications shipped by MySQL are guaranteed 
# to read this section. If you want your own MySQL client program to 
# honor these values, you need to specify it as an option during the 
# MySQL client library initialization. 
# 
[client] 

port=3306 


# SERVER SECTION 
# ---------------------------------------------------------------------- 
# 
# The following options will be read by the MySQL Server. Make sure that 
# you have installed the server correctly (see above) so it reads this 
# file. 
# 
[wampmysqld] 

# The TCP/IP Port the MySQL Server will listen on 
port=3306 


#Path to installation directory. All paths are usually resolved relative to this. 
basedir=d:/wamp/bin/mysql/mysql5.0.45 

#log file 
log-error=d:/wamp/logs/mysql.log 

#Path to the database root 
datadir=d:/wamp/bin/mysql/mysql5.0.45/data 

# The default character set that will be used when a new schema or table is 
# created and no character set is defined 
default-character-set=utf8 

# The default storage engine that will be used when create new tables when 
default-storage-engine=MyISAM 

# The maximum amount of concurrent sessions the MySQL server will 
# allow. One of these connections will be reserved for a user with 
# SUPER privileges to allow the administrator to login even if the 
# connection limit has been reached. 
max_connections=1000 

# Query cache is used to cache SELECT results and later return them 
# without actual executing the same query once again. Having the query 
# cache enabled may result in significant speed improvements, if your 
# have a lot of identical queries and rarely changing tables. See the 
# "Qcache_lowmem_prunes" status variable to check if the current value 
# is high enough for your load. 
# Note: In case your tables change very often or if your queries are 
# textually different every time, the query cache may result in a 
# slowdown instead of a performance improvement. 
query_cache_size=16M 

# The number of open tables for all threads. Increasing this value 
# increases the number of file descriptors that mysqld requires. 
# Therefore you have to make sure to set the amount of open files 
# allowed to at least 4096 in the variable "open-files-limit" in 
# section [mysqld_safe] 
table_cache=500 

# Maximum size for internal (in-memory) temporary tables. If a table 
# grows larger than this value, it is automatically converted to disk 
# based table This limitation is for a single table. There can be many 
# of them. 
tmp_table_size=32M 


# How many threads we should keep in a cache for reuse. When a client 
# disconnects, the client's threads are put in the cache if there aren't 
# more than thread_cache_size threads from before. This greatly reduces 
# the amount of thread creations needed if you have a lot of new 
# connections. (Normally this doesn't give a notable performance 
# improvement if you have a good thread implementation.) 
thread_cache_size=12 

#*** MyISAM Specific options 

# The maximum size of the temporary file MySQL is allowed to use while 
# recreating the index (during REPAIR, ALTER TABLE or LOAD DATA INFILE. 
# If the file-size would be bigger than this, the index will be created 
# through the key cache (which is slower). 
myisam_max_sort_file_size=100G 

# If the temporary file used for fast index creation would be bigger 
# than using the key cache by the amount specified here, then prefer the 
# key cache method. This is mainly used to force long character keys in 
# large tables to use the slower key cache method to create the index. 
myisam_max_extra_sort_file_size=100G 

# If the temporary file used for fast index creation would be bigger 
# than using the key cache by the amount specified here, then prefer the 
# key cache method. This is mainly used to force long character keys in 
# large tables to use the slower key cache method to create the index. 
myisam_sort_buffer_size=32M 

# Size of the Key Buffer, used to cache index blocks for MyISAM tables. 
# Do not set it larger than 30% of your available memory, as some memory 
# is also required by the OS to cache rows. Even if you're not using 
# MyISAM tables, you should still set it to 8-64M as it will also be 
# used for internal temporary disk tables. 
key_buffer_size=64M 

# Size of the buffer used for doing full table scans of MyISAM tables. 
# Allocated per thread, if a full scan is needed. 
read_buffer_size=2M 
read_rnd_buffer_size=8MK 

# This buffer is allocated when MySQL needs to rebuild the index in 
# REPAIR, OPTIMZE, ALTER table statements as well as in LOAD DATA INFILE 
# into an empty table. It is allocated per thread so be careful with 
# large settings. 
sort_buffer_size=256M 


#*** INNODB Specific options *** 


# Use this option if you have a MySQL server with InnoDB support enabled 
# but you do not plan to use it. This will save memory and disk space 
# and speed up some things. 
#skip-innodb 

# Additional memory pool that is used by InnoDB to store metadata 
# information. If InnoDB requires more memory for this purpose it will 
# start to allocate it from the OS. As this is fast enough on most 
# recent operating systems, you normally do not need to change this 
# value. SHOW INNODB STATUS will display the current amount used. 
innodb_additional_mem_pool_size=20M 

# If set to 1, InnoDB will flush (fsync) the transaction logs to the 
# disk at each commit, which offers full ACID behavior. If you are 
# willing to compromise this safety, and you are running small 
# transactions, you may set this to 0 or 2 to reduce disk I/O to the 
# logs. Value 0 means that the log is only written to the log file and 
# the log file flushed to disk approximately once per second. Value 2 
# means the log is written to the log file at each commit, but the log 
# file is only flushed to disk approximately once per second. 
innodb_flush_log_at_trx_commit=1 

# The size of the buffer InnoDB uses for buffering log data. As soon as 
# it is full, InnoDB will have to flush it to disk. As it is flushed 
# once per second anyway, it does not make sense to have it very large 
# (even with long transactions). 
innodb_log_buffer_size=8M 

# InnoDB, unlike MyISAM, uses a buffer pool to cache both indexes and 
# row data. The bigger you set this the less disk I/O is needed to 
# access data in tables. On a dedicated database server you may set this 
# parameter up to 80% of the machine physical memory size. Do not set it 
# too large, though, because competition of the physical memory may 
# cause paging in the operating system. Note that on 32bit systems you 
# might be limited to 2-3.5G of user level memory per process, so do not 
# set it too high. 
innodb_buffer_pool_size=512M 

# Size of each log file in a log group. You should set the combined size 
# of log files to about 25%-100% of your buffer pool size to avoid 
# unneeded buffer pool flush activity on log file overwrite. However, 
# note that a larger logfile size will increase the time needed for the 
# recovery process. 
innodb_log_file_size=10M 

# Number of threads allowed inside the InnoDB kernel. The optimal value 
# depends highly on the application, hardware as well as the OS 
# scheduler properties. A too high value may lead to thread thrashing. 
innodb_thread_concurrency=8 



[mysqld] 
port=3306 

+0

You can simply delete the duplicates in place. – NullUserException

+1

In order to filter out duplicates, you need to tell us what counts as a duplicate in 'OLD_TABLE'. Sample data would help. –

+0

@user430997: I edited the title so that it says something about the question; feel free to improve it if you think I missed something important. –

Answers

2

This will populate NEW_TABLE with unique values, where the id value is the first (lowest) id of each group:

INSERT INTO NEW_TABLE 
    SELECT MIN(ot.id), 
     ot.city, 
     ot.post_code, 
     ot.short_ccode 
    FROM OLD_TABLE ot 
GROUP BY ot.city, ot.post_code, ot.short_ccode 

If you want the highest id per group instead:

INSERT INTO NEW_TABLE 
    SELECT MAX(ot.id), 
     ot.city, 
     ot.post_code, 
     ot.short_ccode 
    FROM OLD_TABLE ot 
GROUP BY ot.city, ot.post_code, ot.short_ccode 
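
As a quick sanity check, the MIN(id) variant can be tried on the three sample rows from the question. The sketch below is in Python and uses the built-in sqlite3 module as a stand-in for MySQL, an assumption made only so the example runs without a server. The column types are guesses as well, since the question does not show the table definition.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_table (id INTEGER, city TEXT, post_code TEXT, short_ccode TEXT);
    CREATE TABLE new_table (id INTEGER, city TEXT, post_code TEXT, short_ccode TEXT);
    INSERT INTO old_table VALUES
        (4732875, 'Celje', '3502', 'si'),
        (4733306, 'Celje', '3502', 'si'),
        (4734250, 'Celje', '3502', 'si');
""")

# Same statement as in the answer: one row per group, keeping the lowest id.
conn.execute("""
    INSERT INTO new_table
    SELECT MIN(ot.id), ot.city, ot.post_code, ot.short_ccode
    FROM old_table ot
    GROUP BY ot.city, ot.post_code, ot.short_ccode
""")

rows = conn.execute("SELECT * FROM new_table").fetchall()
print(rows)
```

The three Celje rows collapse into a single row carrying the smallest id, 4732875.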
+0

+1 – Unreason

0

You don't need to group the data. Try this:

delete from old_table 
    USING old_table, old_table as vtable 
    WHERE (old_table.id > vtable.id) 
    AND (old_table.city=vtable.city AND 
old_table.post_code=vtable.post_code 
AND old_table.short_ccode=vtable.short_ccode) 

I can't comment on posts because of my reputation points ... First run REPAIR TABLE old_table; next, show the output of:

EXPLAIN SELECT old_table.id FROM old_table, old_table as vtable 
     WHERE (old_table.id > vtable.id) 
     AND (old_table.city=vtable.city AND 
    old_table.post_code=vtable.post_code 
    AND old_table.short_ccode=vtable.short_ccode) 

Also show: os ~> ulimit -a; mysql> SHOW VARIABLES LIKE 'open_files_limit';

Next: remove all restrictions on the mysql process.

ulimit -n 1024 etc.

0

To avoid the memory problem, avoid the big SELECT by using a small external program that follows the logic below. First, back up your database. Then:

do { 
# find a record 
x=sql: select * from table1 limit 1; 
if (null x) 
then 
exit # no more data in table1 
fi 
insert x into table2 

# find the value of the field that should NOT be duplicated 
a=parse(x for table1.a) 
# delete all such entries from table1 
sql: delete from table1 where a='$a'; 

} 
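
The loop above could be written, for instance, as the following Python sketch. It uses the built-in sqlite3 module as a stand-in for a MySQL connection so that it runs anywhere (an assumption; a real MySQL driver would be used instead), and the function name `move_unique_rows` and the two-column layout of `table1` and `table2` are hypothetical.

```python
import sqlite3

def move_unique_rows(conn):
    """Move one row per distinct value of `a` from table1 to table2."""
    cur = conn.cursor()
    moved = 0
    while True:
        # Find any remaining record.
        row = cur.execute("SELECT id, a FROM table1 LIMIT 1").fetchone()
        if row is None:
            break  # no more data in table1
        cur.execute("INSERT INTO table2 (id, a) VALUES (?, ?)", row)
        # Delete every row sharing the value that must not be duplicated.
        cur.execute("DELETE FROM table1 WHERE a = ?", (row[1],))
        conn.commit()
        moved += 1
    return moved

# Tiny demonstration with made-up data (hypothetical table layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, a TEXT);
    CREATE TABLE table2 (id INTEGER, a TEXT);
    INSERT INTO table1 VALUES (1, 'Celje'), (2, 'Celje'), (3, 'Maribor');
""")
moved = move_unique_rows(conn)
remaining = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]
kept = sorted(r[0] for r in conn.execute("SELECT a FROM table2"))
print(moved, remaining, kept)
```

Each pass handles one distinct value, so memory use stays small, at the cost of one round trip per distinct value. The sketch assumes `a` is never NULL: a comparison with NULL matches nothing, so such a row would never be deleted and the loop would not terminate.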
1

A bit dirty perhaps, but it has done the trick for me the few times I have needed it: Remove duplicate entries in MySQL. Basically, you simply create a unique index consisting of all the columns that you want to be unique in the table.

As always with this kind of procedure, making a backup before proceeding is recommended.

1

MySQL has INSERT IGNORE. From the documentation:

[...] However, when INSERT IGNORE is used, the insert operation fails silently for the row containing the unmatched value, but any rows that are matched are inserted.

So you can use your query from the top by simply adding an IGNORE, provided new_table has a unique index on the columns that define a duplicate.
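
A minimal sketch of the idea, again in Python with the built-in sqlite3 module standing in for MySQL (an assumption). SQLite spells the statement INSERT OR IGNORE, whereas in MySQL it would begin with INSERT IGNORE; the index name `uniq_place` and the column types are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_table (id INTEGER, city TEXT, post_code TEXT, short_ccode TEXT);
    CREATE TABLE new_table (id INTEGER, city TEXT, post_code TEXT, short_ccode TEXT);
    -- The unique index is what causes the duplicate rows to be skipped.
    CREATE UNIQUE INDEX uniq_place ON new_table (city, post_code, short_ccode);
    INSERT INTO old_table VALUES
        (4732875, 'Celje', '3502', 'si'),
        (4733306, 'Celje', '3502', 'si'),
        (4734250, 'Celje', '3502', 'si');
""")

# SQLite syntax; the MySQL equivalent would start with INSERT IGNORE INTO.
conn.execute("INSERT OR IGNORE INTO new_table SELECT * FROM old_table ORDER BY id")

rows = conn.execute("SELECT * FROM new_table").fetchall()
print(rows)
```

Only the first row of each (city, post_code, short_ccode) combination is stored; later rows that would violate the unique index are skipped without raising an error.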

+0

Works with MyISAM but not with InnoDB –

0

In my experience, once your table grows to several million records or more, the most efficient way to deal with duplicates is: 1) export the data to text files, 2) sort the file, 3) remove the duplicates in the file, 4) load the result back into the database.

As the data size grows, this approach eventually becomes faster than any SQL query you can come up with.
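
Steps 2 and 3 might look like the following Python sketch. It assumes the export is a CSV file with a header row and that the whole file fits in memory for sorting; for a larger export an external sort tool would be needed. The function name `dedupe_export` is hypothetical, and in-memory strings stand in for the real export and import files.

```python
import csv
import io
from itertools import groupby

def dedupe_export(src, dst, key_columns):
    """Copy CSV rows from src to dst, keeping one row per distinct key."""
    reader = csv.DictReader(src)

    def key(row):
        return tuple(row[c] for c in key_columns)

    rows = sorted(reader, key=key)  # step 2: sort by the duplicate-defining columns
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for _, group in groupby(rows, key=key):  # step 3: keep the first row of each group
        writer.writerow(next(group))

# Demonstration on the sample rows from the question.
exported = io.StringIO(
    "id,city,post_code,short_ccode\n"
    "4732875,Celje,3502,si\n"
    "4733306,Celje,3502,si\n"
    "4734250,Celje,3502,si\n"
)
cleaned = io.StringIO()
dedupe_export(exported, cleaned, ["city", "post_code", "short_ccode"])
result_lines = cleaned.getvalue().splitlines()
print(result_lines)
```

Because Python's sort is stable, the row kept for each group is the one that appeared first in the export, so exporting in id order keeps the lowest id.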
