2017-03-28 2 views
3

Split a large JSON file into multiple smaller files

I have a large JSON file, about 5 million records and roughly 32 GB in size, that I need to load into our Snowflake data warehouse. I need to break this file up into chunks of about 200,000 records (about 1.25 GB) per file. I'd like to do this in either Node.js or Python for deployment to an AWS Lambda function; unfortunately, I haven't coded in either yet. I have C# and a lot of SQL experience, and learning both Node and Python is on my to-do list, so why not dive right in, right?

My first question is: "Which language would better serve this function, Python or Node.js?"

I know I don't want to read the entire JSON file into memory (or even an entire smaller output file). I need to be able to "stream" it in and out to a new file based on a record count (200k), properly close the JSON objects, and continue into a new file for another 200k, and so on. I know Node can do this, but if Python can do it too, I feel it would be easier to get started quickly with it for the other ETL work I'll be doing soon.

My second question is: "Based on your recommendation above, can you also recommend which modules I should use/import to get started, mainly with regard to not pulling the entire JSON file into memory?" Or, if you're feeling really generous, some sample code to help me get into the thick of it? Thanks in advance for taking the time to help!

I can't include a sample of the JSON data, as it contains personal information, but I can provide the JSON schema...

{ 
    "$schema": "http://json-schema.org/draft-04/schema#", 
    "items": { 
    "properties": { 
     "activities": { 
     "properties": { 
      "activity_id": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "frontlineorg_id": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "import_id": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "is_source": { 
      "items": { 
       "type": "boolean" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "address": { 
     "properties": { 
      "city": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "congress_dist_name": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "congress_dist_number": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "congress_end_yr": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "congress_number": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "congress_start_yr": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "county": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "formatted": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "latitude": { 
      "items": { 
       "type": "number" 
      }, 
      "type": "array" 
      }, 
      "longitude": { 
      "items": { 
       "type": "number" 
      }, 
      "type": "array" 
      }, 
      "number": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "observes_dst": { 
      "items": { 
       "type": "boolean" 
      }, 
      "type": "array" 
      }, 
      "post_directional": { 
      "items": { 
       "type": "null" 
      }, 
      "type": "array" 
      }, 
      "pre_directional": { 
      "items": { 
       "type": "null" 
      }, 
      "type": "array" 
      }, 
      "school_district": { 
      "items": { 
       "properties": { 
       "school_dist_name": { 
        "items": { 
        "type": "string" 
        }, 
        "type": "array" 
       }, 
       "school_dist_type": { 
        "items": { 
        "type": "string" 
        }, 
        "type": "array" 
       }, 
       "school_grade_high": { 
        "items": { 
        "type": "string" 
        }, 
        "type": "array" 
       }, 
       "school_grade_low": { 
        "items": { 
        "type": "string" 
        }, 
        "type": "array" 
       }, 
       "school_lea_code": { 
        "items": { 
        "type": "integer" 
        }, 
        "type": "array" 
       } 
       }, 
       "type": "object" 
      }, 
      "type": "array" 
      }, 
      "secondary_number": { 
      "items": { 
       "type": "null" 
      }, 
      "type": "array" 
      }, 
      "secondary_unit": { 
      "items": { 
       "type": "null" 
      }, 
      "type": "array" 
      }, 
      "state": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "state_house_dist_name": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "state_house_dist_number": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "state_senate_dist_name": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "state_senate_dist_number": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "street": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suffix": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "timezone": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "utc_offset": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "zip": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "age": { 
     "type": "integer" 
     }, 
     "anniversary": { 
     "properties": { 
      "date": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "baptism": { 
     "properties": { 
      "church_id": { 
      "type": "null" 
      }, 
      "date": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "birth_dd": { 
     "type": "integer" 
     }, 
     "birth_mm": { 
     "type": "integer" 
     }, 
     "birth_yyyy": { 
     "type": "integer" 
     }, 
     "church_attendance": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "likelihood": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "cohabiting": { 
     "properties": { 
      "confidence": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "likelihood": { 
      "items": { 
       "type": "null" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "dating": { 
     "properties": { 
      "bool": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "divorced": { 
     "properties": { 
      "bool": { 
      "items": { 
       "type": "null" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "likelihood_considering": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "education": { 
     "properties": { 
      "est_level": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "email": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "is_work_school": { 
      "items": { 
       "type": "boolean" 
      }, 
      "type": "array" 
      }, 
      "string": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "engaged": { 
     "properties": { 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "likelihood": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "est_income": { 
     "properties": { 
      "est_level": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "ethnicity": { 
     "type": "string" 
     }, 
     "first_name": { 
     "type": "string" 
     }, 
     "formatted_birthdate": { 
     "type": "string" 
     }, 
     "gender": { 
     "type": "string" 
     }, 
     "head_of_household": { 
     "properties": { 
      "bool": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "home_church": { 
     "properties": { 
      "church_id": { 
      "type": "null" 
      }, 
      "group_participant": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "is_coaching": { 
      "type": "null" 
      }, 
      "is_giving": { 
      "type": "null" 
      }, 
      "is_serving": { 
      "type": "null" 
      }, 
      "membership_date": { 
      "type": "null" 
      }, 
      "regular_attendee": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "hub_poid": { 
     "type": "integer" 
     }, 
     "insert_datetime_utc": { 
     "type": "string" 
     }, 
     "ip_address": { 
     "properties": { 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "string": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "last_name": { 
     "type": "string" 
     }, 
     "marriage_segment": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "string": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "married": { 
     "properties": { 
      "bool": { 
      "items": { 
       "type": "boolean" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "middle_name": { 
     "type": "string" 
     }, 
     "miscellaneous": { 
     "properties": { 
      "attribute": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "value": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "name_suffix": { 
     "type": "null" 
     }, 
     "name_title": { 
     "type": "null" 
     }, 
     "newlywed": { 
     "properties": { 
      "bool": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "parent": { 
     "properties": { 
      "bool": { 
      "items": { 
       "type": "boolean" 
      }, 
      "type": "array" 
      }, 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "likelihood_expecting": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "person_id": { 
     "type": "integer" 
     }, 
     "phone": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "number": { 
      "items": { 
       "type": "integer" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "type": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "property_rights": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "string": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "psychographic_cluster": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "string": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "religion": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "string": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "religious_segment": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "string": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     }, 
     "separated": { 
     "properties": { 
      "bool": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "significant_other": { 
     "properties": { 
      "first_name": { 
      "type": "null" 
      }, 
      "insert_datetime_utc": { 
      "type": "null" 
      }, 
      "last_name": { 
      "type": "null" 
      }, 
      "middle_name": { 
      "type": "null" 
      }, 
      "name_suffix": { 
      "type": "null" 
      }, 
      "name_title": { 
      "type": "null" 
      }, 
      "suppressed_datetime_utc": { 
      "type": "null" 
      } 
     }, 
     "type": "object" 
     }, 
     "suppressed_datetime_utc": { 
     "type": "string" 
     }, 
     "target_group": { 
     "properties": { 
      "insert_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "string": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      }, 
      "suppressed_datetime_utc": { 
      "items": { 
       "type": "string" 
      }, 
      "type": "array" 
      } 
     }, 
     "type": "object" 
     } 
    }, 
    "type": "object" 
    }, 
    "type": "array" 
} 

Is there anything special about your JSON formatting? For example, is each record on its own line, or does each record start with a line containing only '{' and end with a line containing only '}', with indentation inside? If so, a trivial file-parsing script could help :)
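If each record does turn out to sit on its own line, the trivial script mentioned above might look something like this sketch. It assumes one complete JSON record per line, which the question does not confirm; the function name `split_by_lines` and the `part_00000.jsonl` naming are made up for illustration.

```python
from pathlib import Path


def split_by_lines(src_path, out_dir, records_per_file=200_000):
    """Copy src_path into numbered files holding records_per_file lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts, out, count = [], None, 0
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            for line in src:  # file objects iterate lazily, one line at a time
                if not line.strip():
                    continue  # ignore blank lines
                if out is None:
                    # Hypothetical naming scheme: part_00000.jsonl, part_00001.jsonl, ...
                    path = out_dir / f"part_{len(parts):05d}.jsonl"
                    out = open(path, "w", encoding="utf-8")
                    parts.append(path)
                out.write(line if line.endswith("\n") else line + "\n")
                count += 1
                if count == records_per_file:
                    out.close()
                    out, count = None, 0
    finally:
        if out is not None:
            out.close()
    return parts
```

Only one line is held in memory at a time, so neither the input nor any output file is ever loaded whole. The output files stay line-delimited rather than being wrapped in a JSON array.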


My code to split the JSON by each valid group is `csplit -n 6 -f _ '/\{(?:[^{}|(?R)])*\}/'`. The `-f` just adds a prefix to the output files.

Answer

1

Whether Python or Node would be better for the task is a matter of opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself which one you have more experience with and which one you want to work with: Python or Node.

If you go with Node, there are modules that do streaming JSON parsing and can help you with this task.

If you go with Python, there are streaming JSON parsers there as well.
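For the Python route, here is a minimal sketch of the streaming split that uses only the standard library (`json.JSONDecoder.raw_decode`) rather than any particular streaming parser package. It assumes the input is a single top-level JSON array of records, as the schema in the question indicates. The names `iter_array_items` and `split_json_array`, the `part_00000.json` file naming, and the 1 MiB read buffer are illustrative choices, not anything taken from the question.

```python
import json
import re
from pathlib import Path

_SEPARATORS = re.compile(r"[ \t\r\n,]*")  # whitespace and commas between items
_WHITESPACE = re.compile(r"[ \t\r\n]*")


def iter_array_items(fp, buf_size=1 << 20):
    """Yield the elements of a top-level JSON array, reading fp in chunks."""
    decoder = json.JSONDecoder()
    buf, pos, eof, opened = "", 0, False, False
    while True:
        pos = _SEPARATORS.match(buf, pos).end()
        if pos < len(buf):
            if not opened:
                if buf[pos] != "[":
                    raise ValueError("expected a top-level JSON array")
                opened, pos = True, pos + 1
                continue
            if buf[pos] == "]":
                return
            try:
                item, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if eof:
                    raise  # genuinely malformed, not just incomplete
            else:
                # Accept the item only once its terminator is in the buffer,
                # so a number cut off at a chunk boundary is not misread.
                nxt = _WHITESPACE.match(buf, end).end()
                if nxt < len(buf) and buf[nxt] in ",]":
                    yield item
                    pos = end
                    continue
                if eof:
                    raise ValueError("malformed or truncated JSON array")
        elif eof:
            if opened:
                raise ValueError("truncated JSON array")
            return  # empty input
        # The buffered text is incomplete: drop what was consumed, read more.
        chunk = fp.read(buf_size)
        eof = not chunk
        buf, pos = buf[pos:] + chunk, 0


def split_json_array(src_path, out_dir, records_per_file=200_000, buf_size=1 << 20):
    """Write the array in src_path as numbered files of records_per_file items."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts, out, count = [], None, 0
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            for item in iter_array_items(src, buf_size):
                if out is None:
                    # Hypothetical naming scheme: part_00000.json, part_00001.json, ...
                    path = out_dir / f"part_{len(parts):05d}.json"
                    out = open(path, "w", encoding="utf-8")
                    parts.append(path)
                    out.write("[")
                else:
                    out.write(",\n")
                json.dump(item, out, ensure_ascii=False)
                count += 1
                if count == records_per_file:
                    out.write("]")  # close the array so the file is valid JSON
                    out.close()
                    out, count = None, 0
        if out is not None:
            out.write("]")
    finally:
        if out is not None:
            out.close()
    return parts
```

A call such as `split_json_array("people.json", "chunks")` (both paths hypothetical) would write files of 200,000 records each, every one a valid JSON array. Memory use is bounded by the read buffer plus one record, as long as the input is well formed; a malformed record is only detected once the rest of the file has been buffered, so a production version would want a cap on buffer growth.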