Big data has brought with it many new terms, and these terms are often hard to understand. This article therefore offers a glossary of common big data terms for your reference. Some of the definitions draw on the corresponding blog articles. Of course, this glossary does not claim to cover every term; if you feel anything is missing, please let us know.
Aggregation - the process of searching for, gathering, and presenting data
Algorithm - a mathematical formula or procedure that can perform a certain kind of data analysis
Analytics - the discovery of the inner meaning of data
Anomaly detection - searching a dataset for items that do not match an expected pattern or behavior. Besides "anomalies", such items are also called outliers, exceptions, surprises, or contaminants, and they often provide critical, actionable information
Anonymization - making data anonymous, that is, removing everything that could tie the data to personal privacy
Application - computer software that performs a particular function
Artificial Intelligence (AI) - the research and development of intelligent machines and software that can perceive their environment, respond to requests accordingly, and even learn by themselves
Behavioural Analytics - analysis that draws conclusions from users' behavior, such as "how", "why", and "what" they did, rather than merely from personal attributes and timing; it looks at the human side of the data
Big Data Scientist - a person able to design algorithms that turn big data into useful information
Big data startup - a young company that develops the latest big data technologies
Biometrics - identification of people based on their personal characteristics
Brontobytes (BB) - approximately 1000 yottabytes (YB), roughly the size of the digital universe of tomorrow. 1 BB written out in bytes is a 1 followed by 27 zeros!
Business Intelligence (BI) - a set of theories, methodologies, and processes that make data easier to understand
Classification analysis - a systematic process for obtaining important, relevant information about data; such data about data is also called metadata
Cloud computing - a distributed computing system built on a network, with the data stored off-premises (that is, in the cloud)
Clustering analysis - the process of grouping similar objects together, with each group of similar objects forming a cluster; this analysis aims to reveal the differences and similarities between items of data
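To make the idea concrete, here is a minimal pure-Python sketch of one clustering approach, a simple k-means on one-dimensional data (the function name and sample points are illustrative only, not from any particular library):

```python
# Illustrative k-means sketch: alternate between assigning each point to
# its nearest centroid and moving each centroid to its cluster's mean.
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(v) / len(v) if v else centroids[c]
                     for c, v in clusters.items()]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

Real clustering work would use a library implementation with a convergence test rather than a fixed iteration count, but the two alternating steps are the heart of the method.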
Cold data storage - storing old, rarely used data on low-power servers; retrieving such data can be very time-consuming
Comparative analysis - a step-by-step analysis that performs pattern matching, comparison, and calculation over very large datasets
Complex structured data - data composed of two or more complex, interrelated parts; such data cannot simply be processed with structured query languages or tools (SQL)
Computer generated data - data generated by computers, such as log files
Concurrency - performing multiple tasks or running multiple processes at the same time
Correlation analysis - a data analysis method for determining whether a positive or negative correlation exists between variables
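As a small illustration, the Pearson correlation coefficient is one common way to quantify such a relationship; the pure-Python sketch below (names are illustrative) returns a value between -1 (perfect negative correlation) and +1 (perfect positive correlation):

```python
# Pearson correlation: covariance of the two variables divided by the
# product of their standard deviations (here via sums of squared deviations).
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# ys is exactly 2 * xs, so the correlation is perfectly positive.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```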
Customer Relationship Management (CRM) - the technologies and business processes used to manage sales; big data will affect companies' customer relationship management strategies
Dashboard - a graphical display of the results obtained by analyzing data with algorithms
Data aggregation tools - tools that turn data scattered across numerous data sources into a single new data source
Data analyst - a professional who analyzes, models, cleans, and processes data
Database - a repository that stores a collection of data using a particular technology
Database-as-a-Service - a database deployed in the cloud and billed on a pay-per-use basis, for example on Amazon Web Services (AWS)
Database Management System (DBMS) - software for collecting and storing data and providing access to it
Data centre - a physical location that houses the servers on which data is stored
Data cleansing - the process of re-examining and verifying data in order to remove duplicates, correct errors, and ensure consistency
Data custodian - the technical professional responsible for maintaining the technical environment that data storage requires
Data ethical guidelines - guidelines that help organizations be transparent about their data and ensure its simplicity, security, and privacy
Data feed - a stream of data, such as a Twitter feed or an RSS subscription
Data marketplace - an online marketplace where datasets are traded
Data mining - discovering particular patterns or information in datasets
Data modelling - analyzing data objects with data modelling techniques in order to discern the inner meaning of the data
Data set - a collection of (often large amounts of) data
Data virtualization - a data integration process that provides more data by drawing on other technologies such as databases, applications, file systems, web technologies, and big data technologies
De-identification - also known as anonymization; making sure that individuals cannot be identified from the data
Discriminant analysis - classifying data: according to the classification, data can be assigned to different groups, classes, or categories. It is a statistical analysis in which the group or cluster membership of existing data is known, and classification rules are derived from it
Distributed File System - a system that offers a simplified, highly available way of storing, analyzing, and processing data
Document Store Databases - also called document-oriented databases; databases designed specifically for storing, managing, and retrieving document data, which is also known as semi-structured data
Exploratory analysis - discovering patterns in data without standard processes or methods; a way of exploring a dataset and its main characteristics
Exabytes (EB) - approximately 1000 petabytes (PB), or about one billion gigabytes; the world now produces roughly 1 EB of new information every day
ETL (Extract, Transform and Load) - a process used with databases and data warehouses: extract (E) data from a variety of data sources, transform (T) it into data that meets business needs, and finally load (L) it into the database
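A toy sketch of the three ETL stages follows; all names and sample rows here are invented for illustration, and a real pipeline would of course read from actual sources and write to an actual warehouse:

```python
# Extract: pull raw rows from a source (e.g. a CSV export or an
# operational database); hard-coded here for illustration.
def extract():
    return [{"name": " Alice ", "amount": "120"},
            {"name": "bob", "amount": "80"}]

# Transform: normalize names and cast amounts to integers so the rows
# fit the (hypothetical) warehouse schema.
def transform(rows):
    return [{"name": r["name"].strip().title(), "amount": int(r["amount"])}
            for r in rows]

# Load: append the cleaned rows into the target store.
def load(rows, warehouse):
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```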
Failover - when a server fails, automatically switching running tasks to another available server or node
Fault-tolerant design - a design that lets a system keep running even when some part of it fails
Gamification - applying game thinking and game mechanics outside of games; it is a very effective and user-friendly way of creating and collecting data
Graph Databases - databases that store data in graph structures (for example a finite set of ordered pairs, or certain kinds of entities) made up of nodes, edges, and properties. They offer index-free adjacency, meaning every element in the database is linked directly to its neighboring elements
Grid computing - linking computers in many different locations to work together on a specific problem, usually connected via the cloud
Hadoop - an open-source framework, based on a distributed system, for developing distributed applications and for computing over and storing big data
HBase - an open-source, non-relational, distributed database used together with the Hadoop framework
HDFS - the Hadoop Distributed File System; a distributed file system designed to run on commodity hardware
High-Performance Computing (HPC) - using supercomputers to solve extremely complex computational problems
In-memory database (IMDB) - a database management system that differs from conventional ones in storing data in main memory rather than on disk, which allows very fast data processing and access
Internet of Things (IoT) - equipping ordinary devices with sensors so that they can connect to the network at any time and from any place
Juridical data compliance - relevant when you use a cloud computing solution that stores your data in different countries or on different continents; you need to check that the data stored in each of those countries complies with the local laws
KeyValue Databases - databases that store data under a specific key pointing to a specific record, which makes looking up data faster and more convenient; the values stored are usually primitive types of the programming language
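In miniature, a key-value store behaves much like a dictionary: a key points straight at a record, so a lookup is a single probe rather than a table scan. A hypothetical sketch (class and key names are invented for the example):

```python
# A toy in-memory key-value store built on a Python dict, which gives
# the same key-points-to-record access pattern as a real KV database.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # One hash lookup; no scanning over other records.
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Alice", "visits": 3})
record = store.get("user:42")
```

Production systems such as Redis follow the same model but add persistence, expiry, and networking on top.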
Latency - the time delay of a system
Legacy system - an old application, old technology, or old computer system that is no longer supported
Load balancing - distributing work across multiple computers or servers in order to achieve optimal results and maximum system utilization
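Round-robin is one of the simplest load-balancing strategies; the sketch below (server and request names are illustrative) hands incoming requests to servers in strict rotation so the work spreads evenly:

```python
import itertools

# Build a round-robin assigner over a fixed pool of servers.
def round_robin(servers):
    pool = itertools.cycle(servers)  # endlessly repeats s1, s2, s3, s1, ...
    def assign(request):
        return (request, next(pool))
    return assign

assign = round_robin(["s1", "s2", "s3"])
assignments = [assign(f"req{i}") for i in range(4)]
```

Real load balancers usually add health checks and weighting, but the rotation idea is the same.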
Location data - GPS data describing geographic position
Log file - a file generated automatically by a computer system that records the system's operation
Machine2Machine data (M2M) - content exchanged and transmitted between two or more machines
Machine data - data generated by sensors on machines or by algorithms
Machine learning - a branch of artificial intelligence in which machines learn from the tasks they have carried out and improve themselves through long-term accumulation
MapReduce - a software framework for processing large-scale data (Map: distribute the work, Reduce: collect and summarize the results)
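The canonical MapReduce example is word counting. Below is a plain-Python sketch of the three conceptual phases (map, shuffle, reduce), not the Hadoop API itself; in a real cluster each phase runs in parallel across many machines:

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in every document.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

# Shuffle: group all emitted values by their key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine each key's values into a final count.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big ideas"])))
```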
Massively Parallel Processing (MPP) - using many processors (or computers) to work simultaneously on the same computational task
Metadata - data that describes data, i.e., information about the attributes of the data
MongoDB - an open-source non-relational database (NoSQL database)
Multi-Dimensional Databases - databases optimized for online analytical processing (OLAP) and for data warehousing
MultiValue Databases - a kind of non-relational (NoSQL), multi-dimensional database that can handle data with three dimensions directly; they are well suited to handling very long strings, such as HTML and XML, perfectly
NLP (Natural Language Processing) - a branch of computer science that studies the interaction between computers and human (natural) languages
Network analysis - analyzing the relationships between nodes in a network or graph, that is, the connections between nodes and the strength of those relationships
NewSQL - an elegant, well-defined database system that is easier to learn and use than SQL, and was proposed even more recently than NoSQL
NoSQL - as the name suggests, databases that "don't use SQL", that is, database types other than the traditional relational database; they typically trade strict consistency for the ability to handle data at very large scale and with high concurrency
Object Databases - (also called object-oriented databases) databases that store data in the form of objects, as in object-oriented programming. Unlike relational and graph databases, most object databases provide a query language that lets objects be accessed through declarative programming
Object-based Image Analysis - analyzing digital images using data from groups of related pixels (known as objects or image objects) rather than analyzing each pixel individually
Operational Databases - databases that carry out an organization's day-to-day operations and are vital to the business; they generally use online transaction processing, which lets users access, collect, and retrieve specific information about the company
Optimization analysis - an optimization process, driven by algorithms, carried out during the product design cycle; in it a company can design many product variants and test whether those products meet preset criteria
Ontology - in philosophy, the study of being; here it denotes the formal definition of a set of concepts in a domain and of the relationships between those concepts. (Translator's note: data is elevated here to the level of philosophy; endowed with ontological significance, it becomes an independent and objective world of its own)
Outlier detection - an outlier is an object that deviates severely from the overall average of a dataset or from a combination of the data; it lies far from the other objects in the dataset and therefore signals that something in the system is wrong and requires extra analysis
Pattern Recognition - using algorithms to identify patterns in data, which can also be used to make predictions about new data
Petabytes (PB) - approximately 1000 terabytes (TB), or about one million gigabytes (GB); the Large Hadron Collider at CERN generates about 1 PB of particle data per second
Platform-as-a-Service (PaaS) - a platform that provides all the services needed for cloud computing solutions
Predictive analysis - the most valuable kind of big data analysis; it helps predict the future (near-future) behavior of an individual, for example how likely someone is to buy certain goods, visit certain sites, or do certain things. It uses many different datasets, such as historical data, transactional data, social data, or personal customer profiles, to identify risks and opportunities
Privacy - keeping data that can identify a person separated from other data, in order to protect users' privacy
Public data - public information, or public datasets created with public funds
Quantified Self - using applications to track one's own movements throughout the day in order to better understand one's own behavior
Query - a request for information, posed in order to answer a particular question
Re-identification - combining several datasets in order to identify individuals from anonymized data
Regression analysis - determining the dependency between two variables; it assumes a one-way causal relationship between them. (Translator's note: independent and dependent variables cannot be swapped)
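For the simplest case, fitting a straight line y = a + b*x by least squares, the calculation can be sketched in a few lines of pure Python (the data points are invented for the example):

```python
# Ordinary least squares for a line y = a + b*x:
# slope b = covariance(x, y) / variance(x), intercept a = mean_y - b * mean_x.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# The sample data lie exactly on y = 1 + 2x.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Note the asymmetry: x is the independent variable and y the dependent one, so swapping them gives a different line, which is exactly the one-way relationship the definition above describes.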
RFID - radio frequency identification; an identification technology that uses wireless, non-contact radio-frequency electromagnetic fields to transmit sensor data
Real-time data - data that is created, processed, stored, analyzed, and displayed within milliseconds
Recommendation engine - an algorithm that recommends products to a user based on that user's previous purchases or on the purchasing behavior of other users
Routing analysis - analyzing many different variables for a given means of transport in order to find the optimal route, lowering fuel costs and improving efficiency
Semi-structured data - data that lacks the strict storage structure of structured data but uses tags or other markers to preserve a hierarchy within the data
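JSON is a typical example of semi-structured data: the keys act as tags that preserve hierarchy, while individual records are free to differ in shape. A short illustration (the records are invented):

```python
import json

# Two records sharing some keys but not a fixed schema: the second has
# a nested "address" object that the first lacks.
records = [
    '{"name": "Alice", "tags": ["admin"]}',
    '{"name": "Bob", "address": {"city": "Berlin"}}',
]
parsed = [json.loads(r) for r in records]
names = [p["name"] for p in parsed]
```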
Sentiment Analysis - using algorithms to analyze how people feel about certain topics
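A deliberately naive illustration of the idea scores text against small positive and negative word lists; real sentiment systems use trained models, and these word lists are invented for the example:

```python
# Tiny hypothetical sentiment lexicons.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "awful"}

# Score = (count of positive words) - (count of negative words);
# positive scores suggest positive sentiment, negative scores the opposite.
def sentiment_score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

score = sentiment_score("I love this great product")
```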
Signal analysis - analyzing the physical properties of a product by measuring how it changes over time or space, in particular using sensor data
Similarity searches - finding the most similar objects in a database, where the data objects may be of any type
Simulation analysis - simulating the operation of a process or system in a virtual environment; a simulation can take many different variables into account to ensure optimal performance
Smart grid - using a sensor network inside the energy grid to monitor its operating state in real time, which helps improve efficiency
Software-as-a-Service (SaaS) - software applications that are used through a web browser
Spatial analysis - analyzing spatial data such as geographic or topological information to derive the patterns and regularities in how data is distributed across geographic space
SQL - a programming language used to retrieve data from relational databases
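A minimal retrieval example using Python's built-in sqlite3 module (the table and data are invented for illustration):

```python
import sqlite3

# An in-memory relational database with one small table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("widget", 30), ("gadget", 50), ("widget", 20)])

# SQL is declarative: the query states *what* to retrieve (totals per
# product), and the engine decides how to compute it.
rows = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
conn.close()
```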
Structured data - identifiable data that can be organized into rows and columns; such data usually consists of records or files, or of fields that have been properly tagged within the data and can be located precisely
Terabytes (TB) - approximately 1000 gigabytes (GB); 1 TB can store about 300 hours of high-definition video
Time series analysis - analyzing well-defined data obtained through repeated measurements over time; the data must be well defined and measured at successive points spaced at equal time intervals
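One of the most basic time-series operations is smoothing with a moving average over equally spaced observations; a short sketch (the series is invented):

```python
# Moving average: replace each window of `window` consecutive observations
# with its mean, smoothing out short-term fluctuation.
def moving_average(series, window):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

smoothed = moving_average([1, 2, 3, 4, 5], 3)
```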
Topological Data Analysis - focuses on three things: recognizing patterns in complex data, clustering, and the statistical significance of the data
Transactional data - dynamic data that changes over time
Transparency - consumers want to know what their data is used for and how it is processed, and expect organizations to make this transparent
Un-structured data - usually taken to mean large amounts of plain-text data, which may also contain dates, numbers, and facts
Value - (Translator's note: one of the 4 Vs of big data) all this available data can create tremendous value for organizations, society, and consumers; this means that major companies, and entire industries, will benefit from big data
Variability - the meaning of data is constantly (and rapidly) changing; for example, the same word appearing in different tweets can mean completely different things
Variety - (Translator's note: one of the 4 Vs of big data) data comes in many forms, such as structured data, semi-structured data, unstructured data, and even complex structured data
Velocity - (Translator's note: one of the 4 Vs of big data) in the era of big data, the creation, storage, analysis, and visualization of data must all happen at high speed
Veracity - organizations need to ensure that their data is authentic so that the analysis built on it is correct; veracity therefore refers to the accuracy of the data
Visualization - raw data can only be put to use once it has been correctly visualized; "visualization" here means not ordinary graphs or pie charts but complex charts that contain large amounts of data yet remain easy to read and understand
Volume - (Translator's note: one of the 4 Vs of big data) the amount of data, ranging from megabytes to brontobytes
Weather data - an important source of open public data; combined with other data sources, it can give relevant organizations a basis for in-depth analysis
XML Databases - databases that store data in XML format; they are usually associated with document-oriented databases, and developers can query the data in an XML database and export it in a specified serialized format
Yottabytes (YB) - approximately 1000 zettabytes (ZB), or the data capacity of about 250 trillion DVDs; the entire digital universe today amounts to about 1 YB, and it will double every 18 months
Zettabytes (ZB) - approximately 1000 exabytes (EB), or about one billion TB; it is predicted that by 2016 around 1 ZB of data will travel across the world's networks every day
Appendix: storage unit conversion table
1 bit = one binary digit
8 Bits = 1 Byte (byte)
1,000 Bytes = 1 Kilobyte
1,000 Kilobytes = 1 Megabyte
1,000 Megabytes = 1 Gigabyte
1,000 Gigabytes = 1 Terabyte
1,000 Terabytes = 1 Petabyte
1,000 Petabytes = 1 Exabyte
1,000 Exabytes = 1 Zettabyte
1,000 Zettabytes = 1 Yottabyte
1,000 Yottabytes = 1 Brontobyte
1,000 Brontobytes = 1 Geopbyte