GlusterFS Hacker Guide (Part 1)

xsw222 · posted on 2019-2-1 10:54:03

1  Translator 101 Lesson
1.1 Translator 101 Lesson 1: Setting the Stage
　　This is the first post in a series that will explain some of the details of writing a GlusterFS translator, using some actual code to illustrate.
　　Before we begin, a word about environments. GlusterFS is over 300K lines of code spread across a few hundred files. That’s no Linux kernel or anything, but you’re still going to be navigating through a lot of code in every code-editing session, so some kind of cross-referencing is essential. I use cscope with the vim bindings, and if I couldn’t do “ctrl-\ g” and such to jump between definitions all the time my productivity would be cut in half. You may prefer different tools, but as I go through these examples you’ll need something functionally similar to follow along. OK, on with the show.
　　The first thing you need to know is that translators are not just bags of functions and variables. They need to have a very definite internal structure so that the translator-loading code can figure out where all the pieces are. The way it does this is to use dlsym to look for specific names within your shared-object file, as follows (from xlator.c):
　　if (!(xl->fops = dlsym (handle, "fops"))) {
　　        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s", dlerror ());
　　        goto out;
　　}
　　if (!(xl->cbks = dlsym (handle, "cbks"))) {
　　        gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s", dlerror ());
　　        goto out;
　　}
　　if (!(xl->init = dlsym (handle, "init"))) {
　　        gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s", dlerror ());
　　        goto out;
　　}
　　if (!(xl->fini = dlsym (handle, "fini"))) {
　　        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s", dlerror ());
　　        goto out;
　　}
　　In this example, xl is a pointer to the in-memory object for the translator we’re loading. As you can see, it’s looking up various symbols by name in the shared object it just loaded, and storing pointers to those symbols. Some of them (e.g. init) are functions, while others (e.g. fops) are dispatch tables containing pointers to many functions. Together, these make up the translator’s public interface.
　　Most of this glue or boilerplate can easily be found at the bottom of one of the source files that make up each translator. We’re going to use the rot-13 translator just for fun, so in this case you’d look in rot-13.c to see this:
　　struct xlator_fops fops = {
　　        .readv  = rot13_readv,
　　        .writev = rot13_writev
　　};
　　struct xlator_cbks cbks = {};
　　struct volume_options options[] = {
　　        { .key = {"encrypt-write"}, .type = GF_OPTION_TYPE_BOOL },
　　        { .key = {"decrypt-read"},  .type = GF_OPTION_TYPE_BOOL },
　　        { .key = {NULL} },
　　};
　　The fops table, defined in xlator.h, is one of the most important pieces. This table contains a pointer to each of the filesystem functions that your translator might implement – open, read, stat, chmod, and so on. There are 82 such functions in all, but don’t worry; any that you don’t specify here will be seen as null and filled in with defaults from defaults.c when your translator is loaded. In this particular example, since rot-13 is an exceptionally simple translator, we only fill in two entries, for readv and writev.
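　　To make the "unspecified entries get defaults" idea concrete, here is a minimal sketch in plain C. It is not the real xlator_fops or the real loader: ops_table, fill_defaults, my_readv, and the default_* functions are all hypothetical names invented for illustration, and the table has three slots instead of dozens. It only shows how designated initializers leave unnamed members NULL, and how a loader could then patch those slots.

```c
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for a dispatch table. */
typedef int (*op_fn)(int arg);

struct ops_table {
    op_fn readv;
    op_fn writev;
    op_fn stat;
};

/* Stand-ins for the pass-through defaults; here they just echo the argument. */
static int default_readv(int arg)  { return arg; }
static int default_writev(int arg) { return arg; }
static int default_stat(int arg)   { return arg; }

/* A translator-specific override (illustrative behavior only). */
static int my_readv(int arg) { return arg + 13; }

/* Only readv is named; writev and stat are implicitly NULL. */
struct ops_table my_ops = {
    .readv = my_readv,
};

/* Fill every slot the translator left NULL with the matching default. */
void fill_defaults(struct ops_table *t)
{
    if (!t->readv)  t->readv  = default_readv;
    if (!t->writev) t->writev = default_writev;
    if (!t->stat)   t->stat   = default_stat;
}
```

　　After fill_defaults(&my_ops), the readv slot still points at my_readv while writev and stat point at the defaults, which mirrors what happens to rot-13's two-entry table.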

　　There are actually two other tables, also required to have predefined names, that are also used to find translator functions: cbks (which is empty in this snippet) and dumpops (which is missing entirely). The first of these specifies entry points for when inodes are forgotten or file descriptors are released.
　　The last piece I’ll cover today is options. As you can see, this is a table of translator-specific option names and some information about their types. GlusterFS actually provides a pretty rich set of types (volume_option_type_t in options.h) which includes paths, translator names, percentages, and times in addition to the obvious integers and strings. Also, the volume_option_t structure can include further information about each option, such as the default values, descriptions, permitted string values, and min/max bounds seen in these more complex entries:
　　{ .key = {"data-self-heal-algorithm"},
　　  .type = GF_OPTION_TYPE_STR,
　　  .default_value = "",
　　  .description = "Select between \"full\", \"diff\". The "
　　                 "\"full\" algorithm copies the entire file from "
　　                 "source to sink. The \"diff\" algorithm copies to "
　　                 "sink only those blocks whose checksums don't match "
　　                 "with those of source.",
　　  .value = { "diff", "full", "" }
　　},
　　{ .key = {"data-self-heal-window-size"},
　　  .type = GF_OPTION_TYPE_INT,
　　  .min = 1,
　　  .max = 1024,
　　  .default_value = "1",
　　  .description = "Maximum number blocks per file for which self-heal "
　　                 "process would be applied simultaneously."
　　},
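　　As a rough illustration of what such a table makes possible, here is a minimal sketch of checking a supplied value against a descriptor. It assumes a simplified, hypothetical struct opt_desc rather than the real volume_option_t, and validate_option is an invented helper, not a GlusterFS function; the real option parser handles many more types.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ALLOWED 4

/* Hypothetical, simplified option descriptor (not the real volume_option_t). */
enum opt_type { OPT_INT, OPT_STR };

struct opt_desc {
    const char   *key;
    enum opt_type type;
    long          min;                   /* OPT_INT only */
    long          max;                   /* OPT_INT only */
    const char   *allowed[MAX_ALLOWED];  /* OPT_STR only, NULL-padded */
};

const struct opt_desc demo_options[] = {
    { .key = "data-self-heal-window-size", .type = OPT_INT,
      .min = 1, .max = 1024 },
    { .key = "data-self-heal-algorithm", .type = OPT_STR,
      .allowed = { "diff", "full" } },
    { .key = NULL },
};

/* Return 0 if value is acceptable for key, -1 otherwise (including
 * unknown keys). */
int validate_option(const struct opt_desc *table, const char *key,
                    const char *value)
{
    const struct opt_desc *d;
    int i;

    for (d = table; d->key; d++) {
        if (strcmp(d->key, key) != 0)
            continue;
        if (d->type == OPT_INT) {
            char *end = NULL;
            long n;

            errno = 0;
            n = strtol(value, &end, 10);
            if (errno || end == value || *end != '\0')
                return -1;               /* not a clean integer */
            return (n >= d->min && n <= d->max) ? 0 : -1;
        }
        for (i = 0; i < MAX_ALLOWED && d->allowed[i]; i++) {
            if (strcmp(d->allowed[i], value) == 0)
                return 0;
        }
        return -1;                       /* string not in the allowed set */
    }
    return -1;                           /* unknown key */
}
```

　　With this table, a window size of "16" passes while "2048" is rejected for exceeding the maximum, and an algorithm of "fast" is rejected because it is not one of the listed strings.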
　　When your translator is loaded, all of this information is used to parse the options actually provided in the volfile, and then the result is turned into a dictionary and stored as xl->options. This dictionary is then processed by your init function, which you can see being looked up in the first code fragment above. We’re only going to look at a small part of rot-13's init for now.
　　priv->decrypt_read = 1;
　　priv->encrypt_write = 1;
　　data = dict_get (this->options, "encrypt-write");
　　if (data) {
　　        if (gf_string2boolean (data->data, &priv->encrypt_write) == -1) {
　　                gf_log (this->name, GF_LOG_ERROR,
　　                        "encrypt-write takes only boolean options");
　　                return -1;
　　        }
　　}
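　　The default-then-override pattern in this fragment can be tried outside GlusterFS with a small sketch. Everything here is hypothetical: demo_priv stands in for the private structure, parse_bool is an invented converter (the set of spellings it accepts is an assumption for this sketch, not a statement about gf_string2boolean), and a NULL value models an option that was not supplied.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-in for a translator's private structure. */
struct demo_priv {
    int decrypt_read;
    int encrypt_write;
};

/* Convert a boolean-looking string to 0 or 1.
 * Returns 0 on success, -1 if the string is not recognized; *out is
 * left untouched on failure. */
int parse_bool(const char *s, int *out)
{
    static const char *const yes[] = { "1", "on", "yes", "true", "enable" };
    static const char *const no[]  = { "0", "off", "no", "false", "disable" };
    size_t i;

    for (i = 0; i < sizeof(yes) / sizeof(yes[0]); i++) {
        if (strcasecmp(s, yes[i]) == 0) {
            *out = 1;
            return 0;
        }
    }
    for (i = 0; i < sizeof(no) / sizeof(no[0]); i++) {
        if (strcasecmp(s, no[i]) == 0) {
            *out = 0;
            return 0;
        }
    }
    return -1;
}

/* Set the default first, then override only if a value was supplied.
 * value == NULL models "option not present". */
int apply_encrypt_write(struct demo_priv *priv, const char *value)
{
    priv->encrypt_write = 1;             /* default */
    if (value == NULL)
        return 0;
    return parse_bool(value, &priv->encrypt_write);
}
```

　　Returning -1 for an unrecognized string matches the fragment above, where a bad option value makes init fail rather than silently falling back to the default.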

　　What we can see here is that we’re setting some defaults in our priv structure, then looking to see if an “encrypt-write” option was actually provided. If so, we convert and store it. This is a pretty […]
　　So far we’ve covered the basics of how a translator gets loaded, how we find its various parts, and how we process its options. In my next Translator 101 post, we’ll go a little deeper into other things that init and its companion fini might do, and how some other fields in our xlator_t structure (commonly referred to as this) are commonly used.
1.2 Translator 101 Lesson 2: init, fini, and private context
　　In the previous Translator 101 post, we looked at some of the dispatch tables and options processing in a translator. This time we’re going to cover the rest of the “shell” of a translator – i.e. the other global parts not specific to handling a particular request.

　　Let’s start by looking at the […]
　　int32_t
　　init (xlator_t *this)
　　{
　　        data_t *data = NULL;
　　        rot_13_private_t *priv = NULL;
　　        if (!this->children || this->children->next) {
　　                gf_log ("rot13", GF_LOG_ERROR,
　　                        "FATAL: rot13 should have exactly one child");
　　                return -1;
　　        }
　　        if (!this->parents) {
　　                gf_log (this->name, GF_LOG_WARNING,
　　                        "dangling volume. check volfile ");
　　        }
　　        priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0);
　　        if (!priv)
　　                return -1;

　　At the very top, we see the function signature – we get a pointer to the xlator_t object that we’re initializing, and we return an int32_t status. As with most functions in the translator API, this should be zero to indicate success. In this case it’s safe to return -1 for failure, but watch out: in dispatch-table functions, the return value means the status of the function call rather than the request. A request error should be reflected as a callback with a non-zero op_ret value, but the dispatch function itself should still return zero. In fact, the handling of a non-zero return from a dispatch function is not all that robust (we recently had a bug report in HekaFS […]
　　The first thing this init function does is check that the translator is being set up in the right kind of environment. Translators are called by parents and in turn call children. Some translators are “initial” translators that inject requests into the system from elsewhere – e.g. mount/fuse injecting requests from the kernel, protocol/server injecting requests from the network. Those translators don’t need parents, but rot-13 does and so we check for that. Similarly, some translators are “final” translators that (from the perspective of the current process) terminate requests instead of passing them on – e.g. protocol/client passing them to another node, storage/posix passing them to a local filesystem. Other translators “multiplex” between multiple children – passing each parent request on to one (cluster/dht), some (cluster/stripe), or all (cluster/afr) of those children. Rot-13 fits into none of those categories either, so it checks that it has exactly one child. It might be more convenient or robust if translator shared libraries had standard variables describing these requirements, to be checked in a consistent way by the translator-loading infrastructure itself instead of by each separate init function, but this is the way translators work today.
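　　The distinction between a dispatch function's own return value and the status of the request is easy to get wrong, so here is a minimal sketch of it in plain C. None of these names exist in GlusterFS: demo_cbk, demo_result, record_result, and demo_lookup are hypothetical, and the "request" is just a path check. The point is only that the failure travels through the callback's op_ret and op_errno while the function itself returns zero.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: receives the outcome of the request. */
typedef void (*demo_cbk)(void *cookie, int op_ret, int op_errno);

/* Records what the callback was told, so the caller can inspect it. */
struct demo_result {
    int called;
    int op_ret;
    int op_errno;
};

void record_result(void *cookie, int op_ret, int op_errno)
{
    struct demo_result *r = cookie;

    r->called++;
    r->op_ret = op_ret;
    r->op_errno = op_errno;
}

/* Dispatch-style function. A NULL or empty path makes the *request*
 * fail, which is reported through the callback; the *call* itself is
 * still considered successful, so the function returns 0 either way. */
int demo_lookup(const char *path, demo_cbk cbk, void *cookie)
{
    if (path == NULL || path[0] == '\0') {
        cbk(cookie, -1, ENOENT);
        return 0;
    }
    cbk(cookie, 0, 0);
    return 0;
}
```

　　A caller that looked only at the return value of demo_lookup would never see the error; it has to come from the callback, which is the same convention the dispatch-table functions follow.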

　　The last thing we see in this fragment is allocating our private data area. This can literally be anything we want; the infrastructure just provides the private pointer as a convenience but takes no responsibility for how it’s used. In this case we’re using GF_CALLOC to allocate our own rot_13_private_t structure. This gets us all the benefits of GlusterFS’s memory-leak detection infrastructure, but the way we’re calling it is not quite […]
　　To finish our tour of standard initialization/termination, let’s look at the end of init and the beginning of fini:
　　        this->private = priv;
　　        gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
　　        return 0;
　　}
　　void
　　fini (xlator_t *this)
　　{
　　        rot_13_private_t *priv = this->private;
　　        if (!priv)
　　                return;
　　        this->private = NULL;
　　        GF_FREE (priv);
　　At the end of init we’re just storing our private-data pointer in the private field of our xlator_t, then returning zero to indicate that initialization succeeded. As is usually the case, our fini is even simpler. All it really has to do is GF_FREE our private-data pointer, which we do in a slightly roundabout way here. Notice how we don’t even have a return value here, since there’s nothing obvious and useful that the infrastructure could do if fini failed.
　　That’s practically everything we need to know to get our translator through loading, initialization, options processing, and termination. If we had defined no dispatch functions, we could actually configure a daemon to use our translator and it would work as a basic pass-through from its parent to a single child. In the next post I’ll cover how to build the translator and configure a daemon to use it, so that we can actually step through it in a debugger and see how it all fits together before we actually start adding functionality.
1.3 Translator 101 Lesson 3: This Time For Real
　　In the first two parts of this series, we learned how to write a basic translator skeleton that can get through loading, initialization, and option processing. This time we’ll cover how to build that translator, configure a volume to use it, and run the glusterfs daemon in debug mode.
　　Unfortunately, there’s not much direct support for writing new translators. You can check out a GlusterFS tree and splice in your own translator directory, but that’s a bit painful because you’ll have to update multiple makefiles plus a bunch of autoconf garbage. As part of the HekaFS project, I basically reverse engineered the truly necessary parts of the translator-building process and then pestered one of the Fedora glusterfs package maintainers (thanks daMaestro!) to add a glusterfs-devel package with the required headers. Since then the complexity level in the HekaFS tree has crept back up a bit, but I still remember the simple method and still consider it the easiest way to get started on a new translator. For the sake of those not using Fedora, I’m going to describe a method that doesn’t depend on that header package. What it does depend on is a GlusterFS source tree, much as you might have cloned from GitHub or the Gluster review site. This tree doesn’t have to be fully built, but you do need to run autogen.sh and configure in it. Then you can take the following simple makefile and put it in a directory with your actual source.
　　# Change these to match your source code.
　　TARGET = rot-13.so
　　OBJECTS = rot-13.o
　　# Change these to match your environment.
　　GLFS_SRC = /play/glusterfs
　　GLFS_LIB = /opt/glusterfs/3git/lib64
　　HOST_OS = GF_LINUX_HOST_OS
　　# You shouldn't need to change anything below here.
　　CFLAGS = -fPIC -Wall -O0 -g \
　　        -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
　　        -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
　　        -I$(GLFS_SRC)/contrib/uuid
　　LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread
　　$(TARGET): $(OBJECTS)
　　        $(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET)
　　Yes, it’s still Linux-specific. Mea culpa. As you can see, we’re sticking with the rot-13 example, so you can just copy the files from …/xlators/encryption/rot-13/src in your GlusterFS tree to follow along. Type “make” and you should be rewarded with a nice little .so file.
　　[jeff@gfs-i8c-01 xlator_example]$ ls -l rot-13.so
　　-rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so
　　Notice that we’ve built with optimization level zero and debugging symbols included, which would not typically be the case for a packaged version of GlusterFS. Let’s put our version of rot-13.so into a slightly different file on our system, so that it doesn’t stomp on the installed version (not that you’d ever want to use that anyway).
　　[root@gfs-i8c-01 xlator_example]# ls /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/
　　crypt.so  crypt.so.0  crypt.so.0.0.0  rot-13.so rot-13.so.0  rot-13.so.0.0.0
　　[root@gfs-i8c-01 xlator_example]# cp rot-13.so  /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so
　　These paths represent the current Gluster filesystem layout, which is likely to be deprecated in favor of the Fedora layout; your paths may vary. At this point we’re ready to configure a volume using our new translator. To do that, I’m going to suggest something that’s strongly discouraged except during development (the Gluster guys are going to hate me for this): write our own volfile. Here’s just about the simplest volfile you’ll ever see.
　　volume my-posix
　　type storage/posix
　　option directory /play/export
　　end-volume
　　volume my-rot13
　　type encryption/my-rot-13
　　subvolumes my-posix
　　end-volume
　　All we have here is a basic brick using /play/export for its data, and then an instance of our translator layered on top – no client or server is necessary for what we’re doing, and the system will automatically push a mount/fuse translator on top if there’s no server translator. To try this out, all we need is the following command (assuming the directories involved already exist).
　　[jeff@gfs-i8c-01 xlator_example]$ glusterfs --debug -f my.vol  /play/import
　　You should be rewarded with a whole lot of log output, including the text of the volfile (this is very useful for debugging problems in the field). If you go to another window on the same machine, you can see that you have a new filesystem mounted.
　　[jeff@gfs-i8c-01 ~]$ df /play/import
　　Filesystem          1K-blocks    Used Available Use% Mounted on
　　/play/xlator_example/my.vol
　　114506240 2706176  105983488 3% /play/import
　　Just for fun, write something into a file in /play/import, then look at the corresponding file in /play/export to see it all rot-13'ed for you.
　　[jeff@gfs-i8c-01 ~]$ echo hello > /play/import/a_file
　　[jeff@gfs-i8c-01 ~]$ cat /play/export/a_file
　　uryyb
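　　If you want to check that output by hand, here is a minimal standalone sketch of the rot-13 transformation itself. The function name rot13_buf is made up for this example; it is written from the usual definition of rot-13 (rotate each ASCII letter 13 places, leave everything else alone) and is not copied from rot-13.c.

```c
#include <stddef.h>

/* Rotate every ASCII letter in buf by 13 places, in place.
 * Non-letters (digits, newlines, binary bytes) are left unchanged.
 * Applying the function twice restores the original data. */
void rot13_buf(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] = (char)('a' + (buf[i] - 'a' + 13) % 26);
        else if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = (char)('A' + (buf[i] - 'A' + 13) % 26);
    }
}
```

　　Running it over the six bytes of "hello\n" yields "uryyb\n", matching what cat shows in /play/export, and running it again gives back "hello\n".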

　　There you have it – functionality you control, implemented easily, layered on top of local storage. Now you could start adding functionality – real encryption, perhaps – and inevitably having to debug it. You could do that the old-school way, with gf_log (preferred) or even plain old printf, or you could run daemons under gdb instead.
1.4 Translator 101 Lesson 4: Debugging a Translator
　　Now that we’ve learned what a translator looks like and how to build one, it’s time to run one and actually watch it work. The best way to do this is good old-fashioned gdb, as follows (using some of the examples from last time).
　　[root@gfs-i8c-01 xlator_example]# gdb glusterfs
　　GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
　　...
　　(gdb) r --debug -f my.vol /play/import
　　Starting program: /usr/sbin/glusterfs --debug -f my.vol /play/import
　　...
　　[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init]  0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel  7.13
　　If you get to this point, your glusterfs client process is already running. You can go to another window to see the mountpoint, do file operations, etc.
　　[root@gfs-i8c-01 ~]# df /play/import
　　Filesystem          1K-blocks    Used Available Use% Mounted on
　　/root/xlator_example/my.vol
　　114506240 2643968 106045568 3% /play/import
　　[root@gfs-i8c-01 ~]# ls /play/import
　　a_file
　　[root@gfs-i8c-01 ~]# cat /play/import/a_file
　　hello
　　Now let’s interrupt the process and see where we are.
　　^C
　　Program received signal SIGINT, Interrupt.
　　0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from  /lib64/libpthread.so.0
　　(gdb) info threads
　　5 Thread 0x7fffeffff700 (LWP  27206)  0x0000003a002dd8c7 in readv ()
　　from /lib64/libc.so.6
　　4 Thread 0x7ffff50e3700 (LWP  27205)  0x0000003a0060b75b in  pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
　　3 Thread 0x7ffff5f02700 (LWP 27204)  0x0000003a0060b3dc in  pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
　　2 Thread 0x7ffff6903700 (LWP  27203)  0x0000003a0060f245 in sigwait  ()
　　from /lib64/libpthread.so.0
　　* 1 Thread 0x7ffff7957700 (LWP 27196) 0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from  /lib64/libpthread.so.0
　　Like any non-toy server, this one has multiple threads. What are they all doing? Honestly, even I don’t know. Thread 1 turns out to be in event_dispatch_epoll, which means it’s the one handling all of our network I/O. Note that with the socket multi-threading patch this will change, with one thread in socket_poller per connection. Thread 2 is in glusterfs_sigwaiter, which means signals will be isolated to that thread. Thread 3 is in syncenv_task, so it’s a worker process for synchronous requests such as those used by the rebalance and repair code. Thread 4 is in janitor_get_next_fd, so it’s waiting for a chance to close no-longer-needed file descriptors on the local filesystem. (I admit I had to look that one up, BTW.) Lastly, thread 5 is in fuse_thread_proc, so it’s the one fetching requests from our FUSE interface. You’ll often see many more threads than this, but it’s a pretty good basic set. Now, let’s set a breakpoint so we can actually watch a request.
　　(gdb) b rot13_writev
　　Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119.
　　(gdb) c
　　Continuing.
　　At this point we go into our other window and do something that will involve a write.
　　[root@gfs-i8c-01 ~]# echo goodbye > /play/import/another_file
　　(back to the first window)
　　[Switching to Thread 0x7fffeffff700 (LWP 27206)]
　　Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440,  fd=0x7ffff409802c,
　　vector=0x7fffe8000cd8,  count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:119
　　119          rot_13_private_t  *priv = (rot_13_private_t *)this->private;
　　Remember how we built with debugging symbols enabled and no optimization? That will be pretty important for the next few steps. As you can see, we’re in rot13_writev, with several parameters.
　　frame is our always-present frame pointer for this request. Also, frame->local will point to any local data we created and attached to the request ourselves.
　　this is a pointer to our instance of the rot-13 translator. You can examine it if you like to see the name, type, options, parent/children, inode table, and other stuff associated with it.
　　fd is a pointer to a file-descriptor object (fd_t, not just a file-descriptor index which is what most people use “fd” for). This in turn points to an inode object (inode_t) and we can associate our own rot-13-specific data with either of these.
　　vector and count together describe the data buffers for this write, which we’ll get to in a moment.
　　offset is the offset into the file at which we’re writing.
　　iobref is a buffer-reference object, which is used to track the life cycle of buffers containing read/write data. If you look closely, you’ll notice that vector[0].iov_base points to the same address as iobref->iobrefs[0]->ptr, which should give you some […]
　　OK, now what about that vector? We can use it to examine the data being written, like this.
　　(gdb) p vector[0]
　　$2 = {iov_base = 0x7ffff7936000, iov_len = 8}
　　(gdb) x/s 0x7ffff7936000
　　0x7ffff7936000: "goodbye\n"
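　　For readers who haven't worked with scatter/gather buffers before, here is a minimal sketch of walking a vector/count pair like the one gdb just showed. It uses the standard POSIX struct iovec from <sys/uio.h>; the helpers iov_total, iov_apply, and upcase are hypothetical names made up for this example and are not part of GlusterFS.

```c
#include <ctype.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec (POSIX) */

/* Sum the lengths of all segments described by vec[0..count-1]. */
size_t iov_total(const struct iovec *vec, int count)
{
    size_t total = 0;
    int i;

    for (i = 0; i < count; i++)
        total += vec[i].iov_len;
    return total;
}

/* Call fn once per segment, passing that segment's base and length.
 * This is the same loop shape a per-buffer transform would use. */
void iov_apply(struct iovec *vec, int count,
               void (*fn)(char *buf, size_t len))
{
    int i;

    for (i = 0; i < count; i++)
        fn(vec[i].iov_base, vec[i].iov_len);
}

/* Example transform: upper-case a buffer in place. */
void upcase(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = (char)toupper((unsigned char)buf[i]);
}
```

　　The write in the gdb session has count=1 and a single 8-byte segment, but nothing stops a caller from splitting the same data across several segments, which is why the code always loops over the vector instead of assuming one buffer.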
　　It’s not always safe to view this data as a string, because it might just as well be binary data, but since we’re generating the write this time it’s safe and convenient. With that knowledge, let’s step through things a bit.
　　(gdb) s
　　120          if  (priv->encrypt_write)
　　(gdb)
　　121                   rot13_iovec  (vector, count);
　　(gdb)
　　rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
　　57             for (i = 0; i <  count; i++) {
　　(gdb)
　　58                   rot13  (vector[i].iov_base, vector[i].iov_len);
　　(gdb)
　　rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45
　　45             for (i = 0; i <  len; i++) {
　　(gdb)
　　46                   if (buf[i] >= 'a' && buf[i] <= 'z')
　　The two example translators whose sources follow weigh in at 224 and 229 lines respectively, with some of that taken up by licenses and white space. Each took less than a day to write. Please bear in mind, though, that these are only prototypes. They exist to teach and to make a point, not – in their current form – to be used in production. Making them suitable for real-world use would at least double their […]
　　negative-lookup / negative.h
　　#ifndef __NEGATIVE_H__
　　#define __NEGATIVE_H__
　　#ifndef _CONFIG_H
　　#define _CONFIG_H
　　#include "config.h"
　　#endif
　　#include "mem-types.h"
　　#include "hashfn.h"
　　#define GHOST_BUCKETS 64
　　#define GHOST_HASH(x) (SuperFastHash(x,strlen(x)) %  GHOST_BUCKETS)
　　typedef struct _ghost {
　　struct  _ghost *next;
　　char *path;
　　} ghost_t;
　　typedef struct {
　　ghost_t  *ghosts[GHOST_BUCKETS];
　　} negative_private_t;
　　enum gf_negative_mem_types_ {
　　gf_negative_mt_priv  = gf_common_mt_end + 1,
　　gf_negative_mt_ghost,
　　gf_negative_mt_end
　　};
　　#endif /* __NEGATIVE_H__ */
　　negative-lookup /  negative.c
　　#include <errno.h>   /* ENOENT */
　　#include <string.h>  /* strcmp, strlen */
　　#ifndef _CONFIG_H
　　#define _CONFIG_H
　　#include "config.h"
　　#endif
　　#include "glusterfs.h"
　　#include "xlator.h"
　　#include "logging.h"
　　#include "negative.h"
　　void
　　exorcise (xlator_t *this,  char *spirit)
　　{
　　negative_private_t  *priv = this->private;
　　ghost_t  *gp = NULL;
　　ghost_t  **gpp = NULL;
　　uint32_t bucket =  0;
　　bucket =  GHOST_HASH(spirit);
　　for  (gpp = &priv->ghosts[bucket]; *gpp; gpp =  &(*gpp)->next) {
　　gp  = *gpp;
　　if  (!strcmp(gp->path,spirit)) {
　　*gpp  = gp->next;
　　GF_FREE(gp->path);
　　GF_FREE(gp);
　　gf_log(this->name,GF_LOG_DEBUG,"removed  %s",spirit);
　　break;
　　}
　　}
　　}
　　int32_t
　　negative_lookup_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,
　　int32_t op_ret, int32_t op_errno,  inode_t *inode,
　　struct  iatt *buf, dict_t *dict, struct iatt *postparent)
　　{
　　negative_private_t  *priv = this->private;
　　ghost_t  *gp = NULL;
　　uint64_t ctx = 0;
　　uint32_t bucket =  0;
　　inode_ctx_get(inode,this,&ctx);
　　if  (op_ret < 0) {
　　gp  = GF_CALLOC(1,sizeof(ghost_t),gf_negative_mt_ghost);
　　if  (gp) {
　　gp->path  = (char *)ctx;
　　bucket  = GHOST_HASH(gp->path);
　　/* TBD:  locking */
　　gp->next  = priv->ghosts[bucket];
　　priv->ghosts[bucket]  = gp;
　　gf_log(this->name,GF_LOG_DEBUG,"added  %s",
　　(char *)ctx);
　　goto  unwind;
　　}
　　}
　　else  {
　　gf_log(this->name,GF_LOG_DEBUG,"found  %s", (char *)ctx);
　　exorcise(this,(char *)ctx);
　　}
　　/* Both  positive result and allocation failure come here. */
　　GF_FREE((void *)ctx);
　　unwind:
　　STACK_UNWIND_STRICT  (lookup, frame, op_ret, op_errno, inode, buf,
　　dict,  postparent);
　　return  0;
　　}
　　int32_t
　　negative_lookup (call_frame_t  *frame, xlator_t *this, loc_t *loc,
　　dict_t  *xattr_req)
　　{
　　negative_private_t  *priv = this->private;
　　ghost_t  *gp = NULL;
　　uint32_t bucket =  0;
　　bucket =  GHOST_HASH(loc->path);
　　for  (gp = priv->ghosts[bucket]; gp; gp = gp->next)  {
　　if  (!strcmp(gp->path,loc->path)) {
　　gf_log(this->name,GF_LOG_DEBUG,"%s (%p)  => HIT",
　　loc->path,  loc->inode);
　　STACK_UNWIND_STRICT  (lookup, frame, -1, ENOENT,
　　NULL, NULL, NULL, NULL);
　　return  0;
　　}
　　}
　　gf_log(this->name,GF_LOG_DEBUG,"%s (%p)  => MISS",
　　loc->path,  loc->inode);
　　inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
　　STACK_WIND  (frame, negative_lookup_cbk, FIRST_CHILD(this),
　　FIRST_CHILD(this)->fops->lookup,  loc, xattr_req);
　　return  0;
　　}
　　int32_t
　　negative_create_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,
　　int32_t op_ret, int32_t op_errno,  fd_t *fd, inode_t *inode,
　　struct  iatt *buf, struct iatt *preparent,
　　struct  iatt *postparent)
　　{
　　uint64_t ctx = 0;
　　inode_ctx_get(inode,this,&ctx);
　　exorcise(this,(char *)ctx);
　　GF_FREE((void *)ctx);
　　STACK_UNWIND_STRICT  (create, frame, op_ret, op_errno, fd, inode, buf,
　　preparent,  postparent);
　　return  0;
　　}
　　int32_t
　　negative_create (call_frame_t  *frame, xlator_t *this, loc_t *loc, int32_t flags,
　　mode_t  mode, fd_t *fd, dict_t *params)
　　{
　　inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
　　STACK_WIND  (frame, negative_create_cbk, FIRST_CHILD(this),
　　FIRST_CHILD(this)->fops->create,  loc, flags, mode, fd,
　　params);
　　return  0;
　　}
　　int32_t
　　negative_mkdir_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,
　　int32_t op_ret, int32_t op_errno,  inode_t *inode,
　　struct  iatt *buf, struct iatt *preparent,
　　struct  iatt *postparent)
　　{
　　uint64_t ctx = 0;
　　inode_ctx_get(inode,this,&ctx);
　　exorcise(this,(char *)ctx);
　　GF_FREE((void *)ctx);
　　STACK_UNWIND_STRICT  (mkdir, frame, op_ret, op_errno, inode,
　　buf,  preparent, postparent);
　　return  0;
　　}
　　int
　　negative_mkdir (call_frame_t  *frame, xlator_t *this, loc_t *loc, mode_t mode,
　　dict_t  *params)
　　{
　　inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
　　STACK_WIND  (frame, negative_mkdir_cbk, FIRST_CHILD(this),
　　FIRST_CHILD(this)->fops->mkdir,  loc, mode, params);
　　return  0;
　　}
　　int32_t
　　init (xlator_t *this)
　　{
　　negative_private_t *priv = NULL;
　　if (!this->children  || this->children->next) {
　　gf_log ("negative",  GF_LOG_ERROR,
　　"FATAL:  negative should have exactly one child");
　　return -1;
　　}
　　if (!this->parents)  {
　　gf_log (this->name,  GF_LOG_WARNING,
　　"dangling  volume. check volfile ");
　　}

　　priv = GF_CALLOC (1, sizeof (negative_private_t), gf_negative_mt_priv);
　　if (!priv)
　　return  -1;
　　this->private  = priv;
　　gf_log ("negative",  GF_LOG_DEBUG, "negative xlator loaded");
　　return 0;
　　}
　　void
　　fini (xlator_t *this)
　　{
　　negative_private_t *priv =  this->private;
　　if  (!priv)
　　return;
　　this->private  = NULL;
　　GF_FREE (priv);
　　return;
　　}
　　struct xlator_fops fops = {
　　.lookup  = negative_lookup,
　　.create  = negative_create,
　　.mkdir =  negative_mkdir,
　　};
　　struct xlator_cbks cbks = {
　　};
　　struct volume_options options[] = {
　　{ .key =  {NULL} },
　　};
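　　The bucket-chain handling in negative.c (a fixed array of singly linked lists, with removal done through a pointer-to-pointer as in exorcise) can be tried outside GlusterFS with a minimal sketch like the following. All demo_* names are hypothetical, plain malloc/free stand in for GF_CALLOC/GF_FREE, and a simple djb2-style hash stands in for SuperFastHash. Like the original, it does no locking.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_BUCKETS 64

struct demo_ghost {
    struct demo_ghost *next;
    char *path;
};

struct demo_cache {
    struct demo_ghost *buckets[DEMO_BUCKETS];
};

/* Simple string hash (djb2 style), standing in for SuperFastHash. */
static unsigned demo_hash(const char *s)
{
    unsigned long h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return (unsigned)(h % DEMO_BUCKETS);
}

/* Return 1 if path is cached as "known missing", 0 otherwise. */
int demo_contains(const struct demo_cache *c, const char *path)
{
    const struct demo_ghost *g;

    for (g = c->buckets[demo_hash(path)]; g; g = g->next) {
        if (strcmp(g->path, path) == 0)
            return 1;
    }
    return 0;
}

/* Push a copy of path onto the front of its bucket; 0 on success. */
int demo_add(struct demo_cache *c, const char *path)
{
    unsigned b = demo_hash(path);
    size_t len = strlen(path) + 1;
    struct demo_ghost *g = malloc(sizeof(*g));

    if (!g)
        return -1;
    g->path = malloc(len);
    if (!g->path) {
        free(g);
        return -1;
    }
    memcpy(g->path, path, len);
    g->next = c->buckets[b];
    c->buckets[b] = g;
    return 0;
}

/* Unlink and free the entry for path. gpp always points at the link
 * that leads to the current node, so the head and middle cases are
 * handled by the same assignment. Returns 1 if removed, 0 if absent. */
int demo_remove(struct demo_cache *c, const char *path)
{
    struct demo_ghost **gpp;

    for (gpp = &c->buckets[demo_hash(path)]; *gpp; gpp = &(*gpp)->next) {
        struct demo_ghost *g = *gpp;

        if (strcmp(g->path, path) == 0) {
            *gpp = g->next;
            free(g->path);
            free(g);
            return 1;
        }
    }
    return 0;
}
```

　　In terms of the translator above, demo_contains corresponds to the loop at the top of negative_lookup, demo_add to the failure branch of negative_lookup_cbk, and demo_remove to exorcise.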
　　negative-lookup /Makefile
　　# Change these to match your source code.
　　TARGET = negative.so
　　OBJECTS = negative.o
　　# Change these to match your environment.
　　GLFS_SRC = /root/glusterfs_patches
　　GLFS_ROOT = /opt/glusterfs
　　GLFS_VERS = 3git
　　GLFS_LIB = $(GLFS_ROOT)/$(GLFS_VERS)/lib64
　　HOST_OS = GF_LINUX_HOST_OS
　　# You shouldn't need to change anything below here.
　　CFLAGS = -fPIC -Wall -O0 -g \
　　-DHAVE_CONFIG_H  -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
　　-I$(GLFS_SRC)  -I$(GLFS_SRC)/libglusterfs/src \
　　-I$(GLFS_SRC)/contrib/uuid  -I.
　　LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB)  -lglusterfs -lpthread
　　$(TARGET): $(OBJECTS)
　　$(CC)  $(CFLAGS) $(OBJECTS) $(LDFLAGS) -o $(TARGET)
　　install: $(TARGET)
　　cp $(TARGET)  $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features
　　clean:
　　rm -f $(TARGET)  $(OBJECTS)
　　negative-lookup /README.md
　　This is a very simple translator to cache  "negative lookups" for workloads in which the same file is looked  up many times in places where it doesn't exist. In particular, web script  files with many includes/requires and long paths can generate hundreds of  such lookups per front-end request. If we don't cache the negative results,  this can mean hundreds of back-end network round trips per front-end request.  So we cache. Very simple tests for this kind of workload on two machines  connected via GigE show an approximately 3x performance improvement.
　　This code is nowhere near ready for production use yet.  It was originally developed as a pedagogical example, but one that could  lead to something truly useful as well. Among other things, the  following features need to be added.
　　·       Support for other namespace-modifying operations -  link, symlink, mknod, rename, even funky xattr requests.
　　·       Time-based cache expiration to cover the case  where another client creates a file that's in our cache  because it wasn't there when we first looked it up. This might even include  periodic pruning of entries that are already stale but will never be looked  up (and therefore never reaped in-line) again.
　　·       Locking on the cache for when we're called  concurrently.
　　This is intended to be a learning tool. I might not get  back to this code myself for a long time, but I always have time to help  anyone who's learning to write translators. If you want to help move it  along, please fork and send me pull requests.
　　For more information on writing GlusterFS translators,  check out my "Translator 101" series:
　　·       http://hekafs.org/index.php/2011/11/translator-101-class-1-setting-the-stage/
　　·       http://hekafs.org/index.php/2011/11/translator-101-lesson-2-init-fini-and-private-context/
　　·       http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/
　　·       http://hekafs.org/index.php/2011/11/translator-101-lesson-4-debugging-a-translator/
　　Module 2:
　　bypass / bypass.h
　　/*
　　* Copyright (c)  2011 Red Hat
　　*/
　　#ifndef __bypass_H__
　　#define __bypass_H__
　　#ifndef _CONFIG_H
　　#define _CONFIG_H
　　#include "config.h"
　　#endif
　　#include "mem-types.h"
　　/* Deal with casts for 32-bit architectures. */
　　#define CAST2INT(x) ((uint64_t)(long)(x))
　　#define CAST2PTR(x) ((void *)(long)(x))
　　typedef struct {
　　xlator_t  *target;
　　} bypass_private_t;
　　enum gf_bypass_mem_types_ {
　　gf_bypass_mt_priv_t  = gf_common_mt_end + 1,
　　gf_by_mt_int32_t,
　　gf_bypass_mt_end
　　};
　　#endif /* __bypass_H__ */
　　bypass / bypass.c
　　/*
　　* Copyright (c)  2011 Red Hat
　　*/
　　#include <stdint.h>     /* int32_t, uint64_t */
　　#include <arpa/inet.h>  /* byte-order conversion */
　　#ifndef _CONFIG_H
　　#define _CONFIG_H
　　#include "config.h"
　　#endif
　　#include "glusterfs.h"
　　#include "call-stub.h"
　　#include "defaults.h"
　　#include "logging.h"
　　#include "xlator.h"
　　#include "bypass.h"
　　int32_t

　　bypass_readv (call_frame_t *frame, xlator_t *this, fd_t *fd, size_t size,
　　off_t offset)
　　{
　　bypass_private_t  *priv = this->private;
　　STACK_WIND  (frame, default_readv_cbk, priv->target,

　　priv->target->fops->readv, fd, size, offset);
　　return 0;
　　}
　　dict_t *
　　get_pending_dict (xlator_t *this)
　　{
　　dict_t *dict = NULL;
　　xlator_list_t *trav = NULL;
　　char *key = NULL;
　　int32_t *value = NULL;
　　xlator_t  *afr = NULL;
　　bypass_private_t  *priv = this->private;
　　dict = dict_new();
　　if (!dict) {
　　gf_log (this->name, GF_LOG_WARNING, "failed  to allocate dict");
　　return  NULL;
　　}
　　afr =  this->children->xlator;
　　for (trav = afr->children;  trav; trav = trav->next) {
　　if  (trav->xlator == priv->target) {
　　continue;
　　}
　　if (gf_asprintf(&key,"trusted.afr.%s",trav->xlator->name)  < 0) {
　　gf_log (this->name, GF_LOG_WARNING,
　　"failed to allocate key");
　　goto free_dict;
　　}
　　value = GF_CALLOC(3,sizeof(*value),gf_by_mt_int32_t);
　　if (!value) {
　　gf_log (this->name, GF_LOG_WARNING,
　　"failed to allocate value");
　　goto free_key;
　　}
　　/* Amazingly,  there's no constant for this. */
　　/* value is an int32_t array, so use the 32-bit conversion. */
　　value[0] = htonl(1);
　　if (dict_set_dynptr(dict,key,value,3*sizeof(*value))  < 0) {
　　gf_log (this->name, GF_LOG_WARNING,
　　"failed to set up dict");
　　goto free_value;
　　}
　　}
　　return  dict;
　　free_value:
　　GF_FREE(value);
　　free_key:
　　GF_FREE(key);
　　free_dict:
　　dict_unref(dict);
　　return  NULL;
　　}
　　int32_t
　　bypass_set_pending_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,
　　int32_t op_ret, int32_t op_errno,  dict_t *dict)
　　{
　　if (op_ret < 0) {
　　goto unwind;
　　}
　　call_resume(cookie);
　　return 0;
　　unwind:
　　/* The stub will never be resumed, so release it before unwinding. */
　　call_stub_destroy(cookie);
　　STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, NULL, NULL);
　　return  0;
　　}
　　int32_t
　　bypass_writev_resume (call_frame_t  *frame, xlator_t *this, fd_t *fd,
　　struct  iovec *vector, int32_t count, off_t off,
　　struct  iobref *iobref)
　　{
　　bypass_private_t  *priv = this->private;
　　STACK_WIND  (frame, default_writev_cbk, priv->target,
　　priv->target->fops->writev,  fd, vector, count, off,
　　iobref);
　　return  0;
　　}
　　int32_t
　　bypass_writev (call_frame_t  *frame, xlator_t *this, fd_t *fd,
　　struct  iovec *vector, int32_t count, off_t off,
　　struct  iobref *iobref)
　　{
　　dict_t *dict = NULL;
　　call_stub_t *stub = NULL;
　　bypass_private_t  *priv = this->private;
　　/*
　　* I wish we  could just create the stub pointing to the target's
　　* writev  function, but then we'd get into another translator's code
　　* with  "this" pointing to us.
　　*/
　　stub = fop_writev_stub(frame,  bypass_writev_resume,
　　fd, vector, count, off, iobref);
　　if (!stub) {
　　gf_log (this->name, GF_LOG_WARNING, "failed to allocate stub");
　　goto wind;
　　}
　　dict =  get_pending_dict(this);
　　if  (!dict) {
　　gf_log (this->name, GF_LOG_WARNING, "failed to allocate dict");
　　goto  free_stub;
　　}
　　STACK_WIND_COOKIE (frame, bypass_set_pending_cbk, stub,
　　priv->target, priv->target->fops->fxattrop,
　　fd, GF_XATTROP_ADD_ARRAY, dict);
　　/* Drop our reference now that the request has been wound. */
　　dict_unref(dict);
　　return 0;
　　free_stub:
　　call_stub_destroy(stub);
　　wind:
　　/* dict is always NULL here, so there is nothing to unref. */
　　STACK_WIND  (frame, default_writev_cbk, FIRST_CHILD(this),
　　FIRST_CHILD(this)->fops->writev,  fd, vector, count, off,
　　iobref);
　　return  0;
　　}
　　/*
　　* Even  applications that only read seem to call this, and it can force an
　　* unwanted  self-heal.
　　* TBD: there are  probably more like this - stat, open(O_RDONLY), etc.
　　*/
　　int32_t
　　bypass_fstat (call_frame_t  *frame, xlator_t *this, fd_t *fd)
　　{
　　bypass_private_t  *priv = this->private;
　　STACK_WIND  (frame, default_fstat_cbk, priv->target,
　　priv->target->fops->fstat,  fd);
　　return  0;
　　}
　　int32_t
　　init (xlator_t *this)
　　{
　　xlator_t *tgt_xl = NULL;
　　bypass_private_t *priv = NULL;
　　if (!this->children ||  this->children->next) {
　　gf_log (this->name, GF_LOG_ERROR,
　　"FATAL: bypass should have exactly one child");
　　return -1;
　　}
　　tgt_xl = this->children->xlator;
　　/* TBD: check for cluster/afr as well */
　　if (strcmp(tgt_xl->type,"cluster/replicate")) {
　　gf_log (this->name, GF_LOG_ERROR,
　　"%s must be loaded above cluster/replicate",
　　this->type);
　　return -1;
　　}
　　/* TBD: pass  target-translator name as an option (instead of first) */
　　tgt_xl = tgt_xl->children->xlator;

　　priv = GF_CALLOC (1, sizeof (*priv), gf_bypass_mt_priv_t);
　　if (!priv)
　　return  -1;
　　priv->target = tgt_xl;
　　this->private = priv;
　　gf_log (this->name, GF_LOG_DEBUG, "bypass xlator loaded");
　　return 0;
　　}
　　void
　　fini (xlator_t *this)
　　{
　　bypass_private_t *priv = this->private;
　　if  (!priv)
　　return;
　　this->private  = NULL;
　　GF_FREE (priv);
　　return;
　　}
　　struct xlator_fops fops = {
　　.readv =  bypass_readv,
　　.writev = bypass_writev,
　　.fstat =  bypass_fstat
　　};
　　struct xlator_cbks cbks = {
　　};
　　struct volume_options options[] = {
　　{ .key = {NULL} },
　　};
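　　The value that get_pending_dict attaches to each trusted.afr.* key is easier to follow in isolation. The listing allocates three int32_t counters (12 bytes) and sets the first one; for a 32-bit slot, htonl is the matching byte-order conversion. Here is a minimal standalone sketch of that encoding. It assumes, as in AFR of this era, that slot 0 is the data-pending counter, and pending_encode and pending_decode are hypothetical helper names for illustration, not GlusterFS APIs.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <stdint.h>

#define PENDING_SLOTS 3  /* the listing allocates three 32-bit counters */

/* Hypothetical helper: convert host-order counts to the on-disk layout,
 * i.e. each counter stored in network (big-endian) byte order. */
static void
pending_encode (uint32_t out[PENDING_SLOTS],
                const uint32_t counts[PENDING_SLOTS])
{
        int i;

        for (i = 0; i < PENDING_SLOTS; i++)
                out[i] = htonl (counts[i]);
}

/* Hypothetical helper: read one counter back in host order. */
static uint32_t
pending_decode (const uint32_t in[PENDING_SLOTS], int slot)
{
        return ntohl (in[slot]);
}
```

　　Encoding {1, 0, 0} this way yields twelve bytes in which only the fourth byte is non-zero, which is what a reader of the xattr would expect for "one pending data operation".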
　　bypass / Makefile
　　# Change these to match your source code.
　　TARGET = bypass.so
　　OBJECTS = bypass.o
　　# Change these to match your environment.
　　GLFS_SRC = /root/glusterfs_patches
　　GLFS_ROOT = /opt/glusterfs
　　GLFS_VERS = 3git
　　GLFS_LIB = `ls -d $(GLFS_ROOT)/$(GLFS_VERS)/lib*`
　　HOST_OS = GF_LINUX_HOST_OS
　　# You shouldn't need to change anything below here.
　　CFLAGS = -fPIC -Wall -O0 -g \
　　-DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64  -D_GNU_SOURCE -D$(HOST_OS) \
　　-I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src  \
　　-I$(GLFS_SRC)/contrib/uuid -I.
　　LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB)  -lglusterfs -lpthread
　　$(TARGET): $(OBJECTS)
　　$(CC) $(CFLAGS) $(OBJECTS)  $(LDFLAGS) -o $(TARGET)
　　install: $(TARGET)
　　cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features
　　clean:
　　rm -f $(TARGET) $(OBJECTS)
　　bypass / bytest-fuse.vol
　　volume bytest-posix-0
　　type storage/posix
　　option directory  /export/bytest1
　　end-volume
　　volume bytest-locks-0
　　type features/locks
　　subvolumes bytest-posix-0
　　end-volume
　　volume bytest-client-1
　　type protocol/client
　　option remote-host gfs1
　　option remote-subvolume  /export/bytest2
　　option transport-type tcp
　　end-volume
　　volume bytest-replicate-0
　　type cluster/replicate
　　subvolumes bytest-locks-0  bytest-client-1
　　end-volume
　　volume bytest-bypass
　　type features/bypass
　　subvolumes bytest-replicate-0
　　end-volume
　　volume bytest
　　type debug/io-stats
　　option latency-measurement off
　　option count-fop-hits off
　　subvolumes bytest-bypass
　　end-volume
　　bypass / README.md

　　This is a proof-of-concept translator for an […]
　　· If multiple clients try to write the same file with bypass turned on, you'll get massive split-brain problems. Solutions might include honoring AFR's quorum-enforcement rules, auto-issuing locks during open to prevent such concurrent access, or simply documenting the fact that users must do such locking themselves. The last might sound like a cop-out, but such locking is already common for the likely use case of serving virtual-machine images.
　　· We only intercept readv, writev, and fstat. There are many other calls that can trigger self-heal, including plain old lookup. The only way to prevent a lookup self-heal would be to put another translator below AFR to intercept xattr requests and pretend everything's OK. Ick. Remember, though, that this is only a proof of concept. If we really wanted to get serious about this, we could implement the same technique within AFR and do all the necessary coordination there.
　　· It would be nice if the AFR subvolume to use could be specified as an option (instead of just picking the first child), if bypass could be made selective, etc.
　　The coolest direction to go here would be to put information about writes we've seen onto a queue, with a separate process listening on that queue to perform asynchronous but nearly immediate self-heal on just those files. As long as the other consistency issues are handled properly, this might be a really easy way to get near-local performance for virtual-machine-image use cases without introducing consistency/recovery nightmares.
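　　The queue idea above could be sketched roughly as follows. This is a standalone illustration and not GlusterFS code: heal_queue_t, heal_queue_init, heal_queue_push, and heal_queue_pop are hypothetical names, a fixed-size in-memory ring buffer stands in for whatever transport a real implementation would use, and the capacity and name length are arbitrary.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define HEALQ_SLOTS   64    /* arbitrary capacity for this sketch */
#define HEALQ_NAMELEN 256   /* arbitrary maximum name length */

typedef struct {
        char            names[HEALQ_SLOTS][HEALQ_NAMELEN];
        int             head;      /* index of the oldest entry */
        int             count;     /* entries currently queued */
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
} heal_queue_t;

static void
heal_queue_init (heal_queue_t *q)
{
        memset (q, 0, sizeof (*q));
        pthread_mutex_init (&q->lock, NULL);
        pthread_cond_init (&q->nonempty, NULL);
}

/* Write path: record that a file was written and needs healing.
 * Returns 0 on success, -1 if the queue is full. */
static int
heal_queue_push (heal_queue_t *q, const char *name)
{
        int ret = -1;

        pthread_mutex_lock (&q->lock);
        if (q->count < HEALQ_SLOTS) {
                int tail = (q->head + q->count) % HEALQ_SLOTS;

                snprintf (q->names[tail], HEALQ_NAMELEN, "%s", name);
                q->count++;
                ret = 0;
                pthread_cond_signal (&q->nonempty);
        }
        pthread_mutex_unlock (&q->lock);
        return ret;
}

/* Healer side: block until an entry is available, then copy it out
 * in the order it was queued. */
static void
heal_queue_pop (heal_queue_t *q, char out[HEALQ_NAMELEN])
{
        pthread_mutex_lock (&q->lock);
        while (q->count == 0)
                pthread_cond_wait (&q->nonempty, &q->lock);
        memcpy (out, q->names[q->head], HEALQ_NAMELEN);
        q->head = (q->head + 1) % HEALQ_SLOTS;
        q->count--;
        pthread_mutex_unlock (&q->lock);
}
```

　　In this sketch the write path would call heal_queue_push once per written file, and a healer thread would loop on heal_queue_pop and trigger self-heal for each name it receives. A full queue is reported to the caller instead of blocking, so that writes are never stalled by a slow healer; what the caller does in that case is left open here.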
